Tablet transition handlers prepare two mutations -- one for tablets
table, that sets transition state, transition mode and few others; and
another one for topology table that "activates" the tablet_migration
state for topology coordinator.
The latter is common to all three handlers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Continuation of the previous patch -- there's a common sanity check of
tablet transition API handlers, namely that this tablet is not in
transition already.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are three transition calls -- move, add replica and del replica --
and all three work similarly. In a loop they try to get guard for raft
operation, then perform sanity checks on topology state, then prepare
mutations and then try to apply them to raft. After the loop finishes
all three wait for transition for the given tablet to complete.
This patch generalizes the raft kicking loop and the transition
completion waiting code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The code here looks like this
if src.host == dst.host
throw "Local migration not possible"
if src == dst
co_return;
The 2nd check is apparently never satisfied -- if src == dst this means
that src.host == dst.host and it should have thrown already
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It tries to call container().invoke_on_all() the hard way.
Calling it directly is not possible, because there's no
sharded::invoke_on_all() const overload
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18202
The cql-pytest framework allows running tests also against Cassandra,
but developers need to install Cassandra on their own because modern
distributions such as Fedora no longer carry a Cassandra package.
This patch adds clear and easy to follow (I think) instructions on how
to download a pre-compiled Cassadra, or alternatively how to download
and build Cassandra from source - and how either can be used with the
test/cql-pytest/run-cassandra script.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18138
For view builder draining there's dedicated deferred action in main while all other services that need to be drained do it via storage_service. The latter is to unify shutdown for services and to make `nodetool drain` drain everything, not just some part of those. This PR makes view builder drain look the same. As a side effect it also moves `mark_existing_views_as_built` from storage service to view builder and generalizes this marking code inside view builder itself.
refs: #2737
refs: #2795Closesscylladb/scylladb#16558
* github.com:scylladb/scylladb:
storage_service: Drain view builder on drain too
view_builder: Generalize mark_as_built(view_ptr) method
view_builder: Move mark_existing_views_as_built from storage service
storage_service: Add view_builder& reference
main,cql_test_env: Move view_builder start up (and make unconditional)
The partitions_bigger_than_threshold is incremented only if the previous
check detects that the partition exceeds a threshold by its size. It's
done with an extra if, but it can be done without (explicit) condition
as bool type is guaranteed by the standard to convert into integers as
true = 1 and false = 0
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18217
Currentl code counts the number of keys in it just to see if this number
is non-zero. Using .contains() method is better fit here
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18219
The sharded<database> is used as a invoke_in_all() method provider,
there's no real need in database itself. Simple smp::invoke_on_all()
would work just as good.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18221
Some tests want to ignore out_of_range exception in continuation and go
the longer route for that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18216
When constructing a vector with partition key data, the size of that
vector is known beforehand
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18239
There's a helper map-reducer that accepts a function to call on
commitlog. All callers accumulate statistics with it, so the commitlog
argument is const pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18238
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the 1 phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
then the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
The PR also adds a regression test.
Fixes: scylladb/scylladb#15393Fixes: scylladb/scylladb#15602Fixes: scylladb/scylladb#16668Fixes: scylladb/scylladb#16902Fixes: scylladb/scylladb#17493Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720Closesscylladb/scylladb#18184
* github.com:scylladb/scylladb:
test: reproducer for missing gossiper updates
gossiper: lock local endpoint when updating heart_beat
By default the suitename in the junit files generated by pytest
is named `pytest` for all suites instead of the suite, ex. `topology_experimental_raft`
With this change, the junit files will use the real suitename
This change doesn't affect the Test Report in Jenkins, but it
raised part of the other task of publishing the test results to
elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950
where we parse the XMLs and we need the correct suitename
Closesscylladb/scylladb#18172
When altering rf for a keyspace, all tablets in this ks may have less replicas. Part of this process is removing replicas from some node(s). This PR extends the tablets rebuild transition to handle this case by making pending_replica optional.
fixes: #18176Closesscylladb/scylladb#18203
* github.com:scylladb/scylladb:
test: Tune up tablet-transition test to check del_replica
api: Add method to delete replica from tablet
tablet: Make pending replica optional
For that the test case is modified to have 3 nodes and 2 replicas on
start. Existing test cases are changed slightly in the way "from" host
is detected.
Also, the final check for data presense is modified to check that hosts
in "replicas" have data and other hosts don't have it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Copied from the add_replica counterpart
TODO: Generalize common parts of move_tablet and add_|del_tablet_replica
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just like leaving replica could be optional when adding replica to
tablet, the pending replica can be optional too if we're removing a
replica from tablet
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixesscylladb/scylladb#18083
* seastar cd8a9133...f3058414 (18):
> src/core/metrics.cc: rewrite set_metric_family_configs
> include/seastar/core/metrics_api.hh: Revert d2929c2ade5bd0125a73d53280c82ae5da86218e
> sstring: include <fmt/format.h> instead of <fmt/ostream.h>
> seastar.cc: include used header
> tls: include used header of <unordered_set>
> docs: remove unused parameter from handle_connection function of echo-HTTP-server tutorial example
> stall-analyser: use 0 for the default value of --width
> http: Move parsed params and urls
> scripts: use raw string to avoid invalid escape sequences
> timed_out_error: add fmt::formatter for timed_out_error
> scripts/stall-analyser: change default branch-threshold to 3%
> scripts/stall-analyser: resolve string escape sequence warning
> io_queue: Use static vector for fair groups too
> io_queue: Use static vector to store fair queues
> stall-analyser: add space around '=' in param list
> stall-analyser: add a space between 'var: Type' in type annotation
> stall-analyser: move variables closer to where they are used
> memory: drop support for compilers that don't support aligned new
Closesscylladb/scylladb#18235
this header was previously brought in by seastar's sstring.hh. but
since sstring.hh does not include <fmt/ostream.h> anymore,
`gms/application_state.cc` does not have access to this header.
also, `gms/application_state.cc` should `#include` the used header
by itself.
so, in this change, let's include <fmt/ostream.h> in `gms/application_state.cc`.
this change addresses the FTBFS with the latest seastar.
the same applies to other places changed in this commit.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18193
This gets rid of dangling deferred drin on stop and makes nodetool drain
more "consistent" by stopping one more unneeded background activity
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Storage service will need to drain v.b. on its drain. Also on cluster
join it marks existing views as built while it's v.b.'s job to do it.
Both will be fixed by next patching and this is prerequisite.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Just starting sharded<view_builder> is lightweight, its constructor does
nothing but initializes on-board variables. Real work takes off on
view_builder::start() which is not moved.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The current test in boost/cql_query_large_test::test_large_data only checks whether notifications for large rows and cells are written into the system keyspace. It doesn't check this for partitions.
This change adds this check for partitions.
Closesscylladb/scylladb#18189
* github.com:scylladb/scylladb:
test/boost: added test for large row count warning
test/boost: add test for writing large partition notifications
Repair memory limit includes only the size of frozen mutation
fragments in repair row. The size of other members of repair
row may grow uncontrollably and cause out of memory.
Modify what's counted to repair memory limit.
Fixes: #16710.
Closesscylladb/scylladb#17785
* github.com:scylladb/scylladb:
test: add test for repair_row::size()
repair: fix memory accounting in repair_row
When altering rf for a keyspace, all tablets in this ks will get more replicas. Part of this process is rebuilding tablets' onto new node(s). This PR extends the tablets transition code to support rebuilding of tablet on new replica.
fixes: #18030Closesscylladb/scylladb#18082
* github.com:scylladb/scylladb:
test: Check data presense as well
test: Test how tablets are copied between nodes
test: Add sanity test for tablet migration
api: Add method to add replica to a tablet
tablet: Make leaving replica optional
The formatted_sstables_list is auxiliary class that collects a bunch of
sstables::to_string(shared_sstable)-generated strings. One of bad side
effects of this helper is that it allocates memory for the vector of
strings.
This patch achieves the same goal with the help of fmt::join() equipped
with transformed boost adaptor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18160
In testing, we've observed multiple cases where nodes would fail to
observe updated application states of other nodes in gossiper.
For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter
NORMAL state, propagating this information through gossiper. However,
other nodes would never observe that the node entered NORMAL state,
still thinking that it is in joining state. This would lead to further
bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for
schema versions to converge. Convergence would never be achieved and the
test eventually timed out. The node was observing outdated schema state
of some existing node in gossip.
I created a test that would bootstrap 3 nodes, then wait until they all
observe each other as NORMAL, with timeout. Unfortunately, thousands of
runs of this test on different machines failed to reproduce the problem.
After banging my head against the wall failing to reproduce, I decided
to sprinkle randomized sleeps across multiple places in gossiper code
and finally: the test started catching the problem in about 1 in 1000
runs.
With additional logging and additional head-banging, I determined
the root cause.
The following scenario can happen, 2 nodes are sufficient, let's call
them A and B:
- Node B calls `add_local_application_state` to update its gossiper
state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and
updates the copy:
```
auto local_state = *ep_state_before;
for (auto& p : states) {
auto& state = p.first;
auto& value = p.second;
value = versioned_value::clone_with_higher_version(value);
local_state.add_application_state(state, value);
}
```
`clone_with_higher_version` bumps `version` inside
gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first
phase it copies the updated `local_state` to all shards into a
separate map. In second phase the values from separate map are used to
overwrite the endpoint_state map used for gossiping.
Due to the cross-shard calls of the 1 phase, there is a yield before
the second phase. *During this yield* the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`.
This uses the monotonic version_generator, so it uses a higher version
then the ones we used for states added above. Let's call this new version
X. Note that X is larger than the versions used by application_states
added above.
- now node B handles a SYN or ACK message from node A, creating
an ACK or ACK2 message in response. This message contains:
- old application states (NOT including the update described above,
because `replicate` is still sleeping before phase 2),
- but bumped heart_beat == X from `gossiper::run()` loop,
and sends the message.
- node A receives the message and remembers that the max
version across all states (including heart_beat) of node B is X.
This means that it will no longer request or apply states from node B
with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites
endpoint_state with the ones it saved in phase 1. In particular it
reverts heart_beat back to smaller value, but the larger problem is that it
saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2
message to node A, node A will ignore them, because their versions are
smaller than X. Or node B will never send them, because whenever node
A requests states from node B, it only requests states with versions >
X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in
38c2347a3c, which introduced a yield in
`replicate`. Before that, the updated state would be saved atomically on
shard 0, there could be no `heart_beat` bump in-between making a copy of
the local state, updating it, and then saving it.
With the description above, it's easy to make a consistent
reproducer for the problem -- introduce a longer sleep in
`add_local_application_state` before second phase of replicate, to
increase the chance that gossiper loop will execute and bump heart_beat
version during the yield. Further commit adds a test based on that.
The fix is to bump the heart_beat under local endpoint lock, which is
also taken by `replicate`.
Fixes: scylladb/scylladb#15393Fixes: scylladb/scylladb#15602Fixes: scylladb/scylladb#16668Fixes: scylladb/scylladb#16902Fixes: scylladb/scylladb#17493Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
The current test in boost/cql_query_large_test::test_large_data only checks whether notifications for large rows and cells are written into the system keyspace. It doesn't check this for partitions.
This change adds this check for partitions.
Other than making sure that system.tablets is updated with correct
replica set, it's also good to check that the data is present on the
repsective nodes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In repair, only the size of frozen mutation fragments of repair row
is counted to the memory limit. So, huge keys of repair rows may
lead to OOM.
Include other repair_row's members' memory size in repair memory
limit.
A few months ago, in merge d3c1be9107,
we decided that if Scylla has the experimental "tablets" feature enabled,
new Alternator tables should use this feature by default - exactly like
this is the default for new CQL tables.
Sadly, it was now decided to reverse this decision: We do not yet trust
enough LWT on tablets, and since Alternator often (if not always) relies
on LWT, we want Alternator tables to continue to use vnodes - not tablets.
The fix is trivial - just changing the default. No test needed to change
because anyway, all Alternator tests work correctly on Scylla with the
tablets experimental feature disabled. I added a new test to enshrine
the fact that Alternator does not use tablets.
An unfortunate result of this patch will be that Alternator tables
created on versions with this patch (e.g., Scylla 6.0) will not use
tablets and will continue to not use tablets even if Scylla is upgraded
(currently, the use of tablets is decided at table creation time, and
there is no way to "upgrade" a vnode-based table to be tablet based).
This patch should be reverted as soon as LWT support matures on tablets.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#18157
Loader wants to print set of sstables' names. For that it collects names
into a dedicated vector, then prints it using fmt/ranges facility.
There's a way to achieve the same goal without allocating extra vector
with names -- use fmt::format() and pass it a range converting sstables
into their names.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#18159
Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectivelly.
This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables.
The logged data will look like this:
Large cells:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db
Large rows:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db
Large partitions:
WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db
Fixes#18041Closesscylladb/scylladb#18166
in seastar's b28342fa5a301de3facf5e83dc691524a6b20604, we switched
* `io_queue::_streams` from
`boost::container::small_vector<fair_queue, 2>` to
`boost::container::static_vector<fair_queue, 2>`
* `io_queue::_fgs` from
`std::vector<std::unique_ptr<fair_group>>` to
`boost::container::static_vector<fair_group, 2>`
so we need to update the gdb script accordingly to reflect this
change, and to avoid the nested try-except blocks, we switch to
a `while` statement to simplify the code structure.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18165