Make sure the keys are full prefixes as it is expected to be the case for rows. At severeal occasions we have seen empty row keys make their ways into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
- (cherry picked from commit 8b756ea837)
- (cherry picked from commit ab96c703ff)
Parent PR: #24497Closesscylladb/scylladb#24739
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.
This patch fixes that the by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it is still within int32 limits before storing. An
exception is thrown if it is not within the int32 limits.
Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
Fixes: #24581
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#24640
(cherry picked from commit 279253ffd0)
Closesscylladb/scylladb#24691
Consider the following scenario:
1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)
We have two possible outcomes:
With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10]
With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read.
This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.
Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.
Fixes: https://github.com/scylladb/scylladb/issues/23313
This change needs to be backported to all supported versions which implement tablet merge.
- (cherry picked from commit d0329ca370)
- (cherry picked from commit 1f9f724441)
- (cherry picked from commit 53df911145)
Parent PR: #24287Closesscylladb/scylladb#24338
* github.com:scylladb/scylladb:
replica: Fix range reads spanning sibling tablets
test: add reproducer and test for mutation source refresh after merge
tablets: trigger mutation source refresh on tablet count change
After paxos state is repaired in begin_and_repair_paxos we need to
re-check the state regardless if write back succeeded or not. This
is how the code worked originally but it was unintentionally changed
when co-routinized in 61b2e41a23.
Fixes#24630Closesscylladb/scylladb#24651
(cherry picked from commit 5f953eb092)
Closesscylladb/scylladb#24701
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self isolating, however the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors if of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637Closesscylladb/scylladb#24638
(cherry picked from commit ee6d7c6ad9)
Closesscylladb/scylladb#24741
We don't guarantee that coordinators will only emit range reads that
span only one tablet.
Consider this scenario:
1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.
We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix incremental selector which should
be able to serve all the tokens owned by a given shard. During split
execution, either of sibling tablets aren't going anywhere since it
runs with state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.
(cherry picked from commit 1f9f724441)
Make sure the keys are full prefixes as it is expected to be the case
for rows. At severeal occasions we have seen empty row keys make their
ways into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to places where
mutations are built: when building or unserializing mutations.
The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full changes used by this tests are rejected by the
new checks.
Fixes: https://github.com/scylladb/scylladb/issues/24506
(cherry picked from commit ab96c703ff)
In the present scenario, the bootstrapping node undergoes synchronize phase after
initialization of group0, then enters post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues stream
ranges command as soon as the node successfully completes `join_group0`. Although for
a node booting into an already upgraded cluster, the time duration for which, node
remains in synchronize phase is negligible but this race condition causes trouble in a
small percentage of cases, since the stream ranges operation fails and node fails to bootstrap.
This commit addresses this issue and updates the error throw logic to account for this
edge case and lets the node wait (with timeouts) for synchronize phase to get over instead of throwing
error.
A regression test is also added to confirm the working of this code change. The test adds a
wait in synchronize phase for newly joining node and releases only after the program counter
reaches the synchronize case in the `start_operation` function. Hence it indicates that in the
updated code, the start_operation will wait for the node to get done with the
synchronize phase instead of throwing error.
This PR fixes a bug. Hence we need to backport it.
Fixes: scylladb/scylladb#23536Closesscylladb/scylladb#23829
(cherry picked from commit 5ff693eff6)
Closesscylladb/scylladb#24627
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).
When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all not eol versions since 6.0 and 2025.1
- (cherry picked from commit 97c60b8153)
- (cherry picked from commit dd01852341)
Parent PR: #24527Closesscylladb/scylladb#24569
* github.com:scylladb/scylladb:
test: add test for live updates of permissions cache config
main: don't start maintenance auth service if not enabled
test_repair_task_progress checks the progress of children of root
repair task. However, nothing ensures that the children are
already created.
Wait until at least one child of a root repair task is created.
Fixes: #24556.
Closesscylladb/scylladb#24560
(cherry picked from commit 0deb9209a0)
Closesscylladb/scylladb#24654
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(different method with the same name to confuse even more).
When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash on SIGHUP when config is reloaded can occur.
(cherry picked from commit 97c60b8153)
Token metadata api is initialized before gossiper is started.
get_host_id_map REST endpoint cannot function without the fully
initialized gossiper though. The gossiper is started deep in
the join_cluster call chain, but if we move token_metadata api
initialization after the call it means that no api will be available
during bootstrap. This is not what we want.
Make a simple fix by returning an error from the api if the gossiper is
not initialized yet.
Fixes: #24479Closesscylladb/scylladb#24575
(cherry picked from commit e364995e28)
Closesscylladb/scylladb#24584
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
Fixes scylladb/scylladb#21413
Backport candidate because it fixes a crash that can happen in existing stable branches.
- (cherry picked from commit 7d551f99be)
- (cherry picked from commit 975e7e405a)
Parent PR: #21638Closesscylladb/scylladb#24602
* github.com:scylladb/scylladb:
memtable: ensure _flushed_memory doesn't grow above total memory usage
replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.
Note that this PR does not cover all of the cluster tests. There are few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.
I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.
Fixes scylladb/scylladb#23959
Backport: we want to backport these changes to 2025.1 since that's the version where we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future and we'll also want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.
- (cherry picked from commit dbb8835fdf)
- (cherry picked from commit 9281bff0e3)
- (cherry picked from commit 5b83304b38)
- (cherry picked from commit 73b22d4f6b)
- (cherry picked from commit 2882b7e48a)
- (cherry picked from commit 4c46551c6b)
- (cherry picked from commit 92f7d5bf10)
- (cherry picked from commit 5d1bb8ebc5)
- (cherry picked from commit d3c0cd6d9d)
- (cherry picked from commit 04567c28a3)
- (cherry picked from commit c8c28dae92)
- (cherry picked from commit c4b32c38a3)
- (cherry picked from commit ee96f8dcfc)
Parent PR: #23661Closesscylladb/scylladb#24120
* github.com:scylladb/scylladb:
test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites
test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests
test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity
test/topology_custom/test_multidc.py: Adjust to RF-rack-validity
test/object_store/test_backup.py: Adjust to RF-rack-validity
test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity
dirty_memory_manager tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus a upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
dirty_memory_manager controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should be not greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in ~flush_memory_accounter which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emites rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the mutation_cleaner later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in mutation_cleaner intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
(cherry picked from commit 975e7e405a)
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).
Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.
But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.
But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.
To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want to only the memtable listen
for the notifications, and pass them already modified accordingl
to the manager, so there is no intermediate wrong states.
This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
(cherry picked from commit 7d551f99be)
Almost all of the tests have been adjusted to be able to be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.
(cherry picked from commit ee96f8dcfc)
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
(cherry picked from commit c4b32c38a3)
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
(cherry picked from commit c8c28dae92)
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
(cherry picked from commit 04567c28a3)
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.
The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.
(cherry picked from commit d3c0cd6d9d)
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.
(cherry picked from commit 5d1bb8ebc5)
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.
(cherry picked from commit 92f7d5bf10)
We simply assign the nodes used in the test to seprate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.
(cherry picked from commit 4c46551c6b)
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.
Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.
(cherry picked from commit 2882b7e48a)
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.
(cherry picked from commit 73b22d4f6b)
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.
(cherry picked from commit 5b83304b38)
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.
Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.
(cherry picked from commit 9281bff0e3)
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:
* Using `pytest.mark.prepare_3_racks_cluster` instead of
`pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
`ManagerClient::servers_add()`.
In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
(cherry picked from commit dbb8835fdf)
The contract in mutation_reader.hh says:
```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
future<> fast_forward_to(const dht::partition_range& pr) {
```
`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.
Fix that.
Fixesscylladb/scylladb#24542Closesscylladb/scylladb#24543
(cherry picked from commit 27f66fb110)
Closesscylladb/scylladb#24547
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it's already cleaned up
5. the storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixes https://github.com/scylladb/scylladb/issues/23481
backport to all relevant releases since it's a bug that results in a crash
- (cherry picked from commit 34f15ca871)
- (cherry picked from commit fb18fc0505)
- (cherry picked from commit bd88ca92c8)
Parent PR: #24393Closesscylladb/scylladb#24487
* github.com:scylladb/scylladb:
test/cluster/test_tablets: test restart during tablet cleanup
test: tablets: add get_tablet_info helper
tablets: deallocate storage state on end_migration
Add a test that reproduces issue scylladb/scylladb#23481.
The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.
This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.
(cherry picked from commit bd88ca92c8)
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters cleanup stage
2. the tablet is cleaned up successfuly
3. The leaving replica is restarted, and allocates storage group
4. tablet cleanup is not called because it was already cleaned up
4. the storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixesscylladb/scylladb#23481
(cherry picked from commit 34f15ca871)
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, it may lead to a race,
where nodes see different state of each other.
In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.
To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.
Refs: #24163Closesscylladb/scylladb#24185
(cherry picked from commit 555d897a15)
Closesscylladb/scylladb#24519
Consider the following scenario:
- let's assume tablet 0 has range [1, 5] (pre merge)
- tablet merge happens, tablet 0 has now range [1, 10]
- tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet
0 still has range [1, 5]
- during a full scan, forward service will intersect the full range with
tablet ranges and consume one tablet at a time
- replica service is asked to consume range [1, 10] of tablet 0 (post merge)
We have two possible outcomes:
With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but select correctly
all sstables intersecting with range [1, 10]
With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to
range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable,
EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, sstable set is aligned with tablet ranges, and no
premature EOS is signalled, otherwise preventing fast forward to from
happening and all data from being properly captured in the read.
This change fixes the bug and triggeres a mutation source refresh whenever
the number of tablets for the table has changed, not only when we have
incoming tablets.
Fixes: #23313
(cherry picked from commit d0329ca370)
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it supposed to test, use tracing to assert that read repair did indeed happen.
Fixes: scylladb/scylladb#23968
Needs backport to 2025.1 and 6.2, both have the flaky test
- (cherry picked from commit 51025de755)
- (cherry picked from commit 29eedaa0e5)
Parent PR: #23989Closesscylladb/scylladb#24049
* github.com:scylladb/scylladb:
test/cluster/test_read_repair.py: improve trace logging test (again)
test/cluster: extract execute_with_tracing() into pylib/util.py