This test asserts that a read repair really happened. To ensure this
happens it writes a single partition after enabling the database_apply
error injection point. For some reason, the write is sometimes reordered
with the error injection and the write will get replicated to both nodes
and no read repair will happen, failing the test.
To make the test less sensitive to such rare reordering, add a
clustering column to the table and write 100 rows. The chance of *all*
100 of them being reordered with the error injection should be low
enough that it doesn't happen again (famous last words).
Fixes: #24330
Closes scylladb/scylladb#24403
(cherry picked from commit 495f607e73)
Closes scylladb/scylladb#24972
Only 2025.1 is susceptible. Merge has slightly different logic in
master; the test had to be adjusted for 2025.1 but is flaky.
It can happen that two successive merges cause the wait for a merge to
never finish.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Fixes scylladb/scylladb#24821
Closes scylladb/scylladb#24936
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. Also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.
Previously, the write response handler was not cancellable and had no timeout. This could cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.
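To illustrate the pattern (not the actual Seastar-based implementation), here is a minimal, self-contained C++ sketch of a waitable handler that can be completed, aborted, or timed out; all names here are hypothetical.
```
#include <chrono>
#include <condition_variable>
#include <mutex>

class write_response_waiter {
    std::mutex _m;
    std::condition_variable _cv;
    bool _done = false;     // responses arrived from all targets
    bool _aborted = false;  // shutdown requested or a target was marked down
public:
    void signal_done() { { std::lock_guard l(_m); _done = true; } _cv.notify_all(); }
    void abort()       { { std::lock_guard l(_m); _aborted = true; } _cv.notify_all(); }

    // Returns true only if the write completed; returns false on abort or
    // timeout, so the caller never blocks shutdown indefinitely.
    bool wait(std::chrono::milliseconds timeout) {
        std::unique_lock l(_m);
        _cv.wait_until(l, std::chrono::steady_clock::now() + timeout,
                       [&] { return _done || _aborted; });
        return _done;
    }
};
```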
backport to relevant versions since the issue can cause node shutdown to hang
Fixes scylladb/scylladb#24599
- (cherry picked from commit 8d48b27062)
- (cherry picked from commit fc5ba4a1ea)
- (cherry picked from commit 7150632cf2)
- (cherry picked from commit 74a3fa9671)
- (cherry picked from commit a9b476e057)
- (cherry picked from commit d7af26a437)
Parent PR: #24595
Closes scylladb/scylladb#24878
* github.com:scylladb/scylladb:
test: test_batchlog_manager: batchlog replay includes cdc
test: test_batchlog_manager: test batch replay when a node is down
batchlog_manager: set timeout on writes
batchlog_manager: abort writes on shutdown
batchlog_manager: create cancellable write response handler
storage_proxy: add write type parameter to mutate_internal
Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed
a preexisting fragility which I missed.
1) truncate gets RP mark X, truncated_at = second T
2) new sstable written during snapshot or later, also at second T (differing only in milliseconds)
3) discard_sstables() gets RP Y > saved RP X, since the creation time of the sstable
with RP Y is equal to truncated_at = second T.
So the problem is that truncate uses a clock of second granularity for
filtering out sstables written later, and after we take the low mark and truncate time,
it can happen that an sstable is flushed later within the same second, but at a
different millisecond.
By switching to a millisecond clock (db_clock), we allow sstables written later
within the same second to be filtered out. It's not perfect, but it's
extremely unlikely that a new write lands and gets flushed in the same
millisecond in which we recorded the truncated_at timepoint. In practice, truncate
will not be used concurrently with writes, so this should be enough for
our tests performing such concurrent actions.
We're moving away from gc_clock, which is our cheap lowres_clock, but
time is only retrieved when creating sstable objects, whose creation
frequency is low enough not to have significant consequences; also,
db_clock should be cheap enough since it's usually syscall-less.
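A small illustrative sketch of the granularity issue, using std::chrono time points rather than Scylla's gc_clock/db_clock; the timestamps are made up.
```
#include <cassert>
#include <chrono>

int main() {
    using namespace std::chrono;

    // truncated_at is recorded at t = 1.000s; a new sstable is flushed at
    // t = 1.417s, i.e. later, but within the same second.
    const auto truncated_at = milliseconds(1000);
    const auto sstable_created = milliseconds(1417);

    // At second granularity the two timestamps are indistinguishable, so the
    // later sstable cannot be told apart from data written before truncation.
    assert(duration_cast<seconds>(sstable_created) == duration_cast<seconds>(truncated_at));

    // At millisecond granularity the later sstable is correctly seen as newer.
    assert(sstable_created > truncated_at);
}
```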
Fixes #23771.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#24426
(cherry picked from commit 2d716f3ffe)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#24875
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.
This is done in order to verify that it works currently as expected and
doesn't break in the future.
(cherry picked from commit d7af26a437)
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.
The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.
It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
(cherry picked from commit a9b476e057)
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mismatch results in parsing failures, as the parser interprets part of the row content as a key, producing a garbage key and subsequently mis-parsing the row content and possibly even subsequent partitions.
Introduce a new system table, `system.corrupt_data`, and infrastructure similar to `large_data_handler`: a `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing, and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
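As a rough illustration of that abstraction (a sketch with hypothetical names and signatures, not the actual corrupt_data_handler API), the writer hands off rows it cannot serialize safely instead of writing them:
```
#include <cstddef>
#include <string>
#include <vector>

struct corrupt_row {
    std::string table;
    std::vector<std::byte> serialized_row;  // frozen row content, kept verbatim
};

class corrupt_data_handler {
public:
    virtual ~corrupt_data_handler() = default;
    // Called by the writer when it encounters a row it cannot serialize safely
    // (e.g. an empty clustering key on a non-compact table).
    virtual void record(corrupt_row row) = 0;
};

class system_table_corrupt_data_handler : public corrupt_data_handler {
public:
    void record(corrupt_row row) override {
        // Sketch: persist the row into a system table (system.corrupt_data in
        // the description above) so it can be inspected and possibly recovered.
    }
};
```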
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
- (cherry picked from commit 92b5fe8983)
- (cherry picked from commit 0753643606)
- (cherry picked from commit b0d5462440)
- (cherry picked from commit 093d4f8d69)
- (cherry picked from commit 678deece88)
- (cherry picked from commit 64f8500367)
- (cherry picked from commit b931145a26)
- (cherry picked from commit 3e1c50e9a7)
- (cherry picked from commit 46ff7f9c12)
- (cherry picked from commit ebd9420687)
- (cherry picked from commit aae212a87c)
- (cherry picked from commit 592ca789e2)
- (cherry picked from commit edc2906892)
Parent PR: #24492
Closes scylladb/scylladb#24740
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
* create a table with random schema
* generate data: random mutations + one row with bad key
* write data to sstable
* check that only good data is written to sstable
* check that the bad data was saved to system.corrupt_data
(cherry picked from commit edc2906892)
Similar to how large_data_handler is handled, propagate through
sstables::sstables_manager and store its owner: replica::database.
Tests and tools are also patched. Mostly mechanical changes, updating
constructors and patching callers.
(cherry picked from commit ebd9420687)
Make sure the keys are full prefixes, as is expected for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to the places where mutations are built: when building or deserializing mutations.
Fixes: https://github.com/scylladb/scylladb/issues/24506
Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters.
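A minimal sketch of the invariant being enforced, in simplified standard C++ (the real checks live where mutations are built and deserialized; the function name is hypothetical):
```
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

void check_row_key_is_full(const std::vector<std::string>& key_components,
                           std::size_t clustering_column_count,
                           bool is_compact_table) {
    if (is_compact_table) {
        return;  // compact tables legitimately allow non-full (even empty) prefixes
    }
    if (key_components.size() != clustering_column_count) {
        throw std::runtime_error("row clustering key is not a full prefix");
    }
}
```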
- (cherry picked from commit 8b756ea837)
- (cherry picked from commit ab96c703ff)
Parent PR: #24497
Closes scylladb/scylladb#24739
* github.com:scylladb/scylladb:
mutation: check key of inserted rows
compound: optimize is_full() for single-component types
The exponent of a big decimal string is parsed as an int32, adjusted for
the removed fractional part, and stored as an int32. When parsing values
like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale
is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32
limit, and since the scale is stored as an int32, it overflows and wraps
around, losing the value.
This patch fixes that by parsing the exponent as an int64 value and
then adjusting it for the fractional part. The adjusted scale is then
checked to see if it still fits within int32 limits before storing; an
exception is thrown if it does not.
Note that strings with exponents that exceed the int32 range, like
`0.01E2147483650`, were previously not parseable as a big decimal. They
are now accepted if the final adjusted scale fits within int32 limits.
For the above value, unscaled_value = 1 and scale = -2147483648, so it
is now accepted. This is in line with how Java's `BigDecimal` parses
strings.
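A simplified sketch of the described approach (not the actual Scylla parser): the exponent is parsed as a 64-bit value, adjusted by the number of fractional digits, and only then checked against int32 limits.
```
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

int32_t compute_scale(const std::string& exponent_str, int64_t fractional_digits) {
    int64_t exponent = std::stoll(exponent_str);     // e.g. -2147483647
    int64_t scale = -exponent + fractional_digits;   // fractional digits minus exponent
    if (scale < std::numeric_limits<int32_t>::min() ||
        scale > std::numeric_limits<int32_t>::max()) {
        throw std::invalid_argument("adjusted scale does not fit in int32");
    }
    return static_cast<int32_t>(scale);
}

// For "1.23E-2147483647": exponent = -2147483647, fractional_digits = 2,
// scale = 2147483649 -> rejected instead of silently wrapping around.
// For "0.01E2147483650": exponent = 2147483650, fractional_digits = 2,
// scale = -2147483648 -> fits in int32 and is now accepted.
```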
Fixes: #24581
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#24640
(cherry picked from commit 279253ffd0)
Closes scylladb/scylladb#24691
Consider the following scenario:
1) let's assume tablet 0 has range [1, 5] (pre merge)
2) tablet merge happens, tablet 0 has now range [1, 10]
3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5]
4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time
5) replica service is asked to consume range [1, 10] of tablet 0 (post merge)
We have two possible outcomes:
With cache bypass:
1) cache reader is bypassed
2) sstable reader is created on range [1, 10]
3) unrefreshed tablet_sstable_set holds stale state, but correctly selects all sstables intersecting with range [1, 10]
With cache:
1) cache reader is created
2) finds partition with token 5 is cached
3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0)
4) incremental selector consumes the pre-merge sstable spanning range [1, 5]
4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached
4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed.
So with the set refreshed, the sstable set is aligned with tablet ranges and no premature EOS is signalled, which would otherwise prevent the fast forward from happening and keep the read from capturing all the data.
This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets.
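A tiny sketch of the changed condition (hypothetical names): the refresh now also fires when the table's tablet count changed, e.g. after a merge, not only when this replica has incoming tablets.
```
#include <cstddef>

bool needs_mutation_source_refresh(std::size_t old_tablet_count,
                                   std::size_t new_tablet_count,
                                   bool has_incoming_tablets) {
    // Previously only has_incoming_tablets triggered the refresh; a merge that
    // shrinks the tablet count left the tablet_sstable_set stale.
    return has_incoming_tablets || old_tablet_count != new_tablet_count;
}
```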
Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution.
Fixes: https://github.com/scylladb/scylladb/issues/23313
This change needs to be backported to all supported versions which implement tablet merge.
- (cherry picked from commit d0329ca370)
- (cherry picked from commit 1f9f724441)
- (cherry picked from commit 53df911145)
Parent PR: #24287
Closes scylladb/scylladb#24338
* github.com:scylladb/scylladb:
replica: Fix range reads spanning sibling tablets
test: add reproducer and test for mutation source refresh after merge
tablets: trigger mutation source refresh on tablet count change
Currently the test indiscriminately injects failures into the flushes of
any table, via the IO extension mechanism. The tests want to check that
the node correctly handles the IO error by self-isolating; however, the
indiscriminate IO errors can have unintended consequences when they hit
raft, leading to disorderly shutdown and failure of the tests. Testing
raft's resiliency to IO errors is of course worth doing, but it is not
the goal of this particular test, so to avoid the fallout, the IO errors
are limited to the test tables only.
Fixes: https://github.com/scylladb/scylladb/issues/24637
Closes scylladb/scylladb#24638
(cherry picked from commit ee6d7c6ad9)
Closes scylladb/scylladb#24741
We don't guarantee that coordinators will emit range reads that
span only one tablet.
Consider this scenario:
1) split is about to be finalized, barrier is executed, completes.
2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet)
3) split is committed to group0, all replicas switch storage.
4) replica-side read is executed, uses a range which spans tablets.
We could fix it with two-phase split execution. Rather than pushing the
complexity to higher levels, let's fix the incremental selector, which should
be able to serve all the tokens owned by a given shard. During split
execution, neither of the sibling tablets is going anywhere, since the split
runs with the state machine locked, so a single read spanning both
sibling tablets works as long as the selector works across tablet
boundaries.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 53df911145)
This change adds a reproducer and test for the fix where the local mutation
source is not always refreshed after a tablet merge.
(cherry picked from commit 1f9f724441)
Make sure the keys are full prefixes, as is expected for rows.
On several occasions we have seen empty row keys make their
way into the sstables, despite the fact that they are not allowed by
the CQL frontend. This means that such empty keys are possibly results
of memory corruption or use-after-{free,copy} errors. The source of the
corruption is impossible to pinpoint when the empty key is discovered in
the sstable. So this patch adds checks for such keys to the places where
mutations are built: when building or deserializing mutations.
The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to
work with the changes: it has to make the schema use compact storage,
otherwise the non-full keys used by this test are rejected by the
new checks.
Fixes: https://github.com/scylladb/scylladb/issues/24506
(cherry picked from commit ab96c703ff)
In the present scenario, the bootstrapping node undergoes the synchronize phase after
initialization of group0, then enters the post_raft phase and becomes fully ready for
group0 operations. The topology coordinator is agnostic of this and issues the stream
ranges command as soon as the node successfully completes `join_group0`. Although for
a node booting into an already upgraded cluster the time the node spends in the
synchronize phase is negligible, this race condition causes trouble in a
small percentage of cases, since the stream ranges operation fails and the node fails to bootstrap.
This commit addresses the issue and updates the error-throwing logic to account for this
edge case, letting the node wait (with timeouts) for the synchronize phase to finish instead of
throwing an error.
A regression test is also added to confirm that this code change works. The test adds a
wait in the synchronize phase for the newly joining node and releases it only after the program counter
reaches the synchronize case in the `start_operation` function. This shows that with the
updated code, start_operation waits for the node to finish the
synchronize phase instead of throwing an error.
This PR fixes a bug, hence we need to backport it.
Fixes: scylladb/scylladb#23536
Closes scylladb/scylladb#23829
(cherry picked from commit 5ff693eff6)
Closes scylladb/scylladb#24627
In f96d30c2b5
we introduced the maintenance service, which is an additional
instance of auth::service. But this service has a somewhat
confusing 2-level startup mechanism: it's initialized with
sharded<Service>::start and then auth::service::start
(a different method with the same name, to confuse things even more).
When maintenance_socket was disabled (default setting), the code
did only the first part of the startup. This registered a config
observer but didn't create a permission_cache instance.
As a result, a crash could occur on SIGHUP when the config is reloaded.
Fixes: https://github.com/scylladb/scylladb/issues/24528
Backport: all non-EOL versions since 6.0 and 2025.1
- (cherry picked from commit 97c60b8153)
- (cherry picked from commit dd01852341)
Parent PR: #24527
Closes scylladb/scylladb#24569
* github.com:scylladb/scylladb:
test: add test for live updates of permissions cache config
main: don't start maintenance auth service if not enabled
test_repair_task_progress checks the progress of the children of the root
repair task. However, nothing ensures that the children have
already been created.
Wait until at least one child of the root repair task is created.
Fixes: #24556.
Closes scylladb/scylladb#24560
(cherry picked from commit 0deb9209a0)
Closes scylladb/scylladb#24654
`dirty_memory_manager` tracks two quantities about memtable memory usage:
"real" and "unspooled" memory usage.
"real" is the total memory usage (sum of `occupancy().total_space()`)
by all memtable LSA regions, plus an upper-bound estimate of the size of
memtable data which has already moved to the cache region but isn't
evictable (merged into the cache) yet.
"unspooled" is the difference between total memory usage by all memtable
LSA regions, and the total flushed memory (sum of `_flushed_memory`)
of memtables.
`dirty_memory_manager` controls the shares of compaction and/or blocks
writes when these quantities cross various thresholds.
"Total flushed memory" isn't a well defined notion,
since the actual consumption of memory by the same data can vary over
time due to LSA compactions, and even the data present in memtable can
change over the course of the flush due to removals of outdated MVCC versions.
So `_flushed_memory` is merely an approximation computed by `flush_reader`
based on the data passing through it.
This approximation is supposed to be a conservative lower bound.
In particular, `_flushed_memory` should not be greater than
`occupancy().total_space()`. Otherwise, for example, "unspooled" memory
could become negative (and/or wrap around) and weird things could happen.
There is an assertion in `~flush_memory_accounter` which checks that
`_flushed_memory < occupancy().total_space()` at the end of flush.
But it can fail. Without additional treatment, the memtable reader sometimes emits
data which is already deleted. (In particular, it emits rows covered by
a partition tombstone in a newer MVCC version.)
This data is seen by `flush_reader` and accounted in `_flushed_memory`.
But this data can be garbage-collected by the `mutation_cleaner` later during the
flush and decrease `total_memory` below `_flushed_memory`.
There is a piece of code in `mutation_cleaner` intended to prevent that.
If `total_memory` decreases during a `mutation_cleaner` run,
`_flushed_memory` is lowered by the same amount, just to preserve the
asserted property. (This could also make `_flushed_memory` quite inaccurate,
but that's considered acceptable).
But that only works if `total_memory` is decreased during that run. It doesn't
work if the `total_memory` decrease (enabled by the new allocator holes made
by `mutation_cleaner`'s garbage collection work) happens asynchronously
(due to memory reclaim for whatever reason) after the run.
This patch fixes that by tracking the decreases of `total_memory` closer to the
source. Instead of relying on `mutation_cleaner` to notify the memtable if it
lowers `total_memory`, the memtable itself listens for notifications about
LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's
estimate of flushed memory decreased by the change in `total_memory` since the
beginning of flush (if it was positive), and it keeps the amount of "spooled"
memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`.
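As a rough arithmetic sketch of this accounting (hypothetical names, not the actual memtable code):
```
#include <algorithm>
#include <cstdint>

struct flush_accounting {
    int64_t total_at_flush_start = 0;  // region total_space() when the flush began
    int64_t reader_estimate = 0;       // bytes accounted by the flush reader so far

    // _flushed_memory: the reader's estimate, reduced by any decrease of the
    // region's total memory observed since the flush started.
    int64_t flushed_memory(int64_t current_total) const {
        int64_t decrease = std::max<int64_t>(0, total_at_flush_start - current_total);
        return reader_estimate - decrease;
    }

    // The "spooled" amount reported to the dirty_memory_manager is clamped at
    // zero, so the derived "unspooled" quantity cannot go negative or wrap.
    int64_t spooled(int64_t current_total) const {
        return std::max<int64_t>(0, flushed_memory(current_total));
    }
};
```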
Fixes scylladb/scylladb#21413
Backport candidate because it fixes a crash that can happen in existing stable branches.
- (cherry picked from commit 7d551f99be)
- (cherry picked from commit 975e7e405a)
Parent PR: #21638
Closes scylladb/scylladb#24602
* github.com:scylladb/scylladb:
memtable: ensure _flushed_memory doesn't grow above total memory usage
replica/memtable: move region_listener handlers from dirty_memory_manager to memtable
In this PR, we're adjusting most of the cluster tests so that they pass
with the `rf_rack_valid_keyspaces` configuration option enabled. In most
cases, the changes are straightforward and require little to no additional
insight into what the tests are doing or verifying. In some, however, doing
that does require a deeper understanding of the tests we're modifying.
The justification for those changes and their correctness is included in
the commit messages corresponding to them.
Note that this PR does not cover all of the cluster tests. There are a few
remaining ones, but they require a bit more effort, so we delegate that
work to a separate PR.
I tested all of the modified tests locally with `rf_rack_valid_keyspaces`
set to true, and they all passed.
Fixes scylladb/scylladb#23959
Backport: we want to backport these changes to 2025.1 since that's the version we introduced RF-rack-valid keyspaces in. Although the tests are not, by default, run with `rf_rack_valid_keyspaces` enabled yet, that will most likely change in the near future, and we'll want to backport those changes too. The reason for this is that we want to verify that Scylla works correctly even with that constraint.
- (cherry picked from commit dbb8835fdf)
- (cherry picked from commit 9281bff0e3)
- (cherry picked from commit 5b83304b38)
- (cherry picked from commit 73b22d4f6b)
- (cherry picked from commit 2882b7e48a)
- (cherry picked from commit 4c46551c6b)
- (cherry picked from commit 92f7d5bf10)
- (cherry picked from commit 5d1bb8ebc5)
- (cherry picked from commit d3c0cd6d9d)
- (cherry picked from commit 04567c28a3)
- (cherry picked from commit c8c28dae92)
- (cherry picked from commit c4b32c38a3)
- (cherry picked from commit ee96f8dcfc)
Parent PR: #23661
Closes scylladb/scylladb#24120
* github.com:scylladb/scylladb:
test/{topology,topology_custom,object_store}/suite.yaml: Enable rf_rack_valid_keyspaces in suites
test/topology, test/topology_custom: Disable rf_rack_valid_keyspaces in problematic tests
test/topology_custom/test_tablets: Divide rack into two to adjust tests to RF-rack-validity
test/topology_custom/test_tablets: Adjust test_tablet_rf_change to RF-rack-validity
test/topology_custom/test_tablet_repair_scheduler.py: Adjust to RF-rack-validity
test/pylib/repair.py: Assign nodes to multiple racks in create_table_insert_data_for_repair
test/topology_custom/test_zero_token_nodes_topology_ops: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_no_replication.py: Adjust to RF-rack-validity
test/topology_custom/test_zero_token_nodes_multidc.py: Adjust to RF-rack-validity
test/topology_custom/test_not_enough_token_owners.py: Adjust to RF-rack-validity
test/topology_custom/test_multidc.py: Adjust to RF-rack-validity
test/object_store/test_backup.py: Adjust to RF-rack-validity
test/topology, test/topology_custom: Adjust simple tests to RF-rack-validity
The memtable wants to listen for changes in its `total_memory` in order
to decrease its `_flushed_memory` in case some of the freed memory has already
been accounted as flushed. (This can happen because the flush reader sees
and accounts even outdated MVCC versions, which can be deleted and freed
during the flush).
Today, the memtable doesn't listen to those changes directly. Instead,
some calls which can affect `total_memory` (in particular, the mutation cleaner)
manually check the value of `total_memory` before and after they run, and they
pass the difference to the memtable.
But that's not good enough, because `total_memory` can also change outside
of those manually-checked calls -- for example, during LSA compaction, which
can occur anytime. This makes memtable's accounting inaccurate and can lead
to unexpected states.
But we already have an interface for listening to `total_memory` changes
actively, and `dirty_memory_manager`, which also needs to know it,
does just that. So what happens e.g. when `mutation_cleaner` runs
is that `mutation_cleaner` checks the value of `total_memory` before it runs,
then it runs, causing several changes to `total_memory` which are picked up
by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of
`total_memory` and passes the difference to `memtable`, which corrects
whatever was observed by `dirty_memory_manager`.
To allow memtable to modify its `_flushed_memory` correctly, we need
to make `memtable` itself a `region_listener`. Also, instead of
the situation where `dirty_memory_manager` receives `total_memory`
change notifications from `logalloc` directly, and `memtable` fixes
the manager's state later, we want only the memtable to listen
for the notifications, and pass them, already modified accordingly,
to the manager, so there are no intermediate wrong states.
This patch moves the `region_listener` callbacks from the
`dirty_memory_manager` to the `memtable`. It's not intended to be
a functional change, just a source code refactoring.
The next patch will be a functional change enabled by this.
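A hypothetical sketch of the shape of this refactoring: the memtable implements the listener callbacks, corrects its own accounting first, and only then forwards the change to the manager, so the manager never sees an intermediate, inconsistent state. Names and signatures here are illustrative only.
```
#include <cstdint>

struct region_listener_iface {
    virtual ~region_listener_iface() = default;
    virtual void increase_usage(int64_t delta) = 0;
    virtual void decrease_usage(int64_t delta) = 0;
};

class manager_stub {
public:
    void account(int64_t delta) { /* update thresholds, compaction shares, ... */ }
};

class memtable_listener : public region_listener_iface {
    manager_stub& _manager;
    int64_t _flushed_memory = 0;
public:
    explicit memtable_listener(manager_stub& m) : _manager(m) {}
    void increase_usage(int64_t delta) override { _manager.account(delta); }
    void decrease_usage(int64_t delta) override {
        // First correct our own flushed-memory estimate for the freed bytes...
        _flushed_memory -= delta;
        // ...then pass the (already corrected) change on to the manager.
        _manager.account(-delta);
    }
};
```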
(cherry picked from commit 7d551f99be)
Almost all of the tests have been adjusted so they can be run with
the `rf_rack_valid_keyspaces` configuration option enabled, while
the rest, a minority, create nodes with it disabled. Thanks to that,
we can enable it by default, so let's do that.
(cherry picked from commit ee96f8dcfc)
Some of the tests in the test suite have proven to be more problematic
in adjusting to RF-rack-validity. Since we'd like to run as many tests
as possible with the `rf_rack_valid_keyspaces` configuration option
enabled, let's disable it in those. In the following commit, we'll enable
it by default.
(cherry picked from commit c4b32c38a3)
Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
(cherry picked from commit c8c28dae92)
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
(cherry picked from commit 04567c28a3)
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.
The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.
(cherry picked from commit d3c0cd6d9d)
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is that we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.
(cherry picked from commit 5d1bb8ebc5)
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.
(cherry picked from commit 92f7d5bf10)
We simply assign the nodes used in the test to separate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.
(cherry picked from commit 4c46551c6b)
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.
Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.
(cherry picked from commit 2882b7e48a)
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely as possible to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.
(cherry picked from commit 73b22d4f6b)
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.
(cherry picked from commit 5b83304b38)
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.
Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.
(cherry picked from commit 9281bff0e3)
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:
* Using `pytest.mark.prepare_3_racks_cluster` instead of
`pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
`ManagerClient::servers_add()`.
In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
(cherry picked from commit dbb8835fdf)
The contract in mutation_reader.hh says:
```
// pr needs to be valid until the reader is destroyed or fast_forward_to()
// is called again.
future<> fast_forward_to(const dht::partition_range& pr) {
```
`test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates
this by passing a temporary to `fast_forward_to`.
Fix that.
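With simplified stand-in types, the contract violation and the fix look roughly like this (illustrative only, not the real mutation_reader types):
```
struct partition_range {};

struct reader {
    const partition_range* _pr = nullptr;
    // Per the contract above, the reader keeps referring to pr until it is
    // destroyed or fast_forward_to() is called again.
    void fast_forward_to(const partition_range& pr) { _pr = &pr; }
};

void wrong() {
    reader r;
    r.fast_forward_to(partition_range{});  // temporary dies at end of statement;
    // r._pr now dangles, which is what the test was doing.
}

void fixed() {
    partition_range pr;  // kept alive by the caller for as long as the reader needs it
    reader r;
    r.fast_forward_to(pr);
    // ... use r ...
}
```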
Fixes scylladb/scylladb#24542
Closes scylladb/scylladb#24543
(cherry picked from commit 27f66fb110)
Closes scylladb/scylladb#24547
When a tablet is migrated and cleaned up, deallocate the tablet storage
group state on `end_migration` stage, instead of `cleanup` stage:
* When the stage is updated from `cleanup` to `end_migration`, the
storage group is removed on the leaving replica.
* When the table is initialized, if the tablet stage is `end_migration`
then we don't allocate a storage group for it. This happens for
example if the leaving replica is restarted during tablet migration.
If it's initialized in `cleanup` stage then we allocate a storage
group, and it will be deallocated when transitioning to
`end_migration`.
This guarantees that the storage group is always deallocated on the
leaving replica by `end_migration`, and that it is always allocated if
the tablet wasn't cleaned up fully yet.
It is a similar case also for the pending replica when the migration is
aborted. We deallocate the state on `revert_migration` which is the
stage following `cleanup_target`.
Previously the storage group would be allocated when the tablet is
initialized on any of the tablet replicas - also on the leaving replica,
and when the tablet stage is `cleanup` or `end_migration`, and
deallocated during `cleanup`.
This fixes the following issue:
1. A migrating tablet enters the cleanup stage.
2. The tablet is cleaned up successfully.
3. The leaving replica is restarted and allocates a storage group.
4. Tablet cleanup is not called because it's already cleaned up.
5. The storage group remains allocated on the leaving replica after the
migration is completed - it's not cleaned up properly.
Fixes https://github.com/scylladb/scylladb/issues/23481
backport to all relevant releases since it's a bug that results in a crash
- (cherry picked from commit 34f15ca871)
- (cherry picked from commit fb18fc0505)
- (cherry picked from commit bd88ca92c8)
Parent PR: #24393
Closes scylladb/scylladb#24487
* github.com:scylladb/scylladb:
test/cluster/test_tablets: test restart during tablet cleanup
test: tablets: add get_tablet_info helper
tablets: deallocate storage state on end_migration
Add a test that reproduces issue scylladb/scylladb#23481.
The test migrates a tablet from one node to another, and while the
tablet is in some stage of cleanup - either before or right after,
depending on the parameter - the leaving replica, on which the tablet is
cleaned, is restarted.
This is interesting because when the leaving replica starts and loads
its state, the tablet could be in different stages of cleanup - the
SSTables may still exist or they may have been cleaned up already, and
we want to make sure the state is loaded correctly.
(cherry picked from commit bd88ca92c8)
By default, cluster tests have skip_wait_for_gossip_to_settle=0 and
ring_delay_ms=0. In tests with gossip topology, this may lead to a race
where nodes see different states of each other.
In case of test_auth_v2_migration, there are three nodes. If the first
node already knows that the third node is NORMAL, and the second node
does not, the system_auth tables can return incomplete results.
To avoid such a race, this commit adds a check that all nodes see other
nodes as NORMAL before any writes are done.
Refs: #24163
Closes scylladb/scylladb#24185
(cherry picked from commit 555d897a15)
Closes scylladb/scylladb#24519
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. It turns out that node restart + trace-level logging + debug mode is too much, and even with a 1-minute timeout, the read repair sometimes times out. Refactor the test to use an injection point instead of a restart. To make sure the test still tests what it is supposed to test, use tracing to assert that a read repair did indeed happen.
Fixes: scylladb/scylladb#23968
Needs backport to 2025.1 and 6.2, both have the flaky test
- (cherry picked from commit 51025de755)
- (cherry picked from commit 29eedaa0e5)
Parent PR: #23989
Closes scylladb/scylladb#24049
* github.com:scylladb/scylladb:
test/cluster/test_read_repair.py: improve trace logging test (again)
test/cluster: extract execute_with_tracing() into pylib/util.py
In ScyllaDB, schema modification operations use "optimistic locking": A schema operation reads the current schema, decides what it wants to do and prepares changes to the schema, and then attempts to commit those changes - but only if the schema hasn't changed since the first read. If the schema has already been changed by some other node - we need to try again. In a loop.
In Alternator, there are six operations that perform schema modification: CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and UpdateTimeToLive. All of them were missing this loop. We knew about this - and even had a FIXME in all those places. So all these operations, when facing contention from concurrent schema modifications on different nodes, may fail with an error like:
Internal server error: service::group0_concurrent_modification
(Failed to apply group 0 change due to concurrent modification).
This problem had a very minor effect, if any, on real users because the DynamoDB SDK automatically retries operations that fail with retryable errors - like this "Internal server error" - and most likely the schema operation will succeed upon retry. However, as shown in issue #13152, these failures were annoying in our CI, where tests - which disable request retries - failed on these errors.
This patch fixes all six operations (the last three operations all use one common function, db::modify_tags(), so are fixed by one change) to add the missing loop.
The patch also includes reproducing tests for all these operations - the new tests all fail before this patch, and pass with it.
These new tests are much more reliable reproducers than the dtests we had, which only sometimes - very rarely - reproduced the problem. Moreover, the new tests reproduce the bug separately for each of the six operations, so if we had forgotten to fix one of the six operations, one of the tests would have continued to fail. Of course I checked this during development.
The new tests are in the test/cluster framework, not test/alternator, because this problem can only be reproduced in a multi-node cluster: On a single node, it serializes its schema modifications on its own; The collisions only happen when more than one node attempts schema modifications at the same time.
Fixes #13152
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23827
(cherry picked from commit 3ce7e250cc)
Closes scylladb/scylladb#24494
* github.com:scylladb/scylladb:
alternator: fix indentation
alternator: fix schema "concurrent modification" errors
This PR adjusts existing Boost tests so they respect the invariant
introduced by enabling `rf_rack_valid_keyspaces` configuration option.
We disable it explicitly in more problematic tests. After that, we
enable the option by default in the whole test suite.
Fixes scylladb/scylladb#23958
Backport: backporting to 2025.1 to be able to test the implementation there too.
- (cherry picked from commit 6e2fb79152)
- (cherry picked from commit e4e3b9c3a1)
- (cherry picked from commit 1199c68bac)
- (cherry picked from commit cd615c3ef7)
- (cherry picked from commit fa62f68a57)
- (cherry picked from commit 22d6c7e702)
- (cherry picked from commit 237638f4d3)
- (cherry picked from commit c60035cbf6)
Parent PR: #23802
Closes scylladb/scylladb#24367
* github.com:scylladb/scylladb:
test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default
test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests
test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load
test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity
test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity
There are two semaphores in `table` for synchronizing changes to the sstable list:
sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.
sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.
They're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.
Problem:
A = tablet cleanup
B = take_snapshot()
1) A acquires sstable_set_mutation_sem for updating list
2) A acquires sstable_deletion_sem, then delete sstable before updating list
3) A releases sstable_deletion_sem, then yield
4) B acquires sstable_deletion_sem
5) B iterates through the list and bumps into the sstable deleted in step 2
6) B fails since it cannot find the file on disk
The initial reaction is to say that no procedure must delete an sstable before
updating the list; that's true.
But we want an iteration, running concurrently with cleanup, to not find sstables
being removed from the system. Otherwise, e.g. snapshot works with sstables
of a tablet that was just cleaned up. That's achieved by serializing iteration
with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to the deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.
That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.
The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find that discard_sstables() was doing deletion without any permit,
meaning another race could happen between truncate and snapshot.
So we're fixing the race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible that other, unknown races are fixed as well.
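A simplified sketch of the resulting shape (hypothetical names; a plain mutex stands in for the merged semaphore): the builder holds the permit for its whole lifetime, so the on-disk deletion and the in-memory list update are atomic with respect to concurrent iteration such as snapshots.
```
#include <mutex>
#include <string>
#include <vector>

class table_state {
    std::mutex _sstable_list_lock;  // merged "list mutation + deletion" guard
    std::vector<std::string> _sstables;
public:
    class list_builder {
        table_state& _t;
        std::unique_lock<std::mutex> _permit;  // held for the builder's whole lifetime
    public:
        explicit list_builder(table_state& t) : _t(t), _permit(t._sstable_list_lock) {}
        // Both the on-disk deletion and the in-memory list update happen under
        // the same permit, so concurrent iteration (e.g. a snapshot) can never
        // observe a deleted file that is still in the list.
        void remove(const std::string& name) {
            // (the actual on-disk deletion of the file would happen here)
            std::erase(_t._sstables, name);
        }
    };
    list_builder make_builder() { return list_builder(*this); }
};
```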
Fixes #23049.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#23117
(cherry picked from commit fedd838b9d)
Closes scylladb/scylladb#23279
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.
In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had a FIXME in all those places. So all these operations,
when facing contention from concurrent schema modifications on different
nodes, may fail with an error like:
Internal server error: service::group0_concurrent_modification
(Failed to apply group 0 change due to concurrent modification).
This problem had a very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.
This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.
The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.
These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduce the bug separately for each of the
six operations, so if we had forgotten to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.
The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.
Fixes #13152
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23827
(cherry picked from commit 3ce7e250cc)
We've adjusted all of the Boost tests so they respect the invariant
enforced by the `rf_rack_valid_keyspaces` configuration option, or
explicitly disabled the option in those that turned out to be more
problematic and will require more attention. Thanks to that, we can
now enable it by default in the test suite.
(cherry picked from commit c60035cbf6)
Some of the tests in the file verify more subtle parts of the behavior
of tablets and rely on topology layouts or keyspaces that violate
the invariant the `rf_rack_valid_keyspaces` configuration option is
trying to enforce. Because of that, we explicitly disable the option
to be able to enable it by default in the rest of the test suite in
the following commit.
(cherry picked from commit 237638f4d3)