Three tests in the file use a multi-DC cluster. Unfortunately, they put
all of the nodes in a DC in the same rack and because of that, they fail
when run with the `rf_rack_valid_keyspaces` configuration option enabled.
Since the tests revolve mostly around zero-token nodes and how they
affect replication in a keyspace, this change should have zero impact on
them.
(cherry picked from commit c8c28dae92)
We reduce the number of nodes and the RF values used in the test
to make sure that the test can be run with the `rf_rack_valid_keyspaces`
configuration option. The test doesn't seem to be reliant on the
exact number of nodes, so the reduction should not make any difference.
(cherry picked from commit 04567c28a3)
The change boils down to matching the number of created racks to the number
of created nodes in each DC in the auxiliary function `prepare_multi_dc_repair`.
This way, we ensure that the created keyspace will be RF-rack-valid and so
we can run the test file even with the `rf_rack_valid_keyspaces` configuration
option enabled.
The change has no impact on the tests that use the function; the distribution
of nodes across racks does not affect how repair is performed or what the
tests do and verify. Because of that, the change is correct.
(cherry picked from commit d3c0cd6d9d)
We assign the newly created nodes to multiple racks. If RF <= 3,
we create as many racks as the provided RF. We disallow the case
of RF > 3 to avoid trying to create an RF-rack-invalid keyspace;
note that no existing test calls `create_table_insert_data_for_repair`
providing a higher RF. The rationale for doing this is we want to ensure
that the tests calling the function can be run with the
`rf_rack_valid_keyspaces` configuration option enabled.
(cherry picked from commit 5d1bb8ebc5)
We assign the nodes to the same DC, but multiple racks to ensure that
the created keyspace is RF-rack-valid and we can run the test with
the `rf_rack_valid_keyspaces` configuration option enabled. The changes
do not affect what the test does and verifies.
(cherry picked from commit 92f7d5bf10)
We simply assign the nodes used in the test to seprate racks to
ensure that the created keyspace is RF-rack-valid to be able
to run the test with the `rf_rack_valid_keyspaces` configuration
option set to true. The change does not affect what the test
does and verifies -- it only depends on the type of nodes,
whether they are normal token owners or not -- and so the changes
are correct in that sense.
(cherry picked from commit 4c46551c6b)
We parameterize the test so it's run with and without enforced
RF-rack-valid keyspaces. In the test itself, we introduce a branch
to make sure that we won't run into a situation where we're
attempting to create an RF-rack-invalid keyspace.
Since the `rf_rack_valid_keyspaces` option is not commonly used yet
and because its semantics will most likely change in the future, we
decide to parameterize the test rather than try to get rid of some
of the test cases that are problematic with the option enabled.
(cherry picked from commit 2882b7e48a)
We simply assign DC/rack properties to every node used in the test.
We put all of them in the same DC to make sure that the cluster behaves
as closely to how it would before these changes. However, we distribute
them over multiple racks to ensure that the keyspace used in the test
is RF-rack-valid, so we can also run it with the `rf_rack_valid_keyspaces`
configuration option set to true. The distribution of nodes between racks
has no effect on what the test does and verifies, so the changes are
correct in that sense.
(cherry picked from commit 73b22d4f6b)
Instead of putting all of the nodes in a DC in the same rack
in `test_putget_2dc_with_rf`, we assign them to different racks.
The distribution of nodes in racks is orthogonal to what the test
is doing and verifying, so the change is correct in that sense.
At the same time, it ensures that the test never violates the
invariant of RF-rack-valid keyspaces, so we can also run it
with `rf_rack_valid_keyspaces` set to true.
(cherry picked from commit 5b83304b38)
We modify the parameters of `test_restore_with_streaming_scopes`
so that it now represents a pair of values: topology layout and
the value `rf_rack_valid_keyspaces` should be set to.
Two of the already existing parameters violate RF-rack-validity
and so the test would fail when run with `rf_rack_valid_keyspaces: true`.
However, since the option isn't commonly used yet and since the
semantics of RF-rack-valid keyspaces will most likely change in
the future, let's keep those cases and just run them with the
option disabled. This way, we still test everything we can
without running into undesired failures that don't indicate anything.
(cherry picked from commit 9281bff0e3)
We adjust all of the simple cases of cluster tests so they work
with `rf_rack_valid_keyspaces: true`. It boils down to assigning
nodes to multiple racks. For most of the changes, we do that by:
* Using `pytest.mark.prepare_3_racks_cluster` instead of
`pytest.mark.prepare_3_nodes_cluster`.
* Using an additional argument -- `auto_rack_dc` -- when calling
`ManagerClient::servers_add()`.
In some cases, we need to assign the racks manually, which may be
less obvious, but in every such situation, the tests didn't rely
on that assignment, so that doesn't affect them or what they verify.
(cherry picked from commit dbb8835fdf)
Used host id to check if the update is for the node itself. Using IP is unreliable since if a node is restarted with different IP a gossiper message with previous IP can be misinterpreted as belonging to a different node.
Fixes: #22777
Backport to 2025.1 since this fixes a crash. Older version do not have the code.
- (cherry picked from commit a2178b7c31)
- (cherry picked from commit ecd14753c0)
- (cherry picked from commit 7403de241c)
Parent PR: #24000Closesscylladb/scylladb#24089
* https://github.com/scylladb/scylladb:
test: add reproducer for #22777
storage_service: Do not remove gossiper entry on address change
storage_service: use id to check for local node
Add sleep before starting gossiper to increase a chance of getting old
gossiper entry about yourself before updating local gossiper info with
new IP address.
(cherry picked from commit 7403de241c)
When gossiper indexed entries by ip an old entry had to be removed on an
address change, but the index is id based, so even if ip was change the
entry should stay. Gossiper simply updates an ip address there.
(cherry picked from commit ecd14753c0)
The test checks that merging the partition versions on-the-fly using the
cursor gives the same results as merging them destructively with apply_monotonically.
In particular, it tests that the continuity of both results is equal.
However, there's a subtlety which makes this not true.
The cursor puts empty dummy rows (i.e. dummies shadowed by the partition
tombstone) in the output.
But the destructive merge is allowed (as an expection to the general
rule, for optimization reasons), to remove those dummies and thus reduce
the continuity.
So after this patch we instead check that the output of the cursor
has continuity equal to the merged continuities of version.
(Rather than to the continuity of merged versions, which can be
smaller as described above).
Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did
the same in a different test.
Fixes https://github.com/scylladb/scylladb/issues/13642Closesscylladb/scylladb#24044
(cherry picked from commit 746ec1d4e4)
Closesscylladb/scylladb#24083
The test test_read_repair_with_trace_logging wants to test read repair with trace logging. Turns out that node restart + trace-level logging + debug mode is too much and even with 1 minute timeout, the read repair times out sometimes. Refactor the test to use injection point instead of restart. To make sure the test still tests what it supposed to test, use tracing to assert that read repair did indeed happen.
Fixes: scylladb/scylladb#23968
Needs backport to 2025.1 and 6.2, both have the flaky test
- (cherry picked from commit 51025de755)
- (cherry picked from commit 29eedaa0e5)
Parent PR: #23989Closesscylladb/scylladb#24051
* github.com:scylladb/scylladb:
test/cluster/test_read_repair.py: improve trace logging test (again)
test/cluster: extract execute_with_tracing() into pylib/util.py
Currently, stream_session::prepare throws when a table in requests
or summaries is dropped. However, we do not want to fail streaming
if the table is dropped.
Delete table checks from stream_session::prepare. Further streaming
steps can handle the dropped table and finish the streaming successfully.
Fixes: #15257.
Closesscylladb/scylladb#23915
(cherry picked from commit 20c2d6210e)
Closesscylladb/scylladb#24053
The test test_read_repair_with_trace_logging wants to test read repair
with trace logging. Turns out that node restart + trace-level logging
+ debug mode is too much and even with 1 minute timeout, the read repair
times out sometimes.
Refactor the test to use injection point instead of restart. To make
sure the test still tests what it supposed to test, use tracing to
assert that read repair did indeed happen.
(cherry picked from commit 29eedaa0e5)
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable accross tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.
A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionaly to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.
Fixes: #23762Closesscylladb/scylladb#23984
This PR contains changes that do not add new functionality, and have small refactoring of the existing code.
The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.x. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial`
This is part two of splitting the original PR.
This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1¬ifications_query=reason%3Aparticipating#pullrequestreview-2778582278.
Closesscylladb/scylladb#23882
* github.com:scylladb/scylladb:
test.py: add awareness of extra_scylla_cmdline_options
test.py: increase timeout for C++ tests in pytest
test.py: switch method of finding the root repo directory
test.py: move get_combined_tests to the correct facade
test.py: add common directory for reports
test.py: add the possibility to provide additional env vars
test.py: move setup cgroups to the generic method
test.py: refactor resource_gather.py
Much easier to avoid sstable collisions. Makes it possible to scrub
multiple sstables, with multiple calls to scylla-sstable, reusing the
same output directory. Previously, each new call to scylla-sstable
scrub, would start from generation 0, guaranteeing collision.
Remove the unit test for generation clash -- with UUID generations, this
is no longer possible to reproduce in practice.
Refs: #21387Closesscylladb/scylladb#23990
Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack.
Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election.
To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal.
This change ensures a more stable voter distribution and reduces unnecessary voter reassignments.
The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations.
Fixes: scylladb/scylladb#23950Fixes: scylladb/scylladb#23588Fixes: scylladb/scylladb#23786
No backport: The limited voters feature is currently only present in master.
Closesscylladb/scylladb#23888
* https://github.com/scylladb/scylladb:
raft: ensure topology coordinator retains votership
raft: retain existing voters across data centers and racks
raft: refactor limited voters calculator to prioritize nodes
raft: replace pointer with reference for non-null output parameter
raft: reduce code duplication in group0 voter handler
raft: unify and optimize datacenter and rack info creation
This series adds support for WCU tracking in batch_write_item and tests it.
The patches include:
Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users.
Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.
Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.
The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.
Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.
Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.
**Need backport, WCU is partially supported, and is missing from batch_write_item**
Fixes#23940Closesscylladb/scylladb#23941
* github.com:scylladb/scylladb:
alternator/test_metrics.py: batch_write validate WCU
alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
alternator/executor: add WCU for batch_write_items
alternator/consumed_capacity: make wcu get_units public
Alternator: Change the WCU/RCU to use units
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail
* duplicate check of drops == 0 (just cosmetic)
Fix all three problems, the second is especially important because it made the test flaky.
Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them.
Fixes: #16794
Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there
Closesscylladb/scylladb#23783
* github.com:scylladb/scylladb:
test/boost/multishard_mutation_query_test: ensure test runs with vnodes
test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.
This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.
Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.
Fixes: scylladb/scylladb#23588Fixes: scylladb/scylladb#23786
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing
voters in each data center and rack.
This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.
Fixes: scylladb/scylladb#23950
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.
The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.
Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.
This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
All tests in this suite use the default "ks" keyspace from cql_test_env.
This keyspace has tablet support and at any time we might decide to make
it use tablets by default. This would make all these tests use the
tablet path in multishard_mutation_query.cc. These tests were created to
test the vastly more complex vnodes code path in said file. The tablet
path is much simpler and it is only used by SELECT * FROM
MUTATION_FRAGMENTS() and which has its own correctness tests.
So explicitely create a vnodes keyspace and use it in all the tests to
restore the test functionality.
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from
2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario,
this throws off per-page increment checks, when the previous scenario
moved these metrics and they don't start from 0; this causes the test
to sometimes fail
* duplicate check of drops == 0 (just cosmetic)
Fix all three problems, the second is especially important because it
made the test flaky.
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.
Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.
Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.
This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.
Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.
Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting
Fixes#23801
---
this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport.
Closesscylladb/scylladb#23808
* github.com:scylladb/scylladb:
sstables_loader: fix the indent
sstables_loader: fix the racing between get_progress() and release_resources()
The .cache and .cargo directories are used during pip and rust builds
when preparing the toolchain, but aren't useful afterwards. Remove them
to save a bit of space.
Closesscylladb/scylladb#23955
Running as root enables nested containers under podman without
trouble from uid remapping. Unlike docker, under podman uid 0 in
the container is remapped to the host uid for bind mounts, so writes
to the build directory do not end up owned by root on the host.
Nested containers will allow us to consume opensearch, cassandra-stress,
and minio as containers rather than embedding them into the frozen
toolchain.
Closesscylladb/scylladb#23954
This patch adds a test that verifies the WCU metrics are updated
correctly during a batch_write_item operation.
It ensures that the WCU of each item is calculated independently.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds two tests:
A test that validates WCU calculation for batch put requests on a single table.
A test that validates WCU calculation for batch requests across multiple
tables, including ensuring that delete operations are counted as 1 WCU.
Both tests verify that the consumed capacity is reported correctly
according to the WCU rules.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds consumed capacity unit support to batch_write_item.
It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.
The WCU metrics are always updated. if the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.
For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
This patch adds a public static get_units function to
wcu_consumed_capacity_counter. It will be used by the batch write item
implementation, which cannot use the wcu_consumed_capacity_counter
directly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
consume_capacity need merge
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:
Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicates that it
counts half units.
Change the WCU to count in whole units instead of half units.
Update the tests accordingly.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.
We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.
Closesscylladb/scylladb#23975
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.
A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.
Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).
There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.
Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.
It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.
Fixes#23634.
We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.
Closesscylladb/scylladb#23806
* github.com:scylladb/scylladb:
test: Verify partitioned set store split and unsplit correctly
sstables: Fix quadratic space complexity in partitioned_sstable_set
compaction: Wire table_state into make_sstable_set()
compaction: Introduce token_range() to table_state
dht: Add overlap_ratio() for token range
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.
The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.
This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.
Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.
Closesscylladb/scylladb#23957
This dependency is already there, topology coordinator doesn't need
to use database reference to get to the features.
Previous patch of the same kind: b79137eaa4
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23777
It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet.
Closesscylladb/scylladb#23815
* github.com:scylladb/scylladb:
test: Remove sstable_assertions::get_stats_metadata()
test: Add sstable_assertions::operator->()