Commit Graph

47669 Commits

Author SHA1 Message Date
Botond Dénes
3c3f6ca233 tools/scylla-sstable: scrub: use UUID sstable identifiers
Much easier to avoid sstable collisions. Makes it possible to scrub
multiple sstables, with multiple calls to scylla-sstable, reusing the
same output directory. Previously, each new call to scylla-sstable
scrub, would start from generation 0, guaranteeing collision.

Remove the unit test for generation clash -- with UUID generations, this
is no longer possible to reproduce in practice.

Refs: #21387

Closes scylladb/scylladb#23990
2025-05-06 15:09:53 +03:00
Patryk Jędrzejczak
7f843e0a5c Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky
Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack.

Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election.
To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal.

This change ensures a more stable voter distribution and reduces unnecessary voter reassignments.

The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations.

Fixes: scylladb/scylladb#23950
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786

No backport: The limited voters feature is currently only present in master.

Closes scylladb/scylladb#23888

* https://github.com/scylladb/scylladb:
  raft: ensure topology coordinator retains votership
  raft: retain existing voters across data centers and racks
  raft: refactor limited voters calculator to prioritize nodes
  raft: replace pointer with reference for non-null output parameter
  raft: reduce code duplication in group0 voter handler
  raft: unify and optimize datacenter and rack info creation
2025-05-06 13:49:55 +02:00
Nadav Har'El
252c5b5c9d Merge 'Alternator batch_write_item wcu' from Amnon Heiman
This series adds support for WCU tracking in batch_write_item and tests it.

The patches include:

Switch the metrics (RCU and WCU) to count units vs half-units as they were, to make the metrics clearer for users.

Adding a public static get_half_units function to wcu_consumed_capacity_counter for use by batch write item, which cannot directly use the counter object.

Adding WCU calculation support to batch_write_item, based on item size for puts and a fixed 1 WCU for deletes. WCU metrics are updated, and consumed capacity is returned per table when requested.

The return handling was refactored to be coroutine-like for easier management of the consumed capacity array.

Adding tests that validate WCU calculation for batch put requests on a single table and across multiple tables, ensuring delete operations are counted correctly.

Adding a test that validates that WCU metrics are updated correctly during batch write item operations, ensuring the WCU of each item is calculated independently.

**Need backport, WCU is partially supported, and is missing from batch_write_item**

Fixes #23940

Closes scylladb/scylladb#23941

* github.com:scylladb/scylladb:
  alternator/test_metrics.py: batch_write validate WCU
  alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
  alternator/executor: add WCU for batch_write_items
  alternator/consumed_capacity: make wcu get_units public
  Alternator: Change the WCU/RCU to use units
2025-05-06 13:31:53 +03:00
Avi Kivity
fc2204cea0 Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it made the test flaky.
Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them.

Fixes: #16794

Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there

Closes scylladb/scylladb#23783

* github.com:scylladb/scylladb:
  test/boost/multishard_mutation_query_test: ensure test runs with vnodes
  test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
2025-05-05 20:49:03 +03:00
Emil Maskovsky
24dfd2034b raft: ensure topology coordinator retains votership
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.

This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.

Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.

Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
2025-05-05 16:58:34 +02:00
Emil Maskovsky
2ae59e8a87 raft: retain existing voters across data centers and racks
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.

Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.

Improved the prioritization logic to account for the number of existing
voters in each data center and rack.

This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.

Fixes: scylladb/scylladb#23950
2025-05-05 16:51:48 +02:00
Emil Maskovsky
018fb63305 raft: refactor limited voters calculator to prioritize nodes
Refactor the limited voters calculator to use a priority queue for
sorting nodes by their priorities. This change simplifies the voter
selection logic and makes it more extensible for future enhancements,
such as supporting more complex priority calculations.

The priority value is determined based on the node's existing status,
including whether it is alive, a voter, or any further criteria.
2025-05-05 16:36:17 +02:00
Emil Maskovsky
26fdc7b8f8 raft: replace pointer with reference for non-null output parameter
The output parameter cannot be `null`. Previously, a pointer was used to
make it explicit that the parameter is an output parameter being
modified. However, this is unnecessary, as references are more
appropriate for parameters that cannot be `null`.

Switching to a reference improves code readability and ensures the
parameter's non-null constraint is enforced at the type level.
2025-05-05 16:12:00 +02:00
Emil Maskovsky
f0468860a3 raft: reduce code duplication in group0 voter handler
Refactor the group0 voter handler by introducing a helper lambda to
handle the common logic for adding a node. This eliminates unnecessary
code duplication.

This refactor does not introduce any functional changes but prepares
the codebase for easier future modifications.
2025-05-05 16:09:53 +02:00
Botond Dénes
855411caad test/boost/multishard_mutation_query_test: ensure test runs with vnodes
All tests in this suite use the default "ks" keyspace from cql_test_env.
This keyspace has tablet support and at any time we might decide to make
it use tablets by default. This would make all these tests use the
tablet path in multishard_mutation_query.cc. These tests were created to
test the vastly more complex vnodes code path in said file. The tablet
path is much simpler and it is only used by SELECT * FROM
MUTATION_FRAGMENTS() and which has its own correctness tests.
So explicitely create a vnodes keyspace and use it in all the tests to
restore the test functionality.
2025-05-05 09:22:54 -04:00
Botond Dénes
1175e1ed49 test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
This test has multiple problems:
* has 3 embedded loops to run different scenarios, ignores variable from
  2 of these, running with hardcoded settings instead
* initializes misses and lookups to 0 at the start of each scenario,
  this throws off per-page increment checks, when the previous scenario
  moved these metrics and they don't start from 0; this causes the test
  to sometimes fail
* duplicate check of drops == 0 (just cosmetic)

Fix all three problems, the second is especially important because it
made the test flaky.
2025-05-05 09:22:53 -04:00
Emil Maskovsky
2ef654149f raft: unify and optimize datacenter and rack info creation
Refactor the code to use a consistent pattern for creating the
datacenter info list and the rack info list.

Both now use a map of vectors, which improves efficiency by reducing
temporary conversions to maps/sets during node list processing.

Also ensure the node descriptor is passed by reference instead of by
copy, leveraging the guaranteed lifetime of the descriptors.
2025-05-05 15:15:17 +02:00
Pavel Emelyanov
cf1ffd6086 Merge 'sstables_loader: fix the racing between get_progress() and release_resources()' from Kefu Chai
This change addresses a critical race condition in the sstables_loader where `get_progress()` could access invalid `progress_holder` instances after `release_resources()` destroyed them.

Problem:
- Progress tracking uses two components: `_progress_state` (tracks state) and `_progress_per_shard` (sharded service with actual progress data)
- `get_progress()` first checks if `_progress_state` is initialized, then accumulates progress from `_progress_per_shard`
- As both functions are coroutines, `get_progress()` could be preempted after state check but before accessing `_progress_per_shard`
- If `release_resources()` runs during this preemption, it destroys the `progress_holder` instances in `_progress_per_shard`, causing `get_progress()` to access invalid memory.

Solution:
- Implemented shared/exclusive locking to protect access to both state and sharded progress data
- Multiple `get_progress()` calls can execute in parallel (shared access)
- `release_resources()` acquires exclusive access before modifying resources
- This prevents potential memory corruption and ensures consistent progress reporting

Fixes #23801

---

this change addresses a racing related to tracking the restore progress from S3 using scylla's native API, which is not used in production yet, hence no need to backport.

Closes scylladb/scylladb#23808

* github.com:scylladb/scylladb:
  sstables_loader: fix the indent
  sstables_loader: fix the racing between get_progress() and release_resources()
2025-05-05 15:45:15 +03:00
Avi Kivity
e688e89430 tools: toolchain: clear .cache and .cargo directories
The .cache and .cargo directories are used during pip and rust builds
when preparing the toolchain, but aren't useful afterwards. Remove them
to save a bit of space.

Closes scylladb/scylladb#23955
2025-05-05 14:43:14 +03:00
Avi Kivity
4c1f4c419c tools: toolchain: dbuild: run as root in container under podman
Running as root enables nested containers under podman without
trouble from uid remapping. Unlike docker, under podman uid 0 in
the container is remapped to the host uid for bind mounts, so writes
to the build directory do not end up owned by root on the host.

Nested containers will allow us to consume opensearch, cassandra-stress,
and minio as containers rather than embedding them into the frozen
toolchain.

Closes scylladb/scylladb#23954
2025-05-05 14:40:43 +03:00
Amnon Heiman
2ab99d7a07 alternator/test_metrics.py: batch_write validate WCU
This patch adds a test that verifies the WCU metrics are updated
correctly during a batch_write_item operation.
It ensures that the WCU of each item is calculated independently.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:24 +03:00
Amnon Heiman
14570f1bb5 alternator/test_returnconsumedcapacity.py: Add tests for batch write WCU
This patch adds two tests:
A test that validates WCU calculation for batch put requests on a single table.

A test that validates WCU calculation for batch requests across multiple
tables, including ensuring that delete operations are counted as 1 WCU.

Both tests verify that the consumed capacity is reported correctly
according to the WCU rules.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:23 +03:00
Amnon Heiman
68db77643f alternator/executor: add WCU for batch_write_items
This patch adds consumed capacity unit support to batch_write_item.

It calculates the WCU based on an item's length (for put) or a static 1
WCU (for delete), for each item on each table.

The WCU metrics are always updated. if the user requests consumed
capacity, a vector of consumed capacity is returned with an entry for
each of the tables.

For code simplicity, the return part of batch_write_item was updated to
be coroutine-like; this makes it easier to manage the life cycle of the
returned consumed_capacity array.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:20:14 +03:00
Amnon Heiman
f2ade71f4f alternator/consumed_capacity: make wcu get_units public
This patch adds a public static get_units function to
wcu_consumed_capacity_counter.  It will be used by the batch write item
implementation, which cannot use the wcu_consumed_capacity_counter
directly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

consume_capacity need merge
2025-05-05 13:19:04 +03:00
Amnon Heiman
5ae11746fa Alternator: Change the WCU/RCU to use units
This patch changes the RCU/WCU Alternator metrics to use whole units
instead of half units. The change includes the following:

Change the metrics documentation. Keep the RCU counter internally in
half units, but return the actual (whole unit) value.
Change the RCU name to be rcu_half_units_total to indicates that it
counts half units.
Change the WCU to count in whole units instead of half units.

Update the tests accordingly.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-05-05 13:18:09 +03:00
Anna Stuchlik
851a433663 doc: add a link to the previous Enterprise documentation
This commit adds a link to the docs for previous Enterprise versions
at https://enterprise.docs.scylladb.com/ to the left menu.

As we still support versions 2024.1 and 2024.2, we need to ensure
easier access to those docs sets.

Fixes https://github.com/scylladb/scylladb/issues/23870

Closes scylladb/scylladb#23945
2025-05-05 12:16:47 +03:00
Avi Kivity
04fb2c026d config: decrease default large allocation warning threshold to 128k
Back in 2017 (5a2439e702), we introduced a check for large
allocations as they can stall the memory allocator. The warning
threshold was set at 1 MB. Since then many fixes for large allocations
went in and it is now time to reduce the threshold further.

We reduce it here to 128 kB, the natural allocation size for the
system. A quick run showed no warnings.

Closes scylladb/scylladb#23975
2025-05-05 12:13:48 +03:00
Pavel Emelyanov
b56d6fbb84 Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried.
And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.

Fixes #23634.

We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.

Closes scylladb/scylladb#23806

* github.com:scylladb/scylladb:
  test: Verify partitioned set store split and unsplit correctly
  sstables: Fix quadratic space complexity in partitioned_sstable_set
  compaction: Wire table_state into make_sstable_set()
  compaction: Introduce token_range() to table_state
  dht: Add overlap_ratio() for token range
2025-05-05 11:28:38 +03:00
David Garcia
4ba7182515 docs: fix md redirections for multiversion support
This change resolves an issue where selecting a version from the multiversion dropdown on Markdown pages (e.g. https://docs.scylladb.com/manual/stable/alternator/getting-started.html) incorrectly redirected users to the main page instead of the corresponding versioned page.

The underlying cause was that the `multiversion` extension relies on `source_suffix` to identify available pages for URL mapping. Without this configuration, proper redirection fails for `.md` files.

This fix should be backported to `2025.1` to ensure correct behavior. Otherwise, the fix will only take effect in future releases.

Testing locally is non-trivial: clone the repository, apply the changes to each relevant branch, set `smv_remote_whitelist` to "", then run `make multiversionpreview`. Afterward, switch between versions in the dropdown to verify behavior. I've tested it locally, so the best next step is to merge and confirm that it works as expected in the live environment.

Closes scylladb/scylladb#23957
2025-05-05 10:39:39 +03:00
Pavel Emelyanov
7b786d9398 topology_coordinator: Use this->_feature_service directly
This dependency is already there, topology coordinator doesn't need
to use database reference to get to the features.

Previous patch of the same kind: b79137eaa4

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23777
2025-05-05 09:37:29 +02:00
Piotr Dulikowski
05c797795f Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov
It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet.

Closes scylladb/scylladb#23815

* github.com:scylladb/scylladb:
  test: Remove sstable_assertions::get_stats_metadata()
  test: Add sstable_assertions::operator->()
2025-05-05 09:33:45 +02:00
Nadav Har'El
834107ae97 test/cqlpy,alternator: fix reporting of Scylla crash during test
The cqlpy and alternator test frameworks use a single Scylla node started
once for all tests to run on. In the distant past, we had a problem where
if one test caused Scylla to crash, the result was a confusing report of
hundreds of failed tests - all tests after the crash "failed" and it wasn't
easy to find which test really caused the crash.

Our old solution to this problem was to have an autouse fixture (called
cql_test_connection or dynamodb_test_connection) which tested the
connection at the end of each test, and if it detected Scylla has
crashed - it used pytest.exit() to report the error and have pytest
exit and therefore stop running any further tests (which would have
led to all of them testing).

This approach had two problems:

1. The pytest.exit() caused the entire cqlpy suite to report a failure,
   but but not the individual test - the individual test might have
   failed as well, but that isn't guaranteed and in any case this test's
   output is missing the informative message that Scylla crashed during
   the test. This was fine when for each cqlpy failure we had two separate
   error logs in Jenkins - the specific failed function, and the failed
   file - but when we recently got rid of the suplication by removing the
   second one, we no longer see the "Scylla crashed" messages any more.

2. Exiting pytest will be the wrong thing to do if the same pytest
   run could run tests from different test suites. We don't do this
   today, but we plan to support this approach soon.

This patch fixes both problems by replacing the pytest.exit() call by
setting a "scylla_crashed" flag and using pytest.fail(). The pytest.fail()
causes the current test - the one which caused Scylla to crash - to be
reported as an "ERROR" and the "Scylla crashed" message will correctly
appear in this test's log. The flag will cause all other tests in the
same test suite to be skip()ed. But other tests in other directories,
depending on different fixtures, might continue to run normally.

Fixes #23287

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23307
2025-05-05 10:15:56 +03:00
Nadav Har'El
3ce7e250cc alternator: fix schema "concurrent modification" errors
In ScyllaDB, schema modification operations use "optimistic locking":
A schema operation reads the current schema, decides what it wants to do
and prepares changes to the schema, and then attempts to commit those
changes - but only if the schema hasn't changed since the first read.
If the schema has already been changed by some other node - we need to
try again. In a loop.

In Alternator, there are six operations that perform schema modification:
CreateTable, DeleteTable, UpdateTable, TagResource, UntagResource and
UpdateTimeToLive. All of them were missing this loop. We knew about
this - and even had FIXME in all places. So all these operations,
when facing contention of concurrent schema modifications on different
nodes may fail one of these operations with an error like:

   Internal server error: service::group0_concurrent_modification
   (Failed to apply group 0 change due to concurrent modification).

This problem had very minor effect, if any, on real users because the
DynamoDB SDK automatically retries operations that fail with retryable
errors - like this "Internal server error" - and most likely the schema
operation will succeed upon retry. However, as shown in issue #13152
these failures were annoying in our CI, where tests - which disable
request retries - failed on these errors.

This patch fixes all six operations (the last three operations all
use one common function, db::modify_tags(), so are fixed by one
change) to add the missing loop.

The patch also includes reproducing tests for all these operations -
the new tests all fail before this patch, and pass with it.

These new tests are much more reliable reproducers than the dtests
we had that only sometimes - very rarely - reproduced the problem.
Moreover, the new tests reproduces the bug seperately for each of the
six operations, so if we forget to fix one of the six operations, one
of the tests would have continued to fail. Of course I checked this
during development.

The new tests are in the test/cluster framework, not test/alternator,
because this problem can only be reproduced in a multi-node cluster:
On a single node, it serializes its schema modifications on its own;
The collisions only happen when more than one node attempts schema
modifications at the same time.

Fixes #13152

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23827
2025-05-05 09:59:08 +03:00
Pavel Emelyanov
d40d6801b0 sstable_directory: Print ks.cf when moving unshared remove sstables
When an sstable is identified by sstable_directory as remote-unshared,
it will at some point be moved to the target shard. When it happens a
log-message appears:

    sstable_directory - Moving 1 unshared SSTables to shard 1

Processing of tables by sstable_directory often happens in parallel, and
messages from sstable_directory are intermixed. Having a message like
above is not very informative, as it tells nothing about sstables that
are being moved.

Equip the message with ks:cf pair to make it more informative.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23912
2025-05-05 09:45:44 +03:00
Pavel Emelyanov
e0f30a30a7 sstable_directory: Print unshared remote sstable when sorting
When collecting sstables, the sstable_directory may sort the collected
descriptors into one of three buckets -- unshared local and remote, and
shared ones. Unshared local and shared sstables' paths are loggerd (with
trace level) while unshared remote is silently collected for further
processing. Add log message for that case too, there's enough data to
print the sstable path as well.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23913
2025-05-05 09:33:06 +03:00
Piotr Dulikowski
8ffe4b0308 utils::loading_cache: gracefully skip timer if gate closed
The loading_cache has a periodic timer which acquires the
_timer_reads_gate. The stop() method first closes the gate and then
cancels the timer - this order is necessary because the timer is
re-armed under the gate. However, the timer callback does not check
whether the gate was closed but tries to acquire it, which might result
in unhandled exception which is logged with ERROR severity.

Fix the timer callback by acquiring access to the gate at the beginning
and gracefully returning if the gate is closed. Even though the gate
used to be entered in the middle of the callback, it does not make sense
to execute the timer's logic at all if the cache is being stopped.

Fixes: scylladb/scylladb#23951

Closes scylladb/scylladb#23952
2025-04-30 16:43:22 +03:00
Aleksandra Martyniuk
1f4edd8683 test_tablet_tasks: use injection to revoke resize
Currently, test_tablet_resize_revoked tries to trigger split revoke
by deleting some rows. This method isn't deterministic and so a test
is flaky.

Use error injection to trigger resize revoke.

Fixes: #22570.

Closes scylladb/scylladb#23966
2025-04-30 07:04:57 +03:00
Michał Chojnowski
9e2343ecb0 test_sstable_compression_dictionaries_autotrain: raise the timeout
There were CI runs in which the training happened as planned,
but it was too slow to fit within the timeout.

Raise the timeout to pacify the CI.

Fixes scylladb/scylladb#23964

Closes scylladb/scylladb#23965
2025-04-29 22:09:14 +03:00
Raphael S. Carvalho
d5bee4c814 test: Verify partitioned set store split and unsplit correctly
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
c77f710a0c sstables: Fix quadratic space complexity in partitioned_sstable_set
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of)
intervals, since each such entry will have presence on all
intervals it overlaps with.

A trigger we observed was memtable flush storm, which creates many
small "L0" sstables that spans roughly the entire token range.

Since we cannot rely on insertion order, solution will be about
storing sstables with such wide ranges in a vector (unleveled).

There should be no consequence for single-key reads, since upper
layer applies an additional filtering based on token of key being
queried.
And for range scans, there can be an increase in memory usage,
but not significant because the sstables span an wide range and
would have been selected in the combined reader if the range of
scan overlaps with them.

Anyway, this is a protection against storm of memtable flushes
and shouldn't be the common scenario.

It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.

Fixes #23634.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding token range owned by compaction group
into sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
59dad2121f compaction: Introduce token_range() to table_state
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Raphael S. Carvalho
494ed6b887 dht: Add overlap_ratio() for token range
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Patryk Jędrzejczak
0cdcf82cd0 Merge 'topology coordinator: do not proceed further on invalid boostrap tokens' from Piotr Dulikowski
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897

From the code inspection alone it looks like 2025.1 and 6.2 have this problem, so marking for backport to both of them.

Closes scylladb/scylladb#23914

* https://github.com/scylladb/scylladb:
  test: cluster: add test_bad_initial_token
  topology coordinator: do not proceed further on invalid boostrap tokens
  cdc: add sanity check for generating an empty generation
2025-04-28 12:45:33 +02:00
Botond Dénes
d582c436e5 Merge 'tasks: check whether a node is alive before rpc' from Aleksandra Martyniuk
Check whether a node is alive before making an rpc that gathers children
infos from the whole cluster in virtual_task::impl::get_children.

Fixes: https://github.com/scylladb/scylladb/issues/22514.

Needs backport to 2025.1 and 6.2 as they contain the bug.

Closes scylladb/scylladb#23787

* github.com:scylladb/scylladb:
  test: add test for getting tasks children
  tasks: check whether a node is alive before rpc
2025-04-28 09:32:45 +03:00
Nadav Har'El
262530f27c Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros
Currently, the base_info may or may not be set in view schemas.
Even when it's set, it may be modified. This necessitates extra
checks when handling view schemas, as we'll as potentially causing
errors when we forget to set it at some point.

Instead, we want to make the base info an immutable member of view
schemas (inside view_info). To achieve this, in this series we remove
all base_info members that can change due to a base schema update,
and we calculate the remaining values during view update generation,
using the most up-to-date base schema version.

To calculate the values that depend on the base schema version, we
need to iterate over the view primary key and find the corresponding
columns, which adds extra overhead for each batch of view updates.
However, this overhead should be relatively small, as when creating
a view update, we need to prepare each of its columns anyway. And
if we need to read the old value of the base row, the relative
overhead is even lower.

After this change, the base info in view schemas stays the same
for all base schema updates, so we'll no longer get issues with
base_info being incompatible with a base schema version. Additionally,
it's a step towards making the schema objects immutable, which
we sometimes incorrectly assumed in the past (they're still not
completely immutable yet, as some other fields in view_info other
than base_info are initialized lazily and may depend on the base
schema version).

Fixes https://github.com/scylladb/scylladb/issues/9059
Fixes https://github.com/scylladb/scylladb/issues/21292
Fixes https://github.com/scylladb/scylladb/issues/22194
Fixes https://github.com/scylladb/scylladb/issues/22410

Closes scylladb/scylladb#23337

* github.com:scylladb/scylladb:
  test: remove flakiness from test_schema_is_recovered_after_dying
  mv: add a test for dropping an index while it's building
  base_info: remove the lw_shared_ptr variant
  view_info: don't re-set base_info after construction
  base_info: remove base_info snapshot semantics
  base_info: remove base schema from the base_info
  schema_registry: store base info instead of base schema for view entries
  base_info: make members non-const
  view_info: move the base info to a separate header
  view_info: move computation of view pk columns not in base pk to view_updates
  view_info: move base-dependent variables into base_info
  view_info: set base info on construction
2025-04-27 19:12:12 +03:00
David Garcia
cf7d846b9e docs: update dependencies
This is a mandatory dependency update to resolve a critical Dependabot alert. For more details, see the [Dependabot alerts](https://docs.github.com/en/code-security/dependabot/dependabot-alerts/viewing-and-updating-dependabot-alerts).

Closes scylladb/scylladb#23918

Fixes #23935
2025-04-27 18:45:11 +03:00
Piotr Szymaniak
e588c8667f alternator: Limit attribute name lengths
Attribute names are now checked against DynamoDB-compatible length
limits. When exceeded, Alternator emits exception identical or similar
to the DDB one. It might be worth noting that DDB emits more than a
single kind of an exception string for some exceptions. The tests'
catch clauses handle all the observed kinds of messages from DynamoDB.
The validation differentiates between key and non-key attributes and
applies the limit accordingly.

AWS DDB raises exceptions with somewhat different contents when the
get request contains ProjectionExpression, so this case needed separate
treatment to emit the corresponding exception string. The
length-validating function was declared and defined in
expressions.hh/.cc respectively, because that's where the relevant
parsing happens.

** Tests

The following tests were validated when handling this issue:
test_limit_attribute_length_nonkey_good,
test_limit_attribute_length_nonkey_bad,
test_limit_attribute_length_key_good,
test_limit_attribute_length_key_bad,
test_limit_attribute_length_gsi_lsi_good,
test_limit_attribute_length_gsi_lsi_bad,
test_limit_attribute_length_gsi_lsi_projection_bad.

Some of the tests were expanded into being more granular. Namely, there
is a new test function
`test_limit_attribute_length_key_bad_incoherent_names`
which groups tests with too long attribute names in the case of
incorrect (incoherent) user requests.
Similarily, there is a new test function
`test_limit_attribute_length_gsi_lsi_bad_incoherent_names`
All the tests cover now each combination of the key/keys being too long.
Both the new fuctions contain tests that verify that ScyllaDB throws
length-related exceptions (instead of the coherency-related), similar
to what DynamoDB does.

The new test test_limit_gsiu_key_len_bad covers the case of too long
attribute name inside GlobalSecondaryIndexUpdates.
The new test test_limit_gsiu_key_len_bad_incoherent_names covers the
case of incorrect (incoherent) user requests containing too long
attribute names and GlobalSecondaryIndexUpdates.

test_limit_attribute_length_key_bad was found to have contaned an
illegal KeySchema structure.

Some of the tests were corrected their match clause.

All the tests are stripped of the xfail flag except
test_limit_attribute_length_key_bad, which has it changed since it
still fails due to Projection in GSI and LIS not implemented in Alternator.
The xfail now points to #5036.

Fixes scylladb/scylladb#9169

Closes scylladb/scylladb#23097
2025-04-27 18:39:20 +03:00
Piotr Dulikowski
82e1678fbe test: mv: skip test_mv_tablets_empty_ip in debug mode
This test shuts down a node and then replaces it with another one while
continuously writing to the cluster. The test has been observed to take
a lot of time in debug mode and time out on the replace operation.
Replace takes very long because rebuilding tablets on the new node is
very slow, and the slowest part is memtable flush which happens at the
beginning of streaming. The slowness seems to be specific to the debug
mode.

Turn off the test in debug mode to deflake the CI. As a follow-up, the
test is planned to be reworked into an quicker error injection test so
that the code path tested by this test will be again exercised in debug
unit tests (scylladb/scylladb#23898)

Fixes: scylladb/scylladb#20316

Closes scylladb/scylladb#23900
2025-04-27 18:06:08 +03:00
Piotr Dulikowski
670a69007e test: cluster: add test_bad_initial_token
Adds a test which checks that rollback works properly in case when a bad
value of the initial_token function is provided.
2025-04-25 12:25:15 +02:00
Piotr Dulikowski
845cedea7f topology coordinator: do not proceed further on invalid boostrap tokens
In case when dht::boot_strapper::get_boostrap_tokens fail to parse the
tokens, the topology coordinator handles the exception and schedules a
rollback. However, the current code tries to continue with the topology
coordinator logic even if an exception occurs, leaving boostrap_tokens
empty. This does not make sense and can actually cause issues,
specifically in prepare_and_broadcast_cdc_generation_data which
implicitly expect that the bootstrap_tokens of the first node in the
cluster will not be empty.

Fix this by adding the missing break.

Fixes: scylladb/scylladb#23897
2025-04-25 11:30:01 +02:00
Piotr Dulikowski
66acaa1bf8 cdc: add sanity check for generating an empty generation
It doesn't make sense to create an empty CDC generation because it does
not make sense to have a cluster with no tokens. Add a sanity check to
cdc::make_new_generation_description which fails if somebody attempts to
do that (i.e. when the set of current tokens + optionally bootstrapping
node's tokens is empty).

The function does not work correctly if it is misused, as we saw in
scylladb/scylladb#23897. While the function should not be misused in the
first place, it's better to throw an exception rather than crash -
especially that this crash could happen on the topology coordinator.
2025-04-25 11:25:07 +02:00
Aleksandra Martyniuk
76cd707b18 test: test_tablets: wait for cql
Wait for cql after rolling restart in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
to prevent failing queries.

Fixes: #23620.

Closes scylladb/scylladb#23796
2025-04-24 21:25:29 +03:00
Patryk Jędrzejczak
2a8bb47cfb test: test_zero_token_nodes_topology_ops: use host IDs for ignored nodes
Providing IP of an ignored node during removenode made the test flaky.
It could happen that the address map contained mappings of two
nodes with the same IP:
1. the node being ignored,
2. the node that expectedly failed replacing earlier in the test.

So, `address_map::find_by_addr()` called in `find_raft_nodes_from_hoeps`
could return the host ID of the second node instead of the first node
and cause removenode to fail.

We fix flakiness in this patch by providing the host ID of the ignored
node instead of its IP. We would have to do it anyway sooner or later
because providing IP is deprecated.

The bug in `find_raft_nodes_from_hoeps` is tracked by
scylladb/scylladb#23846.

The test became flaky because of f0af3f261e.
That patch is not present in 2025.1, so the test isn't flaky outside
master, and hence there is no reason to backport this patch.

Fixes scylladb/scylladb#23499

Closes scylladb/scylladb#23863
2025-04-24 20:17:19 +03:00
Pavel Emelyanov
68a178eba9 Merge 'replica: skip flush of dropped table' from Aleksandra Martyniuk
Currently, flush throws no_such_column_family if a table is dropped. Skip the flush of dropped table instead.

Fixes: #16095.

Needs backport to 2025.1 and 6.2 as they contain the bug

Closes scylladb/scylladb#23876

* github.com:scylladb/scylladb:
  test: test table drop during flush
  replica: skip flush of dropped table
2025-04-24 20:02:59 +03:00