- Restore test_table_ts_lwt fixture with system:write_isolation=always to
explicitly test that the timestamp attribute is rejected in LWT_ALWAYS mode
- Add test_timestamp_attribute_lwt_always_rejected which verifies that even
a plain PutItem with a timestamp is rejected when always_use_lwt is set
- Keep test_timestamp_attribute_with_condition_rejected using test_table_ts
(with the test runner's default only_rmw_uses_lwt isolation) to test
that a ConditionExpression triggers LWT rejection
- Update docs: fix item 4 (non-numeric now rejected), improve Limitations
section to clearly state always_use_lwt is incompatible with the feature
and recommend system:write_isolation=only_rmw_uses_lwt
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Use scope="module" instead of testpy_test_fixture_scope in fixtures
- Rename test_table_ts_sc to test_table_ts_ss (ss = string+string keys)
- Remove test_table_ts_lwt; use test_table_ts for LWT-rejection test
(the test server runs with only_rmw_uses_lwt, so conditions trigger LWT)
- Add comment that fixtures make tests implicitly Scylla-only
- Change non-numeric timestamp attribute behavior: reject with
ValidationException instead of silently storing (test + C++ implementation)
- Add test_timestamp_attribute_microseconds: verifies the timestamp unit
is microseconds and tests interaction with default server timestamps
- Add import time for the new microseconds test
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
Recently we suffered a regression in how Alternator TTL behaves when a node goes down while tablets are used.
Usually, expiration of data in a particular tablet is handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expirations until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same, the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So in this series, in addition to fixing the bug, we add two tests that reproduce it (they fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica() (sketched right after this list)
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (though it only failed in the release build, for an unknown reason).
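As a rough illustration, the invariant reads like the following hedged fragment (the tablet_map API shape here is an assumption, not a copy of the actual test):

```cpp
// for every tablet, the secondary replica must differ from the primary one;
// if they ever match, that tablet's expiration work is never taken over
for (auto tid : tmap.tablet_ids()) {
    BOOST_REQUIRE(tmap.get_secondary_replica(tid) != tmap.get_primary_replica(tid));
}
```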
Fixes SCYLLADB-777.
Closes scylladb/scylladb#28771
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()
The current way of checking Boost's stdout can have a race
condition where pytest tries to read the file before it has really been
flushed. This PR should eliminate this possibility.
Closes scylladb/scylladb#28783
PR #28703 was merged into master but not with the latest version of the
changes. This patch is an incremental fix for this.
Currently, the elements of the tablet_sizes_per_shard vector are
incremented on separate shards. This is prone to false sharing of cache
lines and ping-ponging of memory, which leads to reduced performance.
In this patch, in order to avoid cache line collisions while updating
the sum of tablet sizes per shard, we align the counter to 64 bytes.
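As a hedged sketch of this kind of fix (the element type name below is made up, not taken from the actual patch):

```cpp
#include <cstdint>
#include <vector>

// Each shard increments only its own element; padding every element to a
// full 64-byte cache line keeps two shards from ping-ponging the same line.
struct alignas(64) tablet_size_counter {
    uint64_t value = 0;
};

std::vector<tablet_size_counter> tablet_sizes_per_shard; // one slot per shard
```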
Fixes: SCYLLADB-678
Closes scylladb/scylladb#28757
This PR adds an await for each one of the tasks, to wait for the MV schema
to be added successfully before starting the server shutdown.
With this change we will not get the shutdown races.
Closes scylladb/scylladb#28774
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.
Fixes SCYLLADB-778
Closes scylladb/scylladb#28781
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204
Closes scylladb/scylladb#28746
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.
It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effort there, just skip these tests when the
problem occurs. This solves the CI flakiness.
Fixes: #22501
Closes scylladb/scylladb#28745
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.
The reason for this is that the tests are flaky, with an example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)
When CPU is starved on CI we can return timeouts and/or other errors.
The change should make tests more robust at the expense of a smaller test
scope. But those tests were written mostly to test the startup sequence,
as it differs from Scylla's startup.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759
Closes scylladb/scylladb#28767
Lua doesn't have separate integer and floating point numbers,
so we check if a number can fit in an integer and if so convert
it to an integer.
The conversion routine invokes undefined behavior (and even
acknowledges it!). More recent compilers changed their behavior
when casting infinities, breaking test_user_function_double_return
which tests this conversion.
Fix by tightening the conversion to not invoke undefined behavior.
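A minimal sketch of a UB-free conversion of this kind, assuming the goal is to cast a double to int64_t only when it fits (the actual routine is in the patch):

```cpp
#include <cmath>
#include <cstdint>
#include <optional>

// Casting a double outside int64_t's range (including infinities and NaN)
// to int64_t is undefined behavior, so check the range before casting.
std::optional<int64_t> to_integer(double d) {
    // 2^63 is exactly representable as a double; valid inputs are the
    // integral values in [-2^63, 2^63)
    if (std::isfinite(d) && d >= -9223372036854775808.0
            && d < 9223372036854775808.0 && d == std::floor(d)) {
        return static_cast<int64_t>(d);
    }
    return std::nullopt; // keep it as a floating-point number
}
```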
Closes scylladb/scylladb#28503
This patchset:
- ensures the loading semaphore is acquired in cross-shard callbacks
- fixes iterator invalidation problem when reloading all cached permissions
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780
Backport: no, affected code not released yet
Closes scylladb/scylladb#28766
* github.com:scylladb/scylladb:
auth: cache: fix permissions iterator invalidation in reload_all_permissions
auth/cache: acquire _loading_sem in cross-shard callbacks
The hostent::addr_list field is deprecated in favor of the address_entry::addr
field, which contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28565
- add an overload to the rest http client to accept a retry strategy instance as an argument
- remove hand-rolled error handling from the object storage client and replace it with common machinery that supports handling and retrying when appropriate
No backport needed since it is only refactoring
Closes scylladb/scylladb#28161
* github.com:scylladb/scylladb:
object_storage: add retryable machinery to object storage
rest_client: add `simple_send` overload
This patch adds a unit test for tablet_map::get_secondary_replica().
It was never officially defined how the "primary" and "secondary"
replicas were chosen, and their implementation changed over time,
but the one invariant that this test verifies is that the secondary
replica and the primary replica must be different nodes.
This test reproduces issue SCYLLADB-777, where we discovered that
the get_primary_replica() changed without a corresponding change to
get_secondary_replica(). So before the previous patch, this test failed,
and after the previous patch, it passes.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have many single-node functional tests for Alternator TTL in
test/alternator/test_ttl.py. This patch adds a multi-node test in
test/cluster/test_alternator.py. The new test verifies that:
1. Even though Alternator TTL splits the work of scanning and expiring
items between nodes, all the items get correctly expired.
2. When one node is down, all the items still expire because the
"secondary" owner of each token range takes over expiring the
items in this range while the "primary" owner is down.
This new test is actually a port of a test we already had in dtest
(alternator_ttl_tests.py::test_multinode_expiration). This port is
faster and smaller than the original (fewer nodes, fewer rows), but it
still found a regression (SCYLLADB-777) that dtest missed - the new test
failed when running with tablets and in release build mode.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The function tablet_map::get_secondary_replica() is used by Alternator
TTL to choose a node different from get_primary_replica(). Unfortunately,
recently (commits 817fdad and d88037d) the implementation of the latter
function changed, without changing the former. So this patch changes
the former to match.
The next two patches will have two tests that fail before this patch,
and pass with it:
1. A unit test that checks that get_secondary_replica() returns a
different node than get_primary_replica().
2. An Alternator TTL test that checks that when a node is down,
expirations still happen because the secondary replica takes over
the primary replica's work.
Fixes SCYLLADB-777
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Move the storage test suite from test/storage/ to test/cluster/storage/
to consolidate related cluster-based tests. This removes the standalone
test/storage/suite.yaml, as the tests will use the cluster's test configuration.
Initially these tests were in test/cluster/, but in the first iteration they
were moved outside in order to use unshare. Now that they handle volumes
another way, without unshare, they should be back in test/cluster/.
Closes scylladb/scylladb#28634
ANN vector queries with all-zero vectors are allowed even on vector indexes with the similarity function set to cosine.
When enabling the rescoring option, those queries would fail, as the rescoring calls the `similarity_cosine` function underneath, causing an `InvalidRequest` exception, since all-zero vectors were not allowed, matching Cassandra's behaviour.
To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return NaN, as the cosine similarity for zero vectors is mathematically undefined. We decided not to use arbitrary values, contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which has no mathematical justification for being more similar than, for example, vectors forming obtuse angles.
It's safe to assume that all-zero vectors for cosine similarity shouldn't have any impact, therefore we return NaN and eliminate them from the best results.
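A minimal sketch of the chosen semantics (hedged: this is not the actual ScyllaDB implementation, just the behavior described above):

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <span>

// Returns NaN when either vector is all-zero: the norm is 0, the cosine is
// undefined, and NaN keeps such candidates out of the best results.
float similarity_cosine(std::span<const float> a, std::span<const float> b) {
    float dot = 0, na = 0, nb = 0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    if (na == 0 || nb == 0) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}
```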
Adjusted the tests accordingly to check both the proper Cassandra behaviour and Scylla's.
Fixes: SCYLLADB-456
Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there.
Closes scylladb/scylladb#28609
* github.com:scylladb/scylladb:
test/vector_search: add reproducer for rescoring with zero vectors
vector_search: return NaN for similarity_cosine with all-zero vectors
This patch series moves `test/cluster/dtest/guardrails_test.py`
to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/`
to `cluster/` framework.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
No backport, `dtest/guardrails_test.py` is only on master
Closes scylladb/scylladb#28737
* github.com:scylladb/scylladb:
test: move dtest/guardrails_test.py to test_guardrails.py
test: prepare guardrails_test.py to be moved to test/cluster/
The inner loops in reload_all_permissions iterate each role's permissions
map and the _anonymous_permissions map across yield points. Concurrent
load_permissions calls (which don't hold _loading_sem) can emplace
into those same maps during a yield, potentially triggering a rehash
that invalidates the active iterator.
We want to avoid adding semaphore acquire in load_permissions
because it's on a common path (get_permissions).
Fix by snapshotting the keys into a vector before iterating with
yields, so no long-lived map iterator is held across suspension
points.
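A minimal sketch of the snapshot-then-iterate pattern, with hypothetical names standing in for the real permissions maps:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <string>
#include <unordered_map>
#include <vector>

struct permissions {};                            // stand-in type
seastar::future<> reload_one(const std::string&); // stand-in reload step

seastar::future<> reload_all(std::unordered_map<std::string, permissions>& map) {
    std::vector<std::string> keys;
    keys.reserve(map.size());
    for (const auto& [k, v] : map) {
        keys.push_back(k); // snapshot keys: no suspension while iterating
    }
    for (const auto& k : keys) {
        if (map.contains(k)) {       // entry may have vanished while yielding
            co_await reload_one(k);  // no live map iterator across this yield
        }
    }
}
```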
This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop.
Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths.
Patch 2 adds explicit callback lifetime coordination in view_builder:
- introduce a seastar::gate member
- acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures
- keep the hold alive until each detached future resolves
- close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown
Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation.
Testing:
- pytest --test-py-init test/cluster/mv (47 passed, 7 skipped)
backport: not required, the issue started happening in master
fixes: SCYLLADB-687
Closes scylladb/scylladb#28648
* github.com:scylladb/scylladb:
db/view: gate detached view-builder callbacks during shutdown
db/view: refactor on_update_view to use coroutine dispatcher
distribute_role() modifies _roles on non-zero shards via
invoke_on_others() without holding _loading_sem. Similarly, load_all()'s
invoke_on_others() callback calls prune_all() without the semaphore.
When these run concurrently with reload_all_permissions(), which
iterates _roles across yield points, an insertion can trigger
absl::flat_hash_map::resize(), freeing the backing storage while
an iterator still references it.
Fix by acquiring _loading_sem on the target shard in both
distribute_role()'s and load_all()'s invoke_on_others callbacks,
serializing all _roles mutations with coroutines that iterate
the map.
Remove hand-rolled error handling from the object storage client
and replace it with common machinery that supports exception
handling and retrying when appropriate.
The test uses a create_ks_and_cf helper that duplicates existing code doing the same. This PR patches the basic tests to use standard facilities. It also prepares the ground for testing keyspace storage options with rf=3.
Cleaning tests, not backporting
Closes scylladb/scylladb#28600
* https://github.com/scylladb/scylladb:
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on a dev build, and because it is CPU-bound (it doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows, which is the default
select_internal_page_size.
But after the patches in pull request #25368, we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.
As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).
I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#28268
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.
Closes scylladb/scylladb#28733
The test_restore_with_streaming_scopes test, among other things, checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and the streaming checks were effectively turned off.
The good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.
Minor test fix, not backporting
Closes scylladb/scylladb#28607
* github.com:scylladb/scylladb:
test: Fix the condition for streaming directions validation
test: Split test_backup.py::check_data_is_back() into two
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before the decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
Only 2026.1 is affected.
Closes scylladb/scylladb#28688
* github.com:scylladb/scylladb:
test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
test: cluster: task_manager_client: Introduce wait_task_appears()
tests: pylib: util: Add exponential backoff to wait_for
There's a bunch of incremental repair tests that want to call the scylla
sstable command. For that, they try to find the scylla binary by
scanning the /proc directory (see the local_process_id and get_scylla_path
helpers).
There's a shorter way -- just call manager.get_server_exe().
Same for backup-restore test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28676
There are three tests and a function with a pair of boolean parameters
called by them. It's less code if the function becomes a test with
parameters.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28677
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with the 'move_files' parameter being true/false.
In fact, the "move" test was a copy of the "simple" one that dropped the check
for the scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and the check for the destination bucket
containing the needed files (same here -- checking that files arrived in the
bucket after --move-files is good).
At the end of the day, after the change, the backup test is run two times
instead of three, and performs extra checks for the --move-files case.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28606
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.
Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.
This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.
Backport: none, because this fixes internal documentation.
Closes scylladb/scylladb#28126
* github.com:scylladb/scylladb:
protocol-extensions.md: Fix client_options docs
system_keyspace.md: Add client_options column
system_keyspace.md: Fix order in system.clients
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28630
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.
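A hedged sketch of the difference (the signatures and stub declarations below are assumed from the patch description, not copied from the raft headers):

```cpp
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <stdexcept>

// Stub shapes standing in for the real raft headers:
namespace raft {
struct request_aborted : std::exception {};
struct server {
    seastar::future<> wait_for_state_change(seastar::abort_source*);
};
}

seastar::future<> wait_wrong(raft::server& s, seastar::abort_source& as) {
    try {
        return s.wait_for_state_change(&as); // returns at once; catch is dead code
    } catch (const raft::request_aborted&) {
        throw std::runtime_error("waiting for state change was aborted"); // never runs
    }
}

seastar::future<> wait_right(raft::server& s, seastar::abort_source& as) {
    try {
        co_await s.wait_for_state_change(&as); // the exception surfaces here
    } catch (const raft::request_aborted&) {
        throw std::runtime_error("waiting for state change was aborted"); // reachable
    }
}
```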
Fixes SCYLLADB-665
Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.
Closes scylladb/scylladb#28624
* https://github.com/scylladb/scylladb:
test: raft: Add test_aborting_wait_for_state_change
raft: Describe exception types for wait_for_state_change and wait_for_leader
raft: Await instead of returning future in wait_for_state_change
This commit moves `guardrails_test.py`, prepared in the previous
commit of this patch series, to `test/cluster/test_guardrails.py`.
It also cleans up `suite.yaml`.
Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and
make it compatible with the `test/cluster/` framework. This will
allow moving this file from `test/cluster/dtest/` to `test/cluster/`
in the next commit of this patch series.
There are two motivations for moving the test:
- Execution time reduction (from 12s to 9s in 'dev' in my env)
- Facilitate adding new tests to the `guardrails_test.py` file
There are 3 metrics (that go into every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable
When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
the grace period), it can currently be accounted as a failure due to
overlapping with either a memtable or an uncompacting sstable.
So those last 2 metrics have noise from *unexpired* tombstones.
What we should do is only account for expired tombstones in all
those 3 metrics. We lose the info about the amount of tombstones
processed by compaction; now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.
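A hedged sketch of the accounting change; the type and helper names are stand-ins (declarations omitted), while the three metric names are the real ones:

```cpp
void account_tombstone_purge(const tombstone& t, gc_clock::time_point gc_before,
                             compaction_stats& stats) {
    if (t.deletion_time >= gc_before) {
        return; // unexpired: no longer counted in any of the 3 metrics
    }
    ++stats.total_tombstone_purge_attempt;
    if (overlaps_memtable(t)) {
        ++stats.total_tombstone_purge_failure_due_to_overlapping_with_memtable;
    } else if (overlaps_uncompacting_sstable(t)) {
        ++stats.total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable;
    }
}
```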
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#28669
Links were pointing to the `debian` subdirectory. However, the docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910
No backport, just README link fixes.
Closes scylladb/scylladb#28699
* github.com:scylladb/scylladb:
docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
docs: fix link to docker build README.MD
The patchset fixes the abort_source implementation for perf-alternator and perf-cql-raw. It moves
the run_standalone function to common code in perf.hh with the necessary templating.
We also add extensive testing so that it's more difficult to break the tooling in the future.
Fixes SCYLLADB-560
Backport: no, internal tooling improvement
Closes scylladb/scylladb#28541
* github.com:scylladb/scylladb:
test: cluster: add tests for perf tools
test: perf: fix port race condition on startup in connect workload
test: perf: prepare benchmarks to bind to custom host
test: perf: make perf-alternator remote port configurable
test: perf: fix ASAN leak warnings in perf-alternator
Reapply "main: test: add future and abort_source to after_init_func"
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
We also raise an internal error in a few places to prevent a segmentation fault.
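A hedged sketch of the replacement pattern (the message and call site are made up; seastar::on_internal_error throws std::runtime_error unless abort-on-internal-error is configured):

```cpp
#include <seastar/util/log.hh>

static seastar::logger tlogger("topology");

// Before: SCYLLA_ASSERT(...) / on_fatal_internal_error(...) aborts the node.
// After: throw via on_internal_error, so only the topology fiber fails and
// the node (and the rest of group0) keeps running.
void check_invariant(bool invariant_holds) {
    if (!invariant_holds) {
        seastar::on_internal_error(tlogger, "topology invariant violated");
    }
}
```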
Fixes #27987
Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.
Closes scylladb/scylladb#28558
* github.com:scylladb/scylladb:
raft topology: prevent accessing nullptr returned by topology::find
raft topology: make some assertions non-crashing
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments that don't explicitly configure
audit otherwise in `scylla.yaml` (or via the cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).
Closes scylladb/scylladb#28376
* github.com:scylladb/scylladb:
docs: decommission: note audit ks may require ALTERing
docs: mention table audit enabled by default
audit: disable DDL by default
db/config: enable table audit by default
test/cluster: fix `test_table_desc_read_barrier` assertion
test/cluster: adjust audit in tests involving decommissioning its ks
audit_test: fix incorrect config in `test_audit_type_none`
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28540
There are four tests that check how restore with the primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very similar, as are those that check different-racks and different-datacenters. Parametrizing and merging them saves lots of code (+30 lines, -116 lines).
It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so it's not done here. Maybe later.
Improving tests, not backporting
Closes scylladb/scylladb#28569
* https://github.com/scylladb/scylladb:
test: Merge test_restore_primary_replica_different_... tests
test: Merge test_restore_primary_replica_same_... tests
test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
Fix the build of the test and the upload operation flow
No need to backport since it is only a test we barely use
Closes scylladb/scylladb#28595
* github.com:scylladb/scylladb:
s3_perf: fix upload operation flow
s3_perf: fix the CMake build
Tablet migration keeps an sstable snapshot during streaming, which may
cause a temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk usage can rise. When the target tablet size is
configured correctly, every tablet should own about 1% of disk
space. So a concurrency of 4 shouldn't put us at risk. But the target tablet
disk capacity.
Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.
To reduce the impact from this, reduce the concurrency of
migration. A concurrency of 2 should still be enough to saturate
resources on the leaving shard.
Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.
Refs scylladb/siren#15317
Closes scylladb/scylladb#28563
* github.com:scylladb/scylladb:
tablets, config: Reduce migration concurrency to 2
tablets: load_balancer: Always accept migration if the load is 0
config, tablets: Make tablet migration concurrency configurable
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.
Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.
As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.
Fixes SCYLLADB-665
Due to the lack of checks in process_execute_internal from
transport/server.cc, the needs_authorization bool was always set to true,
doing some extra work (check_access()) for each request.
In this patch, we mirror the logic in the test env which perf-simple-query
uses. This can also potentially improve the runtime of unit tests (marginally).
Note that the bug is only in the perf tool, not Scylla itself; the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1
Fixes https://github.com/scylladb/scylladb/issues/27941
Closes scylladb/scylladb#28704
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
I suspect, though, that when update failures happen they're really
caused by the propagation delay of packages to mirrors.
Fix by using --pull=always to get a fresh image.
Ref https://scylladb.atlassian.net/browse/SCYLLADB-714
Closes scylladb/scylladb#28680
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.
This patch fixes the problem by collecting the sums by shard into a
vector, then summing those up.
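A hedged sketch of the per-shard aggregation; local_tablet_size_sum() is a stand-in for the real per-shard computation (the incremental patch earlier in this log additionally pads these slots to full cache lines):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/smp.hh>
#include <numeric>
#include <vector>

uint64_t local_tablet_size_sum(); // stand-in for the per-shard computation

seastar::future<uint64_t> sum_tablet_sizes() {
    // one slot per shard: each shard writes only its own entry, so no two
    // CPUs race on a single shared accumulator
    std::vector<uint64_t> per_shard(seastar::smp::count, 0);
    co_await seastar::smp::invoke_on_all([&per_shard] {
        per_shard[seastar::this_shard_id()] = local_tablet_size_sum();
    });
    co_return std::accumulate(per_shard.begin(), per_shard.end(), uint64_t{0});
}
```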
Refs: SCYLLADB-678
Closes scylladb/scylladb#28703
interval_data's move constructor is conditionally noexcept. It
contains a throw statement for the case that the underlying type's
move constructor can throw; that throw statement is never executed
if we're in the noexcept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.
Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.
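A hedged sketch of the rewrite (not the real interval_data; just the shape of replacing try/catch-rethrow with seastar::defer):

```cpp
#include <seastar/util/defer.hh>
#include <type_traits>
#include <utility>

// The try { work(); } catch (...) { rollback(); throw; } form leaves a
// `throw` inside a conditionally-noexcept function, which Clang 23 warns
// about. seastar::defer expresses the same rollback without any throw
// statement; in the noexcept instantiation it is dead code.
template <typename T>
void move_slot(T& dst, T& src) noexcept(std::is_nothrow_move_assignable_v<T>) {
    auto rollback = seastar::defer([&]() noexcept {
        /* restore dst's invariants here */
    });
    dst = std::move(src); // the only operation that may throw
    rollback.cancel();    // success: nothing to roll back
}
```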
Closes scylladb/scylladb#28710
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
Remove bootstrap and decommission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for bootstrap and decommission is safe and faster
than RBNO in all conditions, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: SCYLLADB-105
Closes scylladb/scylladb#28080
It checks if all workloads can be properly
executed with successful startup and teardown.
Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.
Test runtime:
Release - 24s
Debug - 1m 15s
Test time consists mostly of Scylla startup in various modes.
Other workloads call prepopulate() at startup, which connects
with a retry loop and therefore waits until the CQL port is open.
This commit adds a single place where we wait for the port
for all workloads.
The timeout is set to 5 minutes so that even the slowest machines
are able to start.
There are a handful of places in the code related to dictionary
compression which call get_units to acquire semaphore units, but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.
Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.
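A hedged sketch of the bug class (do_critical_work() is a stand-in):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

seastar::future<> do_critical_work(); // stand-in

seastar::future<> guarded(seastar::semaphore& sem) {
    // BUG: without co_await, `units` would hold a future, not the units --
    // the assignment silences future's nodiscard check, yet nothing is held:
    //   auto units = seastar::get_units(sem, 1);
    auto units = co_await seastar::get_units(sem, 1); // FIX: actually acquire
    co_await do_critical_work(); // runs with the semaphore unit held
}   // the unit is released when `units` goes out of scope
```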
Closes scylladb/scylladb#28581
With the audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
The DDL audit category doesn't make sense if it's enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, all possible audit categories are listed for the user's
convenience.
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments that don't explicitly configure
audit otherwise in `scylla.yaml` (or via the cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
The test assertion `desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER the audit ks to either zero the number of its
replicas, allowing for a clean decommission, or have them in the 2nd DC.
BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Use the explicit string "none" to guarantee that audit is disabled.
There is no point in running repair for tables with RF=1. Row-level
repair will skip them, but the auto repair scheduler will keep scheduling
such repairs since repair_time cannot be updated.
Skip such repairs at the scheduler level for auto repair.
If the request is issued by a user, we will still have to schedule such a
repair; otherwise the user request will never finish.
Fixes SCYLLADB-561
Closes scylladb/scylladb#28640
This commit introduces four changes:
- In the `table` example, singular forms (node, partition) are changed to
plural forms (nodes, partitions). Currently, the default `table`
audit configuration is RF=3 and writes use CL=ONE. Therefore,
a `table` audit log write failure should not be caused by a single
node's unavailability, and plural forms are more adequate.
- In the `table` example, unreachability due to network issues is
mentioned because with RF=3, audit failure due to network problems
is more likely to happen than a simultaneous failure of three
nodes (such network failures happened in SCYLLADB-706).
- In the `syslog` example, a slash `/` is changed to `or`, so `table`
and `syslog` examples have similar structure.
- As the `syslog` line is already being changed, I also change `unix`
to `Unix`, as the capitalized form is the correct one.
Refs SCYLLADB-706
Closes scylladb/scylladb#28702
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
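A hedged sketch of the invariant (the member names are assumptions): re-acquire exactly as many units as the preceding return_all() released.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>
#include <utility>

struct cpu_concurrency_t {
    seastar::semaphore& sem;
    size_t held = 1; // a connection holds 0 or 1 unit

    size_t return_all() {             // 1 on the first call, 0 when interleaved
        size_t n = std::exchange(held, 0);
        sem.signal(n);
        return n;
    }
    seastar::future<> consume_units(size_t n) { // take back only what we returned
        co_await sem.wait(n);
        held += n;
    }
};
// Interleaved as in the description:
//   get(): n1 = return_all() == 1;   put(): n2 = return_all() == 0;
//   get(): consume_units(n1) -> 1;   put(): consume_units(n2) -> 0
```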
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd
Closes scylladb/scylladb#28530
* github.com:scylladb/scylladb:
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
This patchset replaces the permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.
The reason for the change is that we want to improve scenarios under stress and simplify operation manuals. The new cache doesn't require any tweaking, and it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. The old cache can generate a few thousand extra internal tps due to cache refresh.
A benchmark of unprepared statements (just to populate the cache) with 1000 tables shows a 3k tps reduction of internal reads and a 9.1% reduction of median instructions per op. That many tables were used to show the resource impact; the cache could be filled with other resource types to show the same improvement.
Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147
Closes scylladb/scylladb#28078
* github.com:scylladb/scylladb:
test: boost: add auth cache tests
auth: add cache size metrics
docs: conf: update permissions cache documentation
auth: remove old permissions cache
auth: use unified cache for permissions
auth: ldap: add permissions reload to unified cache
auth: add permissions cache to auth/cache
auth: add service::revoke_all as main entry point
auth: explicitly life-extend resource in auth_migration_listener
The hostent::addr_list field is deprecated in favor of the address_entry::addr
field, which contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28566
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28527
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.
The documentation is adjusted since the cache will be
persistent with sccache without further work.
Closes scylladb/scylladb#28524
There are some places that get a `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's a `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.
Recently a stream_range_as_array() helper appeared in seastar that helps stream any range without converting it into an intermediate collection. Some of the hottest users of `map_to_key_value()` had already been converted; this PR converts the few remainders and removes the helper in question to encourage further usage of stream_range_as_array().
Code cleanup, not backporting
Closes scylladb/scylladb#28491
* github.com:scylladb/scylladb:
api: Remove map_to_key_value() helpers
api: Streamify view_build_statuses handler
api: Streamify few more storage_service/ handlers
api: Add map_to_json() helper
api: Coroutinize view_build_statuses handler
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.
Closes scylladb/scylladb#28490
Detached migration callbacks (on_create_view, on_update_view, on_drop_view)
can race with view_builder::drain() teardown.
Add a lifetime gate to view_builder and wire callback launches through
_ops_gate.hold() so each detached dispatch future is tracked until it
completes (finally keeps the hold alive). During shutdown, drain()
now waits for all tracked callback work with _ops_gate.close().
This ensures drain() does not complete while callback work is still in flight
during shutdown, and only gate_closed_exception is ignored at callback entry,
as the expected shutdown path.
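A hedged sketch of the gate wiring (names assumed around seastar::gate):

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/gate.hh>

struct view_builder {
    seastar::gate _ops_gate;

    seastar::future<> dispatch_update_view(); // the detached work

    void on_update_view() {
        // hold() throws gate_closed_exception once drain() closed the gate;
        // that is the expected shutdown path and the callback is skipped
        auto h = _ops_gate.hold();
        // keep the holder alive until the detached future resolves
        (void)dispatch_update_view().finally([h = std::move(h)] {});
    }

    seastar::future<> drain() {
        co_await _ops_gate.close(); // waits for all in-flight callbacks
    }
};
```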
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow when the arithmetic was done in a `uint64_t`, so the loop exited as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.
Switch to the `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.
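An illustrative sketch of the wrap hazard (not the actual init.cc code):

```cpp
#include <string>

void for_each_seed(const std::string& s) {
    size_t prev = 0;
    while (prev <= s.size()) {
        size_t pos = s.find(',', prev);
        // the token is s.substr(prev, pos - prev)
        if (pos == std::string::npos) {
            break; // explicit end check: immune to npos + 1 wrapping to 0
        }
        prev = pos + 1; // with 32-bit npos, this used to exit the loop "by luck"
    }
}
```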
The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.
Refs: https://github.com/scylladb/seastar/pull/3236
No backport: The problem will only appear in master after Seastar is upgraded. The old code works with Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident, because of different integer bit sizes).
Closes scylladb/scylladb#28573
* github.com:scylladb/scylladb:
init: fix infinite loop on npos wrap with updated Seastar
init: remove unnecessary object creation in emplace calls
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.
Extend the message timeout of tasks_vt_get_children from 10s to 60s.
Fixes: #28295.
Closes scylladb/scylladb#28599
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys
Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.
Fixes: PM-190
Closes scylladb/scylladb#28585
Right now the slowest test in the test/cqlpy directory is
cassandra_tests/validation/entities/collections_test.py::
testMapWithLargePartition
This test (translated from Cassandra's unit test) just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.
200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
in the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!
Making the characters written in this text constant instead of random
made it 20 times faster - it now takes less than half a second.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#28271
File streaming only releases the file descriptors of a tablet being
streamed at the very end of streaming. This means that if the streaming
tablet has a compaction on the largest tier finish after streaming
started, there will always be ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).
We want to optimize this by dropping the reference to an sstable after
it has been fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#28505
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.
Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.
We install its dependencies via the normal methods so they're still
checked.
[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_default
Closes scylladb/scylladb#28687
Today the S3 client has a well-established and well-tested (hopefully) http request retry strategy; in the rest of the clients it looks like we are trying to achieve the same by writing the same code over and over again, and of course missing corner cases that have already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients in detecting the retryability of an error originating from the http client, reuse the built-in seastar http client retryability, and minimize the boilerplate of http client exception handling.
No backport needed since it is only refactoring of the existing code
Closes scylladb/scylladb#28250
* github.com:scylladb/scylladb:
exceptions: add helper to build a chain of error handlers
http: extract error classification code
aws_error: extract `retryable` from aws_error
- Correct the `calc_part_size` function, since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits (a sketch of the sizing constraint follows)
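A hedged sketch of the sizing constraint (the constants follow S3's documented multipart limits; the function shape is an assumption, not the actual calc_part_size):

```cpp
#include <algorithm>
#include <cstdint>

uint64_t calc_part_size(uint64_t object_size) {
    constexpr uint64_t max_parts = 10'000;              // S3 multipart part limit
    constexpr uint64_t min_part  = 5ull * 1024 * 1024;  // S3 minimum part size
    // round up so object_size / part_size never exceeds max_parts
    uint64_t part = (object_size + max_parts - 1) / max_parts;
    return std::max(part, min_part);
}
```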
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
Closes scylladb/scylladb#28592
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before the decommission request was added.
Fix by waiting for the task to appear rather than the injection.
Fixes SCYLLADB-715
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids the overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
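A hedged sketch of the idea (a hypothetical helper, not the actual ScyllaDB deserializer): decode the big-endian wire floats straight into a std::vector<float>, skipping per-element data_value boxing:

```cpp
#include <bit>
#include <cstdint>
#include <span>
#include <vector>

std::vector<float> deserialize_floats(std::span<const uint8_t> buf) {
    std::vector<float> out(buf.size() / 4);
    for (size_t i = 0; i < out.size(); ++i) {
        const uint8_t* p = buf.data() + i * 4;
        uint32_t be = (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16)
                    | (uint32_t(p[2]) << 8) | uint32_t(p[3]);
        out[i] = std::bit_cast<float>(be); // big-endian wire -> host float
    }
    return out; // callers consume it as std::span<const float>
}
```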
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471
Closes scylladb/scylladb#28615
Fixes #28678
If the replenish loop exits the sleep condition with an empty queue
when "_shutdown" is already set, a waiter might get stuck, unsignalled,
waiting for segments even though we are exiting.
Simply move the queue abort so it is always done on loop exit.
Closes scylladb/scylladb#28679
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow when the arithmetic
was done in a `uint64_t`, so the loop exited as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.
Switch to the `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.
Refs: scylladb/seastar#3236
The cache is already covered by general auth
dtests, but some cases are trickier and easier
to express directly as calls to the cache class.
For such tests, a boost test file was added.
The LDAP server may change role-chain assignments without notifying
Scylla. As a result, effective permissions can change, so some form of
polling is required.
Currently, this is handled via cache expiration. However, the unified
cache is designed to be consistent and does not support expiration.
To provide an equivalent mechanism for LDAP, we will periodically
reload the permissions portion of the new cache at intervals matching
the previously configured expiration time.
We want to get rid of the loading cache because its periodic
refresh logic generates a lot of internal load when there
are many entries. Also, our operation procedures involve tweaking
the config, while the new unified cache is supposed to work out
of the box.
In the following commit we'll need to add some
cache-related logic (removing resource permissions).
This logic doesn't depend on the authorizer, so it should
be managed by the service itself.
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Added the .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates.
We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associate opens an issue in the scylladb.git repo by mistake, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira.
This is a new github action, and does not require any code backport.
Fixes: PM-64
Closes scylladb/scylladb#28212
What changed
Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml
Why (Requirements Summary)
Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR
When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue
When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue
Testing was performed in staging.git and the STAG Jira project.
Fixes: PM-177
Closes scylladb/scylladb#28575
Fixes #28398 Fixes #28399
When used as path elements in Google storage paths, the object names need to be URL encoded. Due to
a.) tests not really using prefixes including non-URL-valid chars (e.g. / etc.)
and
b.) the mock server used for most testing not enforcing this particular aspect,
this was missed.
Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show.
"Real" GCS also behaves a bit different when listing with pager, compared to mock;
The former will not give a pager token for last page, only penultimate.
Adds handling for this.
Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected.
Closes scylladb/scylladb#28400
* github.com:scylladb/scylladb:
utils/gcp/object_storage: URL-encode object names in URL:s
utils::gcp::object_storage: Fix list object pager end condition detection
The current manager flow has a flaw: it triggers pytest.fail when
it finds errors on teardown, regardless of whether the test has already
failed. This creates an additional record in the JUnit report with the
same name, and Jenkins will not be able to show the logs correctly.
To avoid this, this PR changes the logic slightly.
Now the manager checks whether the test failed, to avoid two failures
for the same test in the report.
If the test passed, the manager checks the cluster status and fails if
something is wrong with it. There is no need to check the cluster status
if the test failed.
If the test passed and the cluster status is OK, but there are unexpected
errors in the logs, the test fails as well. This check gathers all
information about the errors and potential stacktraces, and only fails
the test if it hasn't failed yet, to avoid a double entry in the report.
Closes scylladb/scylladb#28633
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.
With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.
We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
No backport, test update.
Closes scylladb/scylladb#28571
* github.com:scylladb/scylladb:
test: run test_different_group0_ids in all modes
test: make test_different_group0_ids work with the Raft-based topology
Currently, the load balancing simulator computes node, shard and tablet load based on tablet count.
This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count.
This is the last patch in the Size Based Load Balancing series:
- First part for tablet size collection via load_stats: scylladb/scylladb#26035
- Second part reconcile load_stats: scylladb/scylladb#26152
- The third part for load_sketch changes: scylladb/scylladb#26153
- The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254
- The fifth part changes the load balancing simulator: scylladb/scylladb#26438
This is a new feature and backport is not needed.
Closes scylladb/scylladb#26438
* github.com:scylladb/scylladb:
test, simulator: compute load based on tablet size instead of count
test, simulator: generate tablet sizes and update load_stats
test, simulator: postpone creation of load_stats_ptr
In b03d520aff ("cql3: introduce similarity functions syntax") we
added vector similarity functions to the grammar. The grammar had to
be modified because we wanted to support literals as vector similarity
function arguments, and the general function syntax in selectors
did not allow that.
In cc03f5c89d ("cql3: support literals and bind variables in
selectors") we extended the selector function call grammar to allow
literals as function arguments.
Here, we remove the special case for vector similarity functions as
the general case in function calls covers all the possibilities the
special case does.
As a side effect, the vector similarity function names are no longer
reserved.
Note: the grammar change fixes an inconsistency with how the vector
similarity functions were evaluated: typically, when a USE statement
is in effect, an unqualified function is first matched against functions
in the keyspace, and only if there is no match is the system keyspace
checked. But with the previous implementation vector similarity functions
ignored the USE keyspace and always matched only the system keyspace.
This small inconsistency doesn't matter in practice because user defined
functions are still experimental, and no one would name a UDF to conflict
with a system function, but it is still good to fix it.
Closes scylladb/scylladb#28481
Currently, if a test fails, pytest outputs only some basic information
about the failure. With this change, it will output the last 300 lines of the
boost/seastar test output.
Also capture the output of the failed tests in the JUnit report, so it
will be present in the report on Jenkins.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449
Closes scylladb/scylladb#28535
In ebda2fd4db ("test: cql_test_env: increase file descriptor limit"),
we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env
as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests
open 256 sstables, and with 2 files/sstable + resharding work it is understandable
that they overflow the 1024 limit.
No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches.
Closes scylladb/scylladb#28646
* github.com:scylladb/scylladb:
test: sstables::test_env: adjust file open limit
test: extract cql_test_env's adjust_rlimit() for reuse
* seastar f55dc7eb...d2953d2a (13):
> io_tester: Revive IO bandwidth configuration
> Merge 'io_tester: add vectorized I/O support' from Travis Downs
doc: add vectorized I/O options to io-tester.md
io_tester: add vectorized I/O support
> Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov
reactor: Drop sched group IDs bitmap
reactor: Allocate scheduling group on shard-0 first
reactor: Detach init_scheduling_group_specific_data()
reactor: Coroutinize create_scheduling_group()
> set_iterator: increase compatibility with C++ ranges
> test: fix race condition in test_connection_statistics
> Add Claude Code project instructions
> reactor: Unfriend pollable_fd via pollable_fd_state::make()
> Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń
apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job.
apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server.
> treewide: remove remnants of SEASTAR_MODULE
> test: Tune abort-accept test to use more readable async()
> build: support sccache as a compiler cache (#3205)
> posix-stack: Reuse parent class _reuseport from child
> Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg
tests: Add unit test for epoll busy spin bug
reactor_backend: Fix another busy spin bug in epoll
Closes scylladb/scylladb#28513
The previous implementation of the Scylla lifecycle brought flakiness to the test.
This change leaves lifecycle management up to PythonTest.run_ctx,
which implements more stability logic for setup/teardown.
Replace pexpect-driven GDB interaction with GDB batch mode:
- Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty()
may lead to deadlocks in the child.", which ultimately caused CI deadlocks.
- Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts.
- Produces cleaner, more direct assertions around command execution and output.
- Trade-off: batch mode adds ~10s per command per test,
but with --dist=worksteal this is ~10% overall runtime increase across the suite.
Closes scylladb/scylladb#28484
After PR https://github.com/scylladb/scylladb/pull/28396 reduced
the test volumes to 20MiB to speed up test_out_of_space_prevention.py,
keeping the original 0.8 critical disk utilization threshold can make
the tests flaky: transient disk usage (e.g. commitlog segment churn)
can push the node into ENOSPC during the run.
These tests do not write much data, so reduce the critical disk
utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB
of headroom for temporary growth during the test.
Fixes: https://github.com/scylladb/scylladb/issues/28463
Closes scylladb/scylladb#28593
test_maintenance_socket is flaky with the new way of running. It looks
like the driver tries to reconnect with an old maintenance socket from
the previous driver and fails. This PR adds a whitelist for connections,
which stabilizes the test.
test_no_removed_node_event_on_ip_change was flaky on CI, while the issue
never reproduced locally. The assumption is that under load we have a race
condition and check the logs before the message has arrived. A small
retry loop was added to avoid this situation.
Closes scylladb/scylladb#28635
The test can currently fail like this:
```
> await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
because node A performs tablet migrations at the same time.
What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.
The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.
Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.
The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.
Fixes #25938
Closes scylladb/scylladb#28632
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as the test already failed there (and also backport to 2026.1 to make everything consistent).
Closes scylladb/scylladb#28625
* github.com:scylladb/scylladb:
test: explicitly set compression algorithm in test_autoretrain_dict
test: remove unneeded semicolons from python test
Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail,
increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures.
Fixes https://github.com/scylladb/scylladb/issues/27745
Backport: no, it's not a bug fix
Closes scylladb/scylladb#28628
* github.com:scylladb/scylladb:
test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call
test: pylib: increase curl's number of retries when downloading scylla
test: pylib: improve error reporting in get_scylla_2025_1_executable
Previously, global_tablet_token_metadata_barrier() could proceed with
fencing even if some nodes did not acknowledge the barrier_and_drain.
This could cause problems:
* In scylladb/scylladb#26864, replica locks did not provide mutual
exclusion, because “fenced out” requests from old topology versions
could run in parallel with requests using newer versions.
* In scylladb/scylladb#26375, the barrier could succeed even though we
did not wait for closed sessions to become unused. This could leave
aborted repair or streaming tasks running concurrently after a tablet
transition was aborted, and thus running concurrently with the next
transition.
In this commit we add a parameter drain_all_nodes: bool to
the global_token_metadata_barrier function. If this parameter is set,
the barrier waits for all nodes to acknowledge the barrier_and_drain
round of RPCs. If any of the nodes are not accessible or throw an error,
such errors are rethrown to the caller. We set this parameter only in
global_tablet_token_metadata_barrier since for topology migrations
the old behavior should be preserved. In case of errors, the tablet
migration is blocked until the problem goes away by itself or the
problematic node is added to the ignore_nodes list.
The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is
removed: with tablets, we now drain all nodes, so after a successful
barrier_and_drain round there can be no coordinators with an old
topology version. The fence_token check after executing a request on
a replica is therefore unnecessary for tablets, but still required for
vnodes, where topology changes do not wait for all nodes.
Topology fencing is covered by test_fence_lwt_during_bootstrap.
Fixes scylladb/scylladb#26864 Fixes scylladb/scylladb#26375
Add explicit erm-holding variables in all replica-side RPC handlers.
This is required to ensure that tablet migration waits for in-flight
replica requests even if a non-replica coordinator has been fenced out.
Holding erms on the replica side may increase the global-barrier wait
time, since the barrier must drain these requests. We believe this
is acceptable because:
* We already hold erms during replica-side request execution, but in
an ad-hoc, non-systemic way in lower layers of storage_proxy
(e.g. in sp::mutate_locally and do_query_tablets).
* Replica requests are bounded by replica-side timeouts, so the
global-barrier wait time cannot exceed the maximum of these timeouts.
For Paxos verbs, we use token_metadata_guard, which wraps the ERM and
automatically refreshes it when tablet migration does not affect the
current token; see the token_metadata_guard comments for details.
We use this guard only for Paxos verbs because regular reads and writes
already hold raw erms in storage_proxy and on the coordinators.
The erms must be held in all RPC handlers that support fencing — that
is, those with a fencing_token parameter in storage_proxy.idl.
Counter updates already hold erms in
mutate_counter_on_leader_and_replicate.
Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets
barrier now waits for all nodes. As a result, the replica read
is expected to finish, rather than fail due to the tablet having
moved as it did previously. The test is renamed to
test_tablets_barrier_waits_for_replica_erms to better reflect its
purpose.
Refs scylladb/scylladb#26864
Before waiting on stale_versions_in_use(), we log the stale versions
the barrier_and_drain handler will wait for, along with the number of
token_metadata references representing each version.
To achieve this, we store a pointer to token_metadata in
version_tracker, traverse the _trackers list, and output all items
with a version smaller than the latest. Since token_metadata
contains the version_tracker instance, it is guaranteed to remain
alive during traversal. To count references, token_metadata now
inherits from enable_lw_shared_from_this.
This helps diagnose tablet migration stalls and allows more
deterministic tests: when a barrier is expected to block, we can
verify that the log contains the expected stale versions rather
than checking that the barrier_and_drain is blocked on
stale_versions_in_use() for a fixed amount of time.
One of the tests checks that the number of partition keys is more than 2,
but the method that creates the table can return one with fewer keys.
This leads to flakiness; to avoid it, this PR ensures that the table has
at least 3 partition keys.
Closes scylladb/scylladb#28636
on_update_view() currently runs its serialized logic inline via with_semaphore()
from a detached callback path, while create/drop already use dedicated async
dispatchers.
Refactor update handling to follow the same pattern:
- add dispatch_update_view(sstring ks_name, sstring view_name)
- move update logic into that coroutine
- acquire the existing view-builder lock via get_or_adopt_view_builder_lock()
- keep existing behavior for missing base/view state
- keep background invocation semantics from on_update_view()
This aligns the update/create/drop flows, keeps async lifecycle handling consistent, and is a first step toward fixing the shutdown issue.
The twcs compaction tests open more than 1024 files (not
so good), and will fail in a user session with the default
soft limit (1024).
Attempt to raise the limit so the tests pass. On a modern
systemd installation the hard limit is >500,000, so this
will work.
There's no problem in dbuild since it raises the file limit
globally.
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.
The PR introduces two changes:
* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing the connect timeout.
Fixes: #28012
Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.
Closes scylladb/scylladb#28617
* github.com:scylladb/scylladb:
vector_search: Fix missing timeout on TLS handshake
vector_search: test: Fix flaky cert rewrite test
CI currently fails in release and debug modes if the PR only changes
a test run only in dev mode. There is no reason to wait for the CI fix,
as there is no reason to run this test only in dev mode in the first
place. The test is very fast.
The test was marked with xfail in #28383, as it needed to be updated to
work with the Raft-based topology. We are doing that in this patch.
With the Raft-based topology, there is no reason to check that nodes with
different group0 IDs cannot merge their topology/token_metadata. That is
clearly impossible, as doing any topology change requires being in the
same group0. So, the original regression test doesn't make sense.
We can still test that nodes with different group0 IDs cannot gossip with
each other, so we keep the test. It's very fast anyway.
It's difficult to say whether our download backend always returns a
transient error correctly so that curl could retry. Instead, it's
more robust to always retry on error.
By default curl does exponential backoff, and we want to keep that,
but its time cap is 10 minutes, so with 40 retries we'd wait a
long time; instead we set the cap to 60 seconds.
Total waiting time (excluding receiving request time):
before - 17m
after - 35m
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: scylladb/scylladb#28204
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
This patch changes the load balancing simulator so that it computes
table load based on tablet sizes instead of tablet count.
best_shard_overcommit measured the minimal allowed overcommit in cases
where the number of tablets cannot be evenly distributed across
all the available shards. This is still the case, but instead of
computing it as an integer div_ceil() of the average shard load,
it is now computed by allocating the tablet sizes using the
largest-tablet-first method. From these, we can get the lowest
overcommit for the given set of nodes, shards and tablet sizes.
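For illustration, a self-contained sketch of a largest-tablet-first allocation of this kind (names and structure are illustrative, not the simulator's actual code):
```
#include <vector>
#include <queue>
#include <algorithm>
#include <functional>
#include <cstdint>

// Place each tablet, largest first, on the currently least-loaded shard,
// then report the overcommit of the most loaded shard vs. the average.
// Assumes shards > 0.
double best_shard_overcommit(std::vector<uint64_t> tablet_sizes, unsigned shards) {
    std::sort(tablet_sizes.begin(), tablet_sizes.end(), std::greater<>());
    // Min-heap of per-shard loads.
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<>> load;
    for (unsigned i = 0; i < shards; ++i) {
        load.push(0);
    }
    uint64_t total = 0, max_load = 0;
    for (auto size : tablet_sizes) {
        auto l = load.top();
        load.pop();
        l += size;                          // greedy: least-loaded shard first
        max_load = std::max(max_load, l);
        load.push(l);
        total += size;
    }
    double avg = double(total) / shards;
    return avg > 0 ? double(max_load) / avg : 1.0;
}
```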
This change adds a random tablet size generator. The tablet sizes are
created in load_stats.
Further changes to the load balance simulator:
- apply_plan() updates the load_stats after a migration plan is issued by the
load balancer,
- adds a command line option which controls the tablet size deviation factor.
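A minimal sketch of such a generator, assuming a normal distribution around the target size scaled by the deviation factor (all names here are illustrative):
```
#include <random>
#include <algorithm>
#include <cstdint>

// Illustrative generator: sizes are normally distributed with a mean of
// the target tablet size and a standard deviation controlled by the
// configurable deviation factor.
uint64_t random_tablet_size(std::mt19937_64& rng,
                            uint64_t target_tablet_size,
                            double deviation_factor) {
    std::normal_distribution<double> dist(
        double(target_tablet_size),
        deviation_factor * double(target_tablet_size));
    double size = dist(rng);
    // Clamp the tail of the distribution to avoid zero/negative sizes.
    return uint64_t(std::max(size, 1.0));
}
```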
With size based load balancing, we will have to move the tablet size in
load_stats after each internode migration issued by balance_tablets().
This will be done in a subsequent commit in apply_plan() which is
called from rebalance_tablets().
Currently, rebalance_tablets() is passed a load_stats_ptr which is
defined as:
using load_stats_ptr = lw_shared_ptr<const load_stats>;
Because this is a pointer to const, apply_plan() can't modify it.
So, we pass a reference to load_stats to rebalance_tablets() and create
a load_stats_ptr from it for each call to balance_tablets().
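A self-contained sketch of the described ownership change (all types are stand-ins for the real ones; only the `load_stats_ptr` alias comes from the commit message):
```
#include <memory>
#include <vector>

// Stand-in types for illustration only.
struct load_stats {
    std::vector<long> tablet_sizes;
};
using load_stats_ptr = std::shared_ptr<const load_stats>;

struct migration_plan { /* ... */ };

migration_plan balance_tablets(load_stats_ptr stats) {
    // Reads the immutable snapshot to decide on migrations.
    return {};
}

void apply_plan(load_stats& stats, const migration_plan&) {
    // Moves tablet sizes between nodes/shards in the live stats.
}

void rebalance_tablets(load_stats& stats) {
    for (int round = 0; round < 3; ++round) {      // a few balancing rounds
        auto snapshot = std::make_shared<const load_stats>(stats);
        auto plan = balance_tablets(snapshot);     // const view, can't mutate
        apply_plan(stats, plan);                   // update the live stats
    }
}
```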
test_remove_node_violating_rf_rack_with_rack_list creates a cluster
with four nodes. One of the nodes is excluded, then another one is
stopped, excluded, and removed. If the two stopped nodes were both
voters, the majority is lost and the cluster loses its raft leader.
As a result, the node cannot be removed and the operation times out.
Add the 5th node to the cluster. This way the majority is always up.
Fixes: https://github.com/scylladb/scylladb/issues/28596.
Closes scylladb/scylladb#28610
The test creates a single node cluster, then creates 3 tables which
remain empty. Then it adds another node with half the disk capacity of
the first one, and then it waits for the balancer to migrate tablets to
the newly added node by calling the quiesce topology API. The number of
tablets on the smaller node should be exactly half the number of tablets
on the larger node.
After waiting for quiesce topology, we could have a situation where we
query the number of tablets from the node which still hasn't processed
the last tablet migrations and updated system.tablets.
This patch adds a read barrier so that both nodes see the same tablets
metadata before we query the number of tablets.
Fixes: SCYLLADB-603
Closes scylladb/scylladb#28598
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.
This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.
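A rough sketch of the shape of this fix; the seastar signatures below are approximations, not code from the patch:
```
#include <seastar/core/future.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/with_timeout.hh>
#include <seastar/core/lowres_clock.hh>
#include <seastar/net/tls.hh>

// Sketch only: tls::connect() returns before the handshake runs, so we
// force the handshake while still under the same deadline that guards
// the connect. Signatures are approximated.
seastar::future<seastar::connected_socket> connect_checked(
        seastar::shared_ptr<seastar::tls::certificate_credentials> creds,
        seastar::socket_address addr,
        seastar::lowres_clock::time_point deadline) {
    auto sock = co_await seastar::with_timeout(
        deadline, seastar::tls::connect(creds, addr));
    // Without this, the handshake would only start on the first
    // read/write, bypassing the connect timeout.
    co_await seastar::with_timeout(
        deadline, seastar::tls::check_session_is_resumed(sock));
    co_return sock;
}
```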
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.
This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.
Fixes: #28012
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
---
Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.
Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.
---
The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):
Before:
CPU utilization: 0.9%
real 2m7.811s
user 0m25.446s
sys 0m16.733s
After:
CPU utilization: 1.1%
real 1m40.288s
user 0m25.218s
sys 0m16.566s
---
Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203
Backport: This improves the stability of our CI, so let's
backport it to all supported versions.
Closes scylladb/scylladb#28602
* github.com:scylladb/scylladb:
test: cluster: Reduce wait time in test_sync_point
test: cluster: Fix test_sync_point
test: cluster: Await sync points asynchronously
test: cluster: Create sync points asynchronously
test: cluster: Fetch hint metrics asynchronously
ANN vector queries with all-zero vectors are allowed even on vector
indexes with the similarity function set to cosine.
When enabling the rescoring option, those queries would fail, as the rescoring
calls the `similarity_cosine` function underneath, causing an `InvalidRequest` exception
because all-zero vectors were not allowed, matching Cassandra's behaviour.
To eliminate the discrepancy, we want the all-zero vector `similarity_cosine` calls to pass
but return NaN, as the cosine similarity for zero vectors is mathematically undefined.
We decided not to use arbitrary values, contrary to USearch, for which the distance
(not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1, while
supporting the range of values [0, 2].
If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5,
which has no mathematical justification for being more similar than,
for example, vectors at obtuse angles.
It's safe to assume that all-zero vectors shouldn't make any impact on cosine similarity,
therefore we return NaN and eliminate them from the best results.
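For reference, the cosine similarity formula makes the problem explicit: with an all-zero vector the denominator vanishes, so the expression is undefined and NaN is the faithful answer:
```
\operatorname{sim}_{\cos}(\vec{u}, \vec{v})
  = \frac{\vec{u} \cdot \vec{v}}{\lVert \vec{u} \rVert \, \lVert \vec{v} \rVert},
\qquad
\lVert \vec{0} \rVert = 0
  \;\Rightarrow\;
\operatorname{sim}_{\cos}(\vec{0}, \vec{v}) = \tfrac{0}{0}
  \;\text{(undefined)}
```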
Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour.
Fixes: SCYLLADB-456
Commit ea8a661119 tried to reduce the dataset for restoration tests.
While doing so, it effectively disabled part of itself -- the checks for
streaming directions were never run after this change. The thing is that
this check only runs if the restored tablet count matches a hardcoded
value of 512. This was the real dataset size of the test before the
aforementioned commit, but afterwards it changed to other values, and
the comparison with 512 became always False.
Fix it with a local variable to prevent such mistakes in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This method does two things -- it checks that the data is indeed back, and
validates streaming directions. The latter is not quite about "data is
back", so it is better to have it as an explicit, dedicated method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203
To create a keyspace, there's the new_test_keyspace helper.
The table is created with a single cql.run_async with an explicit schema.
The dataset is populated with a single parallel INSERT as well.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is a preparational patch. The next one will need to replace
foo()
bar()
with
with something() as s:
foo()
bar()
Effectively -- it only adds the `with something()` line. To avoid shifting
the whole file right together with that future change, do it here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.
This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
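A minimal sketch of such a selection, with all arithmetic kept in bytes (the helper name and constants are illustrative, not the actual code):
```
#include <algorithm>
#include <cstdint>
#include <cassert>

constexpr uint64_t MiB = 1024 * 1024;
constexpr uint64_t preferred_part_size = 50 * MiB;  // best performance
constexpr uint64_t max_parts = 10'000;              // AWS multipart limit

uint64_t choose_part_size(uint64_t total_bytes) {
    // The smallest part size (in bytes) that keeps us within max_parts.
    uint64_t min_part = (total_bytes + max_parts - 1) / max_parts;
    // Prefer 50 MiB; grow only when the upload is too large for it.
    uint64_t part = std::max(preferred_part_size, min_part);
    // Align up to a whole MiB, still in bytes -- no bytes/MiB mixing.
    part = (part + MiB - 1) / MiB * MiB;
    assert((total_bytes + part - 1) / part <= max_parts);
    return part;
}
```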
Generalize error handling by creating an exception dispatcher which allows writing error handlers by sequentially applying handlers, the same way one would write `catch ()` blocks.
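A minimal self-contained sketch of the idea (hypothetical names, not the actual dispatcher):
```
#include <exception>
#include <functional>
#include <vector>
#include <new>

using error_handler = std::function<bool(std::exception_ptr)>;

// Try handlers in order, like a sequence of catch () blocks.
bool dispatch(std::exception_ptr ep, const std::vector<error_handler>& handlers) {
    for (const auto& h : handlers) {
        if (h(ep)) {
            return true;   // handled -- stop, like the first matching catch
        }
    }
    return false;          // nothing matched; the caller may rethrow
}

// Example handler: recognizes std::bad_alloc only.
bool handle_bad_alloc(std::exception_ptr ep) {
    try {
        std::rethrow_exception(ep);
    } catch (const std::bad_alloc&) {
        return true;       // e.g. translate into an "out of memory" reply
    } catch (...) {
        return false;      // not ours -- let the next handler look at it
    }
}
```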
Similarly to the previous patch, the handler can stream the map of build
statuses. Unlike the previous patch, it doesn't need to fmt::format() the key
and value, as these are strings already.
It could be a map_to_json<string, string> partial specialization, but
there's so far only one caller, so it's probably not worth it yet.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Like the get_token_endpoint one, which streams the map that it got from the
storage service, the get_ownership and get_effective_ownership handlers can do the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The get_token_endpoint handler converts an iterator over std::map into
the generated maplist_mapper type. The next patch will do the same for more
handlers, so it's good to have a helper converter for it.
As a nice side effect, it's possible to avoid a multiline lambda argument
to stream_range_as_array().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.
There is a design decision not to introduce an additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. The last CI runs made the
validator test flaky, so it is time to remove all remaining validator tests.
It needs a backport to 2026.1 to remove the remaining validator tests from there.
Fixes: VECTOR-497
Closes scylladb/scylladb#28568
When running a gdb command, we check that the string 'Error'
does not appear within the output. However, if the command output
includes the string 'Error' as part of its normal operation, this
generates a false positive. In fact, the task_histogram output can include
the string 'error::Error' from the Rust core::error module.
Allow for that and only match 'Error' that isn't 'error::Error'.
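The actual check lives in the Python test harness; as an illustration only, the matching rule could look like this:
```
#include <string>

// Report a failure for "Error" in the gdb output, but ignore Rust's
// "error::Error" from the core::error module.
bool output_has_error(const std::string& out) {
    const std::string needle = "Error";
    const std::string allowed_prefix = "error::";
    for (size_t pos = out.find(needle); pos != std::string::npos;
         pos = out.find(needle, pos + 1)) {
        if (pos >= allowed_prefix.size() &&
            out.compare(pos - allowed_prefix.size(), allowed_prefix.size(),
                        allowed_prefix) == 0) {
            continue;   // "error::Error" -- expected, not a failure
        }
        return true;    // a genuine Error marker
    }
    return false;
}
```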
Fixes #28516.
Closes scylladb/scylladb#28574
The difference is very tiny:
@@ -1,12 +1,12 @@
@pytest.mark.asyncio
async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage):
''' comment '''
- topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1)
- scope = "rack"
+ topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2)
+ scope = "dc"
ks = 'ks'
cf = 'cf'
@@ -42,7 +42,7 @@ async def test_restore_primary_replica_s
for r in res:
nodes_by_operation[r[1].group(1)].append(r[1].group(2))
- scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ])
+ scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ])
for op, nodes in nodes_by_operation.items():
logger.info(f'Operation {op} streamed to nodes {nodes}')
assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica"
The (removed in the above example) test description comments differ only
in their usage of "rack" and "dc" words.
Squashing them into one parametrized test makes perfect sense.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Tablet migration keeps an sstable snapshot during streaming, which may
cause a temporary increase in disk utilization if compaction is running
concurrently. SStables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk usage can rise. When the target tablet size is
configured correctly, every tablet should own about 1% of disk
space, so a concurrency of 4 shouldn't put us at risk. But the target
tablet size is not chosen dynamically yet, and it may not be aligned
with disk capacity.
Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.
To reduce the impact of this, reduce the concurrency of migration.
A concurrency of 2 should still be enough to saturate resources on the
leaving shard.
Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.
Refs scylladb/siren#15317
Different transitions have different weights, and limits are
configurable. We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.
For example, repair uses weight 2 for read concurrency. Migrating
co-located tablets scales the cost by the number of co-located
tablets.
The test `test_size_based_load_balancing.py::test_balance_empty_tablets`
waits for tablet load stats to be refreshed and uses the
`short_tablet_stats_refresh_interval` injection to speed up the refresh
interval.
This injection has no effect; it was replaced by the
`tablet_load_stats_refresh_interval_in_seconds` config option (patch: 1d6808aec4),
so the test currently waits for 60 seconds (default refresh interval).
Use the config option. This reduces the execution time to ~8 seconds.
Fixes SCYLLADB-556.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#28536
This field is only used to initialize the following _memtable_controller
one. It's simpler just to do that initialization with whatever value the
field itself is initialized with, and drop the field.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28539
This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator.
These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker.
Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice).
Backport not needed, xpass doesn't bother anyone.
Closes scylladb/scylladb#28441
* github.com:scylladb/scylladb:
test/cqlpy: remove xfail from tests for fixed issue 7972
test/cqlpy: remove xfail from tests for fixed issue 10358
test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation
test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only
The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used.
* The first patch reproduces this specific bug in Alternator Streams.
* The second patch adds missing (Fixes #28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write.
Closes scylladb/scylladb#28528
* github.com:scylladb/scylladb:
test/alternator: add test for BatchWriteItem with different write isolations
test/alternator: reproducer for Alternator Streams bug
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with the vector store was tested manually; integration tests will be added in the vector-store repository in a follow-up PR.
Fixes: SCYLLADB-522
Closes scylladb/scylladb#28519
Alternator's various write operations have different code paths for the
different write isolation modes. Because most of the test suite runs in
only a single write mode (currently - only_rmw_uses_lwt), we already
introduced a test file test/alternator/test_write_isolation.py for
checking the different write operations in *all* four write isolation
modes.
But we missed testing one write operation - BatchWriteItem. This
operation isn't very "interesting" because it doesn't support *any*
read-modify-write option (it doesn't support UpdateExpression,
ConditionExpression or ReturnValues), but even without those, the
pure write code still has different code paths with and without LWT,
and should be tested. So we add the missing test here - and it passes.
In issue #28439 we discovered a bug that can be seen in Alternator
Streams in the case of BatchWriteItem with multiple writes to the
same partition and always_use_lwt mode. The fact that the test added
here passes shows that the bug is NOT in BatchWriteItem itself, which
works correctly in this case - but only in the Alternator Streams layer.
Fixes #28171
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a reproducer for an Alternator Streams bug described in
issue #28439, where the stream returns the wrong events (and fewer of
them) in the following specific combination of circumstances:
1. A BatchWriteItem operation writing multiple items to the *same*
partition.
2. The "always_use_lwt" write isolation mode is used. (the bug doesn't
occur in other write isolation modes).
We didn't catch this bug earlier because the Alternator Streams test
we had for BatchWriteItem had multiple items in multiple partitions,
and we missed the multiple-items-in-one-partition case. Moreover,
today we run all the tests in only_rmw_uses_lwt mode (in the past,
we did use always_use_lwt, but changed recently in commit e7257b1393
following commit 76a766c that changed test.py).
As issue #28439 explains, the underlying cause of the bug is that the
always_use_lwt causes the multiple items to be written with the same
timestamp, which confused the Alternator Streams code reading the CDC
log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but
just in the Alternator Streams layer.
The test in this patch is parameterized to run on each of the four
write isolation modes, and currently fails (and so marked xfail) just
for the one mode 'always_use_lwt'. The test is scylla_only, as its
purpose is to check the different write isolation modes - which don't
exist in AWS DynamoDB.
Refs #28439
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Improve event printing when a test in test_streams.py fails.
The new code will print both expected and received events (keys, previous
image, new image and type).
It will explicitly mark at which output event the comparison failed.
Fixes #28455
Closes scylladb/scylladb#28476
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Improvement of the tests running time, so no backport. The fix of
`test_tablets_parallel_decommission` may have to be backported to
2026.1, but it can be done manually.
Closes scylladb/scylladb#28464
* github.com:scylladb/scylladb:
test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0
test: test_tablets_parallel_decommission: prevent group0 majority loss
test: delete test_service_levels_work_during_recovery
The handler appeared back in c9e710dca3. In this commit it performed the
"core" part of the task -- the do_build_range() method -- inside the
streaming sched group. The setup code was seemingly copied from the
view_builder::do_build_step() method and got the explicit switch of the
scheduling group.
The switch looks both justified and not. On one hand, it makes it
explicit that the activity runs in the streaming scheduling group. On the
other hand, the verb already runs on RPC index 1, which is negotiated to
run in the streaming group anyway. On the "third hand", even though it is
explicit, the switch happens too late, as there are a lot of other
activities performed by the handler that seem to also belong to the
same scheduling group, but which are not switched into explicitly.
By and large, it seems better to avoid the explicit switch and rely on
the RPC-level negotiation-based sched group switching.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28397
This is the continuation of #28363, this time about getting the gossiper scheduling group via database.
Several places that do it already have the gossiper at hand and should rather get the group from it.
Eventually, this will allow getting rid of database::get_gossip_scheduling_group().
Refining inter-component APIs, not backporting.
Closes scylladb/scylladb#28412
* github.com:scylladb/scylladb:
gossiper: Export its scheduling group for those who need it
migration_manager: Reorder members
Recently we had a question whether key columns can have any supported
type. I knew that they actually can't - key columns can have only
the types S(tring), B(inary) or N(umber), and that is all. But it turns
out we never had a test that confirms this understanding is true.
We did have a test for it for GSI key types already,
test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the
base table. So in this patch we add this missing test, and confirm that,
indeed, both DynamoDB and Alternator refuse a key attribute with any
type other than S, B or N.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#28479
The current way checks for failures only during the test phase, and it
misses the cases when a failure happens in another phase. This PR
eliminates this, so every phase will have a modified node reporter to
enrich the JUnit XML report with the custom attribute function_path.
Closes scylladb/scylladb#28462
The current way always assumes that the error happened in the test file,
but that is not always true. This PR shows the error from the boost
logger where the error actually happened.
Closes scylladb/scylladb#28429
1. fmt::localtime is deprecated.
2. We should really print times in UTC, especially on the cloud.
3. The current log message does not print the timezone, so it'd be unclear
to anyone reading the log message whether the expiration time is in the
local timezone or in GMT/UTC.
Fixes the following warning:
```
gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations]
2428 | endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(),
| ^
/usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here
538 | FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm {
| ^
/usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED'
207 | # define FMT_DEPRECATED [[deprecated]]
| ^
```
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#28434
Some storage_service rpc verbs may check that a handler is executed
inside the gossiper scheduling group. For that, the expected group is
grabbed from the database.
This patch puts the gossiper sched group into the debug namespace and makes
this check use it from there. It removes one more place that uses the
database as a config provider.
Refs #28410
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28427
The test test_to_json_double used to fail due to #7972, but this issue
was already fixed in Scylla 5.1 and we didn't notice.
So remove the xfail marker from this test, and also update another test
which still xfails but no longer due to this issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The tests testWithUnsetValues and testFilteringWithoutIndices used to fail
due to #10358, but this issue was already fixed three years ago, when the
UNSET-checking code was cleaned up, and the test is now passing.
So remove the xfail marker from these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The test testInvalidNonFrozenUDTRelation used to fail due to #10632
(an incorrectly-printed column name in an error message) and was marked
"xfail". But this issue has already been fixed two years ago, and
the test is now passing. So remove the xfail marker.
The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only
tests whether the Alternator metrics report the exactly-DynamoDB-compatible
WCU number. It is parameterized with two cases - one that uses
alternator_force_read_before_write and one which doesn't.
The case that uses alternator_force_read_before_write is expected to
measure the "accurate" WCU, and currently it doesn't, so the test
rightly xfails.
But the case that doesn't use alternator_force_read_before_write is not
expected to measure the "accurate" WCU and has a different expectation,
so this case actually passes. But because the entire test is marked
xfail, it is reported as "XPASS" - unexpected pass.
Fix this by marking only the "True" case with xfail, while the "False"
case is not marked. After this patch, the True case continues to XFAIL
and the False case passes normally, instead of XPASS.
Also removed a sentence promising that the failing case will be solved
"by the next PR". Clearly this didn't happen. Maybe we even have such
a PR open (?), but it won't be "the next PR" even if merged today.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Paxos state tables are internal tables fully managed by Scylla;
they shouldn't be exposed to the user, nor should they be backed up.
This commit hides these kinds of tables from all listings, and if such a table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.
Fixes https://github.com/scylladb/scylladb/issues/28183
LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.
Closes scylladb/scylladb#28230
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
cql3/statements/describe_statement: hide paxos state tables
Copilot found these typos in comments and variable name in alternator/,
so might as well fix them.
There are no functional changes in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#28447
This patch series copies `guardrails_test.py` from scylla-dtest, fixes it, and enables it.
The motivation is to unify the execution of guardrails tests, as some tests (`cqlpy/test_guardrail_...`) were already in the scylladb repo, and some were in `scylla-dtest`.
Fixes: SCYLLADB-255
No backport, just test migration
Closes scylladb/scylladb#28454
* github.com:scylladb/scylladb:
test: refactor test_all_rf_limits in guardrails_test.py
test: specify exceptions being caught in guardrails_test.py
test: enable guardrails_test.py
test: add wait_other_notice to test_default_rf in guardrails_test.py
test: copy guardrails_test.py from scylla-dtest
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954
Closes scylladb/scylladb#28444
Modern toxiproxy interprets `-h` as help and requires the subcommand
subject (e.g. the proxy name) to be after the subcommand switches.
Arrange the command line in the way it likes, and spell out the
subcommands to be more comprehensible.
Closes scylladb/scylladb#28442
Related PR: https://github.com/scylladb/scylladb/pull/27527
This PR changes test.py's logic of parsing boost test cases to use `-- --list_json_content`
and passes boost labels as pytest markers.
Using `-- --list_json_content` is not ideal and currently requires implementing several [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but having the ability to support boost labels in pytest is worth it, because now we can apply the tiering mechanism to the boost tests as well.
Fixes SCYLLADB-246
Closes scylladb/scylladb#28232
* github.com:scylladb/scylladb:
test: add nightly label
test.py: support boost labels in test.py
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. When tablets are present, a node might still technically be a replica of some tablets after it moved to the left state. When it is no longer a replica of any tablet, it becomes "released" and the storage service generates a notification about it. Hinted handoff listens for this notification and kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger every time the raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after the verbosity of the draining-related log messages was increased in 44de563. The verbosity increase itself makes sense, as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.
Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031
The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).
Closes scylladb/scylladb#28367
* github.com:scylladb/scylladb:
storage_service: fix indentation after previous patch
raft topology: generate notification about released nodes only once
raft topology: extract "released" nodes calculation to external function
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`.
We also extend `test_maintenance_mode` to provide a reproducer for this bug.
Fixes #27988
This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.
Closes scylladb/scylladb#28322
* github.com:scylladb/scylladb:
test: test_maintenance_mode: enable maintenance mode properly
test: test_maintenance_mode: shutdown cluster connections
test: test_maintenance_mode: run with different keyspace options
test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
test: test_maintenance_mode: get rid of the conditional skip
test: test_maintenance_mode: remove the redundant value from the query result
storage_proxy: skip validate_read_replica in maintenance mode
storage_service: set up topology properly in maintenance mode
Before this commit, `test_all_rf_limits` was implemented in a
repetitive manner, making it harder to understand how the guardrails
were tested. This commit refactors the test to reduce code redundancy
and verify the guardrails more explicitly.
Before this commit, the test caught a broad `Exception`. This change
specifies the expected exceptions to avoid a situation where the product
or test is broken and it goes undetected.
The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from
```
co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms));
```
as the default value of `shutdown_announce_in_ms` is 2000. This sleep
makes every `server_stop_gracefully` call 2s slower. There are ~300 such
calls in cluster tests (note that some come from `rolling_restart`). So,
it looks like this sleep makes cluster tests 300 * 2s = 10min slower.
Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min
on the potwor machine (the one in the Warsaw office) without it.
We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them
faster.
The sleep is completely unnecessary in tests. Removing it could introduce
flakiness, but if that's the case, then the test for which it happens is
incorrect in the first place. Tests shouldn't assume that all nodes
receive and handle the shutdown message in 2s. They should use functions
like `server_not_sees_other_server` instead, which are faster and more
reliable.
Both of the changed test cases stop two out of four nodes when there are
three group0 voters in the cluster. If one of the two live nodes is
a non-voter (node 1, specifically, as node 0 is the leader), a temporary
majority loss occurs, which can cause the following operations to fail.
In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node`
API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`,
removenode can fail with an unexpected error message:
```
"service::raft_operation_timeout_error (group
[46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)"
```
Somehow, these test cases are currently not flaky, but they become flaky in
the following commit.
We can consider backporting this commit to 2026.1 to prevent flakiness.
The test becomes flaky in one of the following commits. However, there is
no need to fix it, as we should delete it anyway. We are in the process of
removing the gossip-based topology from the code base, which includes the
recovery mode. We don't have to rewrite the test to use the new Raft-based
recovery procedure, as there is nothing interesting to test (no regression
to legacy service levels).
add nightly label for the test
test_foreign_reader_as_mutation_source
as an example of using boost labels as pytest markers.
Command to test:
./tools/toolchain/dbuild pytest --test-py-init --collect-only -q -m=nightly test/boost
output:
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1
boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1
The patch marks force-gossip-topology-changes as deprecated and removes
tests that use it. There is one test (test_different_group0_ids) which
is marked as xfail instead, since it looks like gossiper mode was used
there as a way to easily achieve a certain state, so more investigation
is needed into whether the tests can be fixed to use raft mode instead.
Closes scylladb/scylladb#28383
This reverts commit 7bf7ff785a. The commit
tried to add clean shutdown to `scylla perf` paths, but forgot at least
`scylla perf-alternator --workload wr` which now crashes on uninitialized
`c.as`.
Fixes #28473
Closes scylladb/scylladb#28478
Add support for literals in the SELECT clause. This allows
SELECT fn(column, 4) or SELECT fn(column, ?).
Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still
not accepted because of failed type inference - we cannot infer the
type of 7, and don't have a favored type for literals (like C favors
int). We might relax this later.
In the WHERE clause (and, in Cassandra, also in the SELECT clause), type
hints can also resolve type ambiguity: (bigint)7 or (text)?. But this is
deferred to a later patch.
A few changes to the grammar are needed on top of adding a `value`
alternative to `unaliasedSelector`:
- vectorSimilarityArg gained access to `value` via `unaliasedSelector`,
so it loses that alternate to avoid ambiguity. We may drop
`vectorSimilarityArg` later.
- COUNT(1) became ambiguous via the general function path (since
function arguments can now be literals), so we remove this case
from the COUNT special cases, remaining with count(*).
- SELECT JSON and SELECT DISTINCT became "ambiguous enough" for
ANTLR to complain, though as far as I can tell `value` does not
add real ambiguity. The solution is to commit early (via "=>") to
a parsing path.
Due to the loss of count(1) recognition in the parser, we have to
special-case it in prepare. We may relax it to count any expression
later, like modern Cassandra and SQL.
Testing is awkward because of the type inference problem at the top level.
We test via the set_intersection() function and via lua functions.
Example:
```
cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE LUA AS 'return a + b';
cqlsh> SELECT ks.sum(1, 2) FROM system.local;
ks.sum(1, 2)
--------------
3
(1 rows)
cqlsh>
```
(There are no suitable system functions!)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296
Closes scylladb/scylladb#28256
`system.cluster_status` is missing the rack info compared to `nodetool status`,
which is supposed to be equivalent. This was probably an omission.
Closes scylladb/scylladb#28457
The handler of raft_topology_cmd::command::stream_ranges switches to the
streaming scheduling group to perform data streaming in it. It grabs the
group from the database's db_config, which is not great. There's a streaming
manager at hand in the storage service handlers; since the handler uses its
functionality, it should use _its_ scheduling group.
This will help split the streaming scheduling group into more
fine-grained groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28363
Before this change, we could observe two identical
"starting auth service" messages in the log:
one from checkpoint(), the other from notify().
We remove the second one to stay consistent
with other services.
Closes scylladb/scylladb#28349
Adds --json-result option to perf-cql-raw and perf-alternator, the same as perf-simple-query has.
It is useful for automating test runs.
Related: https://scylladb.atlassian.net/browse/SCYLLADB-434
Backport: no, the original benchmark is not backported
Closes scylladb/scylladb#28451
* github.com:scylladb/scylladb:
test: perf: add example commands to perf-alternator and perf-cql-raw
test: perf: add option to write results to json in perf-cql-raw
test: perf: add option to write results to json in perf-alternator
test: perf: move write_json_result to a common file
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.
During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.
This PR fixes the problem by checking whether the table still exists.
Fixes: #28359
Closes scylladb/scylladb#28440
* github.com:scylladb/scylladb:
test: add test and reproducer for load_stats refresh exception
load_stats: handle dropped tables when refreshing load_stats
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.
The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
In the following commit, we make the test run with multiple keyspaces,
and the old check becomes inconvenient. We also move it down to the
part of the code that won't be executed for each keyspace.
Additionally, we check that the error message is as expected.
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.
We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).
It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.
As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.
We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
dropped by the compiler in non-debug builds (sketched below).
We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.
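A self-contained sketch of the resulting check ordering; all names here are illustrative stand-ins, not the actual Scylla symbols:
```
#include <cstdio>

// Stand-in for Scylla's compile-time build-mode check.
constexpr bool is_debug_build() {
#ifdef DEBUG
    return true;
#else
    return false;
#endif
}

// Stand-in for the config lookup; not free, so we want it compiled out
// in non-debug builds.
bool maintenance_mode() { return false; }

void maybe_validate_read_replicas() {
    // With is_debug_build() first, release builds see `false && ...`,
    // and the maintenance_mode() branch is dropped by the compiler.
    if (is_debug_build() && !maintenance_mode()) {
        std::printf("validating read replicas\n");
    }
}
```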
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we update it as well to avoid other
issues.
Fixes #27988
After guardrails_test.py has been migrated to test.py and fixed in
previous commits of this patch series, it can finally be enabled.
Fixes: SCYLLADB-255
This commit adds `wait_other_notice=True` to `cluster.populate` in
`guardrails_test.py`. Without this, `test_default_rf` sometimes fails
because `NetworkTopologyStrategy` setting fails before
the node knows about all other DCs.
Refs: SCYLLADB-255
This commit copies guardrails_test.py from dtest repository and
(temporarily) disables it, as it requires improvement in following
commits of this patch series before being enabled.
Refs: SCYLLADB-255
Schema is already a member of the select statement; avoiding
the call saves around 400 CPU instructions on the select
request hot path.
Closes scylladb/scylladb#28328
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.
During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.
It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.
This patch fixes this problem by checking if the table still exists.
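A minimal sketch of the idea, using a plain map as a stand-in for the database's table registry (names are illustrative, not the Scylla API):
```
#include <cstdint>
#include <string>
#include <unordered_map>

using table_id = std::string;

// Aggregate per-table sizes, skipping tables that were dropped between
// the time the stats snapshot was produced and the time it is processed.
std::unordered_map<table_id, uint64_t> aggregate_sizes(
        const std::unordered_map<table_id, uint64_t>& stats,
        const std::unordered_map<table_id, int>& live_tables) {
    std::unordered_map<table_id, uint64_t> out;
    for (const auto& [id, size] : stats) {
        if (!live_tables.contains(id)) {
            continue;  // dropped table: ignore instead of throwing
        }
        out[id] += size;
    }
    return out;
}
```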
Vector Search feature needs to support creating vector indexes with additional
filtering column. There will be two types of indexes: global which indexes
vectors per table, and local which indexes vectors per partition key. The new
syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary
Index. Vector indexes don't use secondary indexes functionalities in any way -
all indexing, filtering and processing data will be done on Vector Store side.
This patch allows creating vector indexes using this CQL syntax:
```
CREATE TABLE IF NOT EXISTS cycling.comments_vs (
commenter text,
comment text,
comment_vector VECTOR <FLOAT, 5>,
created_at timestamp,
discussion_board_id int,
country text,
lang text,
PRIMARY KEY ((commenter, discussion_board_id), created_at)
);
CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index
ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index
ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang)
USING 'vector_index'
WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' };
```
Currently, if we run these queries to create indexes, we receive the following errors:
```
InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column"
InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ"
```
This commit refactors `vector_index::check_target` to correctly validate
the columns building the index. Vector Store currently supports filtering by
native types only, so the types of the columns are checked. The first column
in the list must be a vector (the index is built on these vectors), so this
is checked as well. The allowed column types are native types excluding
counter (it is not possible to create a table with both a counter and a
vector) and duration (durations cannot be compared correctly; this type is
not allowed even in secondary indexes).
This commit also adds a cqlpy test to check the errors raised while creating
indexes.
Fixes: SCYLLADB-298
This needs to be backported to version 2026.1 as this is a fix for filtering support.
Closes scylladb/scylladb#28366
These were marked xfail due to #8077 (the column name was wrong),
but it was fixed long ago for 5.4 (exact commit not known).
Remove the xfail markers to prevent regressions.
Closes scylladb/scylladb#28432
This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard.
Backport: no, it's not a bug fix.
Closes scylladb/scylladb#28329
* github.com:scylladb/scylladb:
auth: bring back previous version of standard_role_manager::can_login
auth: switch find_record to use cache
auth: make find_record and callers standard_role_manager members
The method was coroutinized by 6df07f7ff7. Back then, coroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve capture-list lifetime problems. As a result, the code looks like
```
auto flush = [] {
... // do the flushing
auto post_flush = [] {
... // do the post-flushing
}
co_return co_await with_scheduling_group(group_b, post_flush);
};
co_return co_await with_scheduling_group(group_a, flush);
```
which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this
```
co_await switch_to(group_a);
... // do the flushing
co_await switch_to(group_b);
... // do the post-flushing
```
Code cleanup, not backporting
Closes scylladb/scylladb#28430
* github.com:scylladb/scylladb:
table: Fix indentation after previous patch
table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()
test_alternator_proxy_protocol starts a node and connects via the alternator ports.
Starting a node, by default, waits until the CQL ports are up. This does not guarantee
that the alternator ports are up (they will be up very soon after this), so there is a short
window where a connection to the alternator ports will fail.
Fix by adding a ServerUpState=SERVING mode, which waits for the node to report
to its supervisor (systemd, which we are pretending to be) that its ports are open.
The test is then adjusted to request this new ServerUpState.
Fixes #28210
Fixes #28211
Flaky tests are only in master and branch-2026.1, so backporting there.
Closes scylladb/scylladb#28291
* github.com:scylladb/scylladb:
test: test_alternator_proxy_protocol: wait for the node to report itself as serving
test: cluster_manager: add ability to wait for supervisor STATUS=serving
It allows dropping the local lambdas passed into with_scheduling_group()
calls. Overall the code flow becomes more readable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.
This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.
Closes scylladb/scylladb#28343
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.
Code cleanup, no backport
Closes scylladb/scylladb#28353
* github.com:scylladb/scylladb:
replica/partition_snapshot_reader: remove unused includes
partition_snapshot_reader: remove "flat" from name
mv partition_snapshot_reader.hh -> replica/
This test case was observed to take over 2 minutes to run on CI
machines, contributing to already bloated CI run times.
Disable this test in debug mode. This test checks for memtable flush
being slowed down when compaction can't keep up. So this test needs to
overwhelm the CPU by definition. On the other hand, this is not a
correctness test, there are such tests for the memtable and compaction
already, so it is not critical to run this in debug mode, it is not
expected to catch any use-after-free and such.
Closes scylladb/scylladb#28407
This commit replaces the previous approach of running pytest inside
GDB’s Python interpreter. Instead, tests are executed by driving a
persistent GDB process externally using pexpect.
- pexpect: Python library for controlling interactive programs
(used here to send commands to GDB and capture its output)
- persistent GDB: keep one GDB session alive across multiple tests
instead of starting a new process for each test
Tests can now be executed via `./test.py gdb` or with
`pytest test/scylla_gdb`. This improves performance and
makes failures easier to debug since pytest no longer runs
hidden inside GDB subprocesses.
Closes scylladb/scylladb#24804
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criterion:
if a read's remaining time <= the read's timeout on arrival at the semaphore * the live-updatable preemptive_abort_factor,
the read is rejected and the next one from the wait list is considered.
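A sketch of that criterion in isolation; the surrounding admission machinery is omitted, and the names are assumptions, not the actual reader_concurrency_semaphore interface:
```
#include <chrono>

// Admission-time check: does this read still have a chance to complete
// before its deadline? `original_timeout` is the full timeout the read
// had when it arrived at the semaphore.
bool should_shed(std::chrono::steady_clock::time_point now,
                 std::chrono::steady_clock::time_point deadline,
                 std::chrono::steady_clock::duration original_timeout,
                 double preemptive_abort_factor) {
    auto remaining = deadline - now;
    return remaining <= original_timeout * preemptive_abort_factor;
}
```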
Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353
Backport is not needed. Better to first observe its impact.
Closes scylladb/scylladb#21649
* github.com:scylladb/scylladb:
reader_concurrency_semaphore: Check during admission if read may timeout
permit_reader::impl: Replace break with return after evicting inactive permit on timeout
reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
config: Add parameters to control reads' preemptive_abort_factor
permit_reader: Add a new state: preemptive_aborted
reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
reader_concurrency_semaphore: Remove cpu_concurrency's default value
Contains various improvements to the tablet load balancer, batched together to save on the bill for CI.
Most notably:
- Make plan summary more concise, and print info only about present elements.
- Print rack name in addition to DC name when making a per-rack plan
- Print "Not possible to achieve balance" only when this is the final plan with no active migrations
- Print per-node stats when "Not possible to achieve balance" is printed
- amortize metrics lookup cost
- avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"
Backport to 2026.1: since the changes enhance debuggability and are relatively low risk
Fixes #28423
Fixes #28422
Closes scylladb/scylladb#28337
* github.com:scylladb/scylladb:
tablets: tablet_allocator.cc: Convert tabs to spaces
tablets: load_balancer: Warn about incomplete stats once for all offending nodes
tablets: load_balancer: Improve node stats printout
tablets: load_balancer: Warn about imbalance only when there are no more active migrations
tablets: load_balancer: Extract print_node_stats()
tablet: load_balancer: Use empty() instead of size() where applicable
tablets: Fix redundancy in migration_plan::empty()
tablets: Cache pointer to stats during plan-making
tablets: load_balancer: Print rack in addition to DC when giving context
tablets: load_balancer: Make plan summary concise
tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
In fact, it's partially there already. When view_builder::start() is called, it first calls the initialization code (the start_in_background() method), then kicks off do_build_step(), which runs a background fiber to perform build steps. The starting code inherits its scheduling group from main(), while the step-fiber code needs to run in a maintenance scheduling group, so it explicitly grabs one via database->db_config.
This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() in preparation for splitting the streaming scheduling group into parts (see SCYLLADB-351). To make this happen, do_build_step() is patched to inherit its scheduling group from view_builder::start(), and start() itself is called by main from the maintenance scheduling group (like the other view building services).
New feature (nested scheduling group), not backporting
Closes scylladb/scylladb#28386
* github.com:scylladb/scylladb:
view_builder: Start background in maintenance group
view_builder: Wake-up step fiber with condition variable
In a lambda returned from make_streaming_consumer() there's a check that
the current scheduling group is the streaming one. It came from #17090, where
streaming code was launched in the wrong sched group, thus affecting user
groups in a bad way.
The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.
To preserve the check and to stop using database as a provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28410
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.
Fixes: SCYLLADB-130
This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.
Closes scylladb/scylladb#28199
* github.com:scylladb/scylladb:
docs: add feature page for automatic repair
docs: inter-link incremental-repair and repair documents
docs: incremental-repair: fix curl example
One of the best features of the pytest framework is "assertion
rewriting": If your test does, for example, "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, but also what the non-equal values are,
how they differ (e.g., which elements of arrays differ), and
how each side of the equality was calculated.
But pytest can only "rewrite" assertions that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug assertion failures.
This problem is especially noticeable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.
We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:
pytest.register_assert_rewrite("cassandra_tests.porting")
This tells pytest to rewrite assertions in porting.py the first time
that it is imported.
It used to work well, but recently it stopped working. This is because
we changed the module paths recently, so it should now be written as
test.cqlpy.cassandra_tests.porting.
I verified, by editing one of the cassandra_tests to make a bad check,
that this statement had indeed stopped working, and that fixing the module
path in this way solves it and makes assertion rewriting work
again.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#28411
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.
This explicit switch made sense before the previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from the .start() method and can inherit the scheduling group from it.
That said, the main code now calls view_builder::start() in the maintenance
scheduling group, killing two birds with one stone. First, the step fiber
no longer needs to borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
View builder runs a background fiber that performs build steps. To kick
the fiber it used a serialized action, but that's overkill -- nobody
waits for the action to finish, except on stop, when it's joined.
This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places in the code that need to explicitly switch into the
gossiper scheduling group. For that they currently call database to
provide the group, but it's better to get the gossiper sched group from the
gossiper itself, all the more so as all those places have the gossiper at hand.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is to initialize dependency references, in particular gossiper&,
before _group0_barrier. The latter will need to access this->_gossiper
in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Explain what the feature is and how to configure it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
When a shard on a replica is overloaded, it breaks down completely,
throughput collapses, latencies go through the roof and the
node/shard can even become completely unresponsive to new connection
attempts.
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued and thus they can time out while being in the queue or during
the execution. In the latter case, the timeout does not always
result in the read being aborted.
Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes
for a queued read to be admitted is around that of the timeout.
If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criterion:
if a read's remaining time <= the read's timeout on arrival at the semaphore * the preemptive factor,
the read is rejected and the next one from the wait list is
considered.
Evicting an inactive permit destroys the permit object when the
reader is closed, making any further member access invalid. Switch
from break to an early return to prevent any possible use-after-free
after evict() in the state::inactive timeout path.
The new parameter parametrizes the factor used to reject a read
during admission. Its value shall be between 0.0 and 1.0 where
+ 0.0 means a read will never get rejected during admission
+ 1.0 means a read will immediately get rejected during admission
Although passing values outside the interval is possible, such values
have exactly the same effect as if they were clamped to [0.0, 1.0].
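In code, the described semantics amount to a clamp (a sketch; the function name is made up, only the config parameter is real):
```
#include <algorithm>

double effective_preemptive_abort_factor(double configured) {
    // Values outside [0.0, 1.0] behave exactly as if clamped.
    return std::clamp(configured, 0.0, 1.0);
}
```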
A permit gets into the preemptive_aborted state when:
- it times out;
- it gets rejected from execution due to a high chance that its execution
would not finish on time.
Being in this state means the permit was removed from the wait list,
its internal timer was canceled, and the semaphore's statistic
`total_reads_shed_due_to_overload` was increased.
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.
To make cleanup reliable this PR:
Adds an additional “fallback cleanup” stage, `write_both_read_old_fallback_cleanup`, that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.
Avoids changing both read and write selectors in a single step when transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.
Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.
No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.
Closes scylladb/scylladb#28169
* github.com:scylladb/scylladb:
topology_coordinator: add write_both_read_old_fallback_cleanup state
topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
There are a few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code, which needs the last state ID. For that, an explicit helper method is added to group0_client.
Refining API between components, not backporting
Closes scylladb/scylladb#28387
* github.com:scylladb/scylladb:
raft_group0_client: Dont export system keyspace
raft_group0_client: Add and use get_last_group0_state_id()
group0_state_machine: Call ensure_group0_sched() with data_dictionary
view_building_worker: Use its own system_keyspace& reference
During test.py run, noticed this warning:
```
10:38:22 test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22 /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22 pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```
The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.
Closes scylladb/scylladb#28409
The include-what-you-use workflow fails with
```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```
Fix by adding the correct permissions.
Closes scylladb/scylladb#28390
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.
Before:
@pytest.mark.skip_mode("dev", reason="...")
@pytest.mark.skip_mode("debug", reason="...")
After:
@pytest.mark.skip_mode(["dev", "debug"], reason="...")
This reduces duplication when the same skip condition applies
to multiple build modes.
Closes scylladb/scylladb#28406
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.
We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.
Fixes #28385
Closes scylladb/scylladb#28388
Currently, tablet_allocator switches to the streaming scheduling group that
it gets from the database. It's not nice to use database as a provider of
configs/scheduling groups.
This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.
This will help split the streaming scheduling group into more
fine-grained groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28356
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.
Improvements. No backport is required.
Closes scylladb/scylladb#28396
* github.com:scylladb/scylladb:
test/storage: speed up out-of-space prevention tests by using smaller volumes
test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.
Closes scylladb/scylladb#28348
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.
Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
Load-balancing can now be per-rack instead of per-DC, so just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack that is.
Before:
load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)
After:
load_balancer - Prepared plan: migrations: 1
We print only stats about elements which are present.
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.
The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.
Adding the explicit include fixes the build.
Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093
No backport: This problem is only present in master.
Closes scylladb/scylladb#28379
Fixes #28398
When used as path elements in Google storage paths, the object names
need to be URL-encoded. This was missed because (a) the tests did not really
use prefixes including non-URL-valid chars (i.e. / etc.), and (b) the mock
server used for most testing does not enforce this particular aspect.
Modified the unit tests to use prefixing for all names, so when run
against real GS, any errors like this will show.
Fixes #28399
When iterating with a pager, the mock server and real GCS behave differently:
the latter will not give a pager token for the last page, only for the
penultimate one. This needs to be handled.
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request (see the sketch below)
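A minimal sketch of this policy under the stated assumptions; the real code lives in dns_connection_factory, and every name below is illustrative:
```
#include <chrono>
#include <functional>
#include <string>
#include <vector>

struct resolver_cache {
    using clock = std::chrono::steady_clock;
    std::function<std::vector<std::string>()> resolve;  // the DNS lookup
    std::chrono::seconds ttl{0};
    std::vector<std::string> addrs;
    clock::time_point refreshed{};

    const std::vector<std::string>& get() {
        auto now = clock::now();
        // A TTL of 0 means: resolve the record for every request.
        if (addrs.empty() || ttl.count() == 0 || now - refreshed >= ttl) {
            addrs = resolve();
            refreshed = now;
        }
        return addrs;
    }

    void invalidate() {  // called after a connection failure
        addrs.clear();   // forces a re-resolve on the next get()
    }
};
```
Calling invalidate() after a failed connect is what breaks the stale-DNS failure loop: the next request is forced to re-resolve instead of retrying the same dead address.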
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
Closes scylladb/scylladb#27891
* github.com:scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
Use the new ServerUpState=SERVING mechanism to wait for the alternator
ports to be up, rather than relying on the default waiting for CQL,
which happens earlier and therefore opens a window where a connection to
the alternator ports will fail.
When running under systemd, ScyllaDB sends a STATUS=serving message
to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus
making the cluster manager pretend it is systemd. Users of the cluster
manager can now wait for the node to report itself up, rather than
having to parse log files or retry connections.
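For reference, the notification itself is just a datagram on the unix socket named by $NOTIFY_SOCKET (the sd_notify protocol). A rough sketch of the sending side, ignoring systemd's abstract-socket variant:
```
#include <cstdlib>
#include <cstring>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Send a single sd_notify-style datagram to the supervisor, if any.
bool notify_supervisor(const char* msg) {
    const char* path = std::getenv("NOTIFY_SOCKET");
    if (!path) {
        return false;  // not running under a (real or pretend) systemd
    }
    int fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (fd < 0) {
        return false;
    }
    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    ssize_t n = sendto(fd, msg, std::strlen(msg), 0,
                       reinterpret_cast<const sockaddr*>(&addr), sizeof(addr));
    close(fd);
    return n == static_cast<ssize_t>(std::strlen(msg));
}

// Usage: notify_supervisor("STATUS=serving");
```
The cluster manager only has to create such a socket, export its path as NOTIFY_SOCKET, and wait for the "STATUS=serving" datagram.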
This parameter was not mentioned anywhere in the documentation.
Add an explanation of this parameter: why we need it, what the
default is, and how it can be changed.
Closes scylladb/scylladb#28132
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.
The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after the verbosity of the draining-related log
messages was increased in 44de563. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.
Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.
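A minimal sketch of the filtering step, using standard containers; host_id is a stand-in, and the function is illustrative, not the actual raft_topology_update_ip signature:
```
#include <unordered_set>

using host_id = int;  // stand-in for locator::host_id

// Notify only about nodes that became "released" since the last reload;
// nodes in `before` were already announced and are filtered out.
std::unordered_set<host_id> newly_released(
        const std::unordered_set<host_id>& now,
        const std::unordered_set<host_id>& before) {
    std::unordered_set<host_id> out;
    for (auto id : now) {
        if (!before.contains(id)) {
            out.insert(id);
        }
    }
    return out;
}
```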
Fixes: #28301
Refs: #25031
Commit 59faa6d introduces a new parameter called cpu_concurrency
and sets its default value to 1, which violates commit fbb83dd, which
removed all default values from the constructors except the one used by the
unit tests.
The patch removes the default value of the cpu_concurrency parameter
and alters the tests to use the test-dedicated reader_concurrency_semaphore
constructor wherever possible.
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.
Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.
This cuts ~5–6s per test.
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module's
clusters, applying to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.
Now that the system_keyspace reference is used internally by the client code
itself, there is no need to encourage other services to abuse it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are several places that want to get the last state ID, and for that
they make raft_group0_client export a system_keyspace reference.
This patch adds a helper method to provide the needed ID.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a validation that tables used by group0 commands are marked
with the respective prop. For it, the caller code needs to provide a
database reference, and it gets one via the client -> system_keyspace chain.
There's a more explicit way -- get the data_dictionary via the proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some code in the worker needs to mess with system_keyspace&. While
there's a reference to it from the worker object, it gets one via
group0 -> group0_client, which is a bit of an overkill.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch set eliminates the special audit info guard used before for batch statements
and simplifies the audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.
Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug
Closes scylladb/scylladb#28326
* github.com:scylladb/scylladb:
audit: replace batch dynamic_cast with static_cast
audit: eliminate dynamic_cast to batch_statement in inspect
audit: cql: remove create_no_audit_info
audit: add batch bool to audit_info class
Currently it grabs one from the database, but it's not nice to use database
as a config/sched-groups provider.
This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).
This will help split the streaming scheduling group into more
fine-grained groups under the maintenance supergroup: SCYLLADB-351
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#28358
consistent_cluster_management has been deprecated since scylla-5.2 and is no
longer used by ScyllaDB, so it should not be used by tests either.
Closes scylladb/scylladb#28340
These two streams mostly play together. The former provides an input_stream that reads from in-memory temporary buffers; the latter wraps it to limit the size of the provided temporary buffers. Both are used to test the contiguous data consumer; the buffer_input_stream also has a caller in the sstables reversing reader.
This PR removes the buffer_input_stream in favor of seastar's memory_data_source, and moves the limiting_input_stream into test/lib.
Enhancing testing code, not backporting
Closes scylladb/scylladb#28352
* github.com:scylladb/scylladb:
code: Move limiting data source to test/lib
util: Simplify limiting_data_source API
util: Remove buffer_input_stream
test: Use seastar::util::temporary_buffer_data_source in data consumer test
sstables: Use seastar::util::as_input_stream() in mx reader
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.
I see a ~40% reduction in run time after this patch, since database_test is
one of the most (if not the most) time-consuming tests in the boost suite.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#28324
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.
The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.
As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250
backport: not required, this is a refactor
Closes scylladb/scylladb#28093
* github.com:scylladb/scylladb:
db: view: extend try/catch scope in handle_create_view_local
db: view: refine create/drop coroutine signatures (accept parameters as const references, enabling a clearer workflow and safer data flow)
db: view: switch from continuations to coroutines
db: view: introduce helper to acquire or reuse semaphore units
db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
Previously, we wanted to make minimal changes with regards to the new
unified auth cache. However, as a result, some calls on the hot path
were missed. Now we have switched the underlying find_record call
to use the cache. Since caching is now at a lower level, we bring
back the original code.
Since every write-type auth statement takes group0_guard at the beginning,
we hold read_apply_mutex and cannot have a running raft apply during our
operation. Therefore, the auth cache and internal CQL reads return the same,
consistent results. This makes it safe to read via cache instead of internal
CQL.
LDAP is an exception, but it is eventually consistent anyway.
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.
Also move the code to the replica namespace.
test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--output-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.
Fixes: #13076
New functionality, no backport.
Closes scylladb/scylladb#27836
* github.com:scylladb/scylladb:
tools/scylla-sstable: introduce filter command
tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
tools/scylla-sstable: make partition_set ordered
tools/scylla-stable: remove unused boost/algorithm/string.hpp include
Generally, square brackets are not allowed in URIs, while pytest uses them
in the test name to show that there were additional parameters for the same
test. When such a test fails, Jenkins shows the directory correctly;
however, an attempt to download only it will fail because of the square
brackets in the URI. This change substitutes the square brackets with round
brackets.
Closes scylladb/scylladb#28226
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.
Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.
Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.
This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.
By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.
In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).
Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.
This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
if (action_failed(tablet_state.barriers[trinfo.stage])) {
if (check_excluded_replicas()) {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
break;
}
}
if (do_barrier()) {
...
}
break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.
On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.
Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.
Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are only swallowing more parts that may throw due to semaphore
abortion or any other abort request, and it doesn't change the logic.
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
This commit also introduces a new alias for the optional unit type, making functions that use this type simpler and more readable.
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
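A minimal sketch of such a helper, assuming a plain seastar semaphore (the actual ScyllaDB helper may differ in names and types):
```
#include <optional>
#include <seastar/core/coroutine.hh>
#include <seastar/core/semaphore.hh>

using opt_semaphore_units = std::optional<seastar::semaphore_units<>>;

// Reuse units supplied by the caller, or acquire fresh ones from the
// semaphore -- so callees don't care which case they are in.
seastar::future<seastar::semaphore_units<>>
get_view_builder_units(seastar::semaphore& sem, opt_semaphore_units units) {
    if (units) {
        co_return std::move(*units);
    }
    co_return co_await seastar::get_units(sem, 1);
}
```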
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).
Ref #18066.
Closes scylladb/scylladb#28336
Only two tests use it now -- the limit-data-source-test itself and a test
that validates the continuous_data_consumer template.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.
Currently all callers just return constant numbers from it. Passing a
function that returns a non-constant value could probably be used for a
fuzz test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.
This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.
Seastar has the very same facility, so it's better to use it. Eventually
the buffer_data_source_impl will be removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
We don't need a special guard value: it is
only filled for batch statements, for
which we can simply ignore the value.
Not having a special value allows us to return
fast when audit is not enabled.
The user can now discover the general explanation of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node-scope restore performance improvement: ~4 times faster in the best-case scenario, when all sstables are fully contained.
1. Add storage options field to sstable: introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source: extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation: remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets: avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables: replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908
No need to backport as this code targets 2026.2 release (for tablet-aware restore)
Closes scylladb/scylladb#26834
* github.com:scylladb/scylladb:
tests: reuse test_backup_broken_streaming
streaming: enable direct download of contained sstables
storage: add method to create component source
streaming: keep sharded database reference on tablet_sstable_streamer
streaming: skip streaming empty sstable sets
streaming: inline streamer instance creation
tests: fix incorrect backup/restore test flow
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
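A self-contained sketch of that loop (the handler here is a stand-in for the real list of known restartable errors):
```
#include <exception>

// Stand-in for the real set of handlers applied on each iteration.
bool matches_known_restartable(const std::exception& e);

bool is_restartable(std::exception_ptr ep) {
    while (ep) {
        try {
            std::rethrow_exception(ep);
        } catch (const std::exception& e) {
            if (matches_known_restartable(e)) {
                return true; // a known handler matched
            }
            // Unwrap one level of nesting, if any, then retry all handlers.
            auto* nested = dynamic_cast<const std::nested_exception*>(&e);
            ep = nested ? nested->nested_ptr() : nullptr;
        } catch (...) {
            return false; // not a std::exception, nothing to unwrap
        }
    }
    return false; // chain exhausted without a match
}
```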
Closes scylladb/scylladb#28240
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:
* The labels used by test units were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
listed in the output.
Refs scylladb/scylladb#25415
Backport: not needed. The code hasn't been used anywhere yet.
Closes scylladb/scylladb#28255
* github.com:scylladb/scylladb:
test/lib/boost_test_tree_lister.cc: Record empty test suites
test/lib/boost_test_tree_lister.cc: Deduplicate labels
This finishes the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR only cleans up and migrates leftover tests
that still use the old way of skip_mode.
Closes scylladb/scylladb#28299
Move state management from dns_connection_factory into the state class
itself, to encapsulate its internal state and stop managing it from the
`dns_connection_factory`.
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--output-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
This flag was added to operations which have an --output-dir
command-line argument. These operations write sstables and need a
directory to write them to. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, a generation
clash was almost guaranteed, because each scylla-sstable command
invocation would start output generations from 1. To avoid this, an empty
output directory was a requirement, with
--unsafe-accept-nonempty-output-dir allowing for a force-override.
Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of a generation clash is almost
0. Even if it happens, the tool will simply fail to write the new
sstable with the clashing generation.
Remove this historic relic of a flag and the related logic; it is just a
pointless nuisance nowadays.
When one request is very slow and the request rate is high,
we can in theory get a collision on the id. This patch
avoids that by reusing ids and aborting when there
is no free one (unlikely).
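A toy sketch of that allocation policy (data structure and names assumed):
```
#include <cstdint>
#include <optional>
#include <set>

// Hand out the smallest free id; signal the caller to abort the request
// when the id space is exhausted (expected to be extremely rare).
std::optional<uint16_t> allocate_stream_id(std::set<uint16_t>& free_ids) {
    if (free_ids.empty()) {
        return std::nullopt;
    }
    auto it = free_ids.begin();
    uint16_t id = *it;
    free_ids.erase(it);
    return id;
}
```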
This commit avoids leaking the seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally, it adds an
abort_source for fast and clean shutdown.
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.
The new test fails (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).
The commit also contains another test, which verifies that a direct
description of the paxos state table is wrapped in a comment.
Refs #28183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
Paxos state tables are internal tables fully managed by Scylla;
they shouldn't be exposed to the user, nor should they be backed up.
This commit hides such tables from all listings, and if such a table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.
Fixes scylladb/scylladb#28183
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.
However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.
We start including those empty suites and files in the output of
`--list_json_content`.
---
Examples (with additional formatting):
* Consider the following test file, `test/boost/dummy_test.cc` [1]:
```
BOOST_AUTO_TEST_SUITE(dummy_suite1)
BOOST_AUTO_TEST_SUITE(dummy_suite2)
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE_END()
BOOST_AUTO_TEST_SUITE(dummy_suite3)
BOOST_AUTO_TEST_SUITE_END()
```
Before this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
```
After this commit:
```
$ ./build/debug/test/boost/dummy_test -- --list_json_content
[{"file":"test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites": [
{"name": "dummy_suite2", "suites": [], "tests": []}
], "tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}
], "tests": []}}]
```
* Consider the same test file as in Example 1, but also assume it's compiled
into `test/boost/combined_tests`.
Before this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
$
```
After this commit:
```
$ ./build/debug/test/boost/combined_tests -- --list_json_content
[..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
{"name": "dummy_suite1", "suites":
[{"name": "dummy_suite2", "suites": [], "tests": []}],
"tests": []},
{"name": "dummy_suite3", "suites": [], "tests": []}],
"tests":[]}}, ...]
```
[1] Note that the example is simplified. As of now, it's not possible to use
`--list_json_content` with a file without any Boost tests. That will
result in the following error: `Test setup error: test tree is empty`.
Refs scylladb/scylladb#25415
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.
Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.
---
Example (with additional formatting):
Consider the following test in the file `test/boost/dummy_test.cc`:
```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
return make_ready_future();
}
```
Before this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```
After this commit:
```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
"tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```
Refs scylladb/scylladb#25415
Refactor streams.cc - turn `.then` calls into coroutines.
This reduces the amount of clutter, lambdas, and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
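The shape of the transformation, on a made-up example (`compute` is hypothetical):
```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

seastar::future<int> compute(); // hypothetical async operation

// Before: continuation style.
seastar::future<int> doubled_then() {
    return compute().then([] (int v) {
        return v * 2;
    });
}

// After: coroutine style -- no nested lambda, no captured references.
seastar::future<int> doubled_coro() {
    int v = co_await compute();
    co_return v * 2;
}
```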
When this column and the relevant SUPPORTED key were added, the
documentation was mistakenly put in the section about the shard awareness
extension. This commit moves the documentation into a dedicated section.
I also expanded it to describe both the new column and the new SUPPORTED
key.
```
- name: Comment and close if author email is scylladb.com
  uses: actions/github-script@v7
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    script: |
      const issue = context.payload.issue;
      const actor = context.actor;
      // Get user data (only public email is available)
      const { data: user } = await github.rest.users.getByUsername({
        username: actor,
      });
      const email = user.email || "";
      console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);
      // Only continue if email exists and ends with @scylladb.com
      if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {
        console.log("User is not a scylladb.com email (or email not public); skipping.");
        return;
      }
      const owner = context.repo.owner;
      const repo = context.repo.repo;
      const issue_number = issue.number;
      const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";
```
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
```
        // The attribute is consumed as timestamp, not stored in _cells.
        continue;
    }
    throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*timestamp_attribute)));
}
_length_in_bytes += column_name.size();
if (!cdef) {
    // This attribute may be a key column of one of the GSI or LSI,
    throw api_error::validation("Read-modify-write operations are disabled by 'forbid_rmw' write isolation policy. Refer to https://github.com/scylladb/scylla/blob/master/docs/alternator/alternator.md#write-isolation-policies for more information.");
    throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*_timestamp_attribute)));
    }
}
if (action == "DELETE") {
    // The DELETE operation can do two unrelated tasks. Without a
    // "Value" option, it is used to delete an attribute. With a
    throw api_error::validation(fmt::format("The '{}' attribute used as a write timestamp must be a positive number (microseconds since epoch)", to_string_view(*_timestamp_attribute)));
}
}
if (actions.second.has_value()) {
    // An action on a top-level attribute column_name. The single
    // action is actions.second.get_value(). We can simply invoke
```
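A sketch of the rule those validation errors enforce (standalone, not the actual parsing code):
```
#include <charconv>
#include <cstdint>
#include <string_view>

// The timestamp attribute must parse as a positive integer number of
// microseconds since the epoch; anything else is rejected.
bool valid_write_timestamp(std::string_view s) {
    int64_t us = 0;
    auto [ptr, ec] = std::from_chars(s.data(), s.data() + s.size(), us);
    return ec == std::errc{} && ptr == s.data() + s.size() && us > 0;
}
```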
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
"How long permissions in cache remain valid. Depending on the authorizer, such as CassandraAuthorizer, fetching permissions can be resource intensive. Permissions caching is disabled when this property is set to 0 or when AllowAllAuthorizer is used. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"How long authorized statements cache entries remain valid. The cached value is considered valid as long as both its value is not older than the permissions_validity_in_ms "
"and the cached value has been read at least once during the permissions_validity_in_ms time frame. If any of these two conditions doesn't hold the cached value is going to be evicted from the cache.\n"
"Refresh interval for permissions cache (if enabled). After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms.")
"Refresh interval for authorized statements cache. After this interval, cache entries become eligible for refresh. An async reload is scheduled every permissions_update_interval_in_ms time period and the old value is returned until it completes. If permissions_validity_in_ms has a non-zero value, then this property must also have a non-zero value. It's recommended to set this value to be at least 3 times smaller than the permissions_validity_in_ms. This option additionally controls the permissions refresh interval for LDAP.")
"Enable or disable inter-node encryption. You must also generate keys and provide the appropriate key and trust store locations and passwords. The available options are:\n"
"The maximum fraction of cache memory permitted for use by index cache. Clamped to the [0.0; 1.0] range. Must be small enough to not deprive the row cache of memory, but should be big enough to fit a large fraction of the index. The default value 0.2 means that at least 80\% of cache memory is reserved for the row cache, while at most 20\% is usable by the index cache.")
,consistent_cluster_management(this,"consistent_cluster_management",value_status::Deprecated,true,"Use RAFT for cluster management and DDL.")
,force_gossip_topology_changes(this,"force_gossip_topology_changes",value_status::Used,false,"Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
,force_gossip_topology_changes(this,"force_gossip_topology_changes",value_status::Deprecated,false,"Force gossip-based topology operations in a fresh cluster. Only the first node in the cluster must use it. The rest will fall back to gossip-based operations anyway. This option should be used only for testing. Note: gossip topology changes are incompatible with tablets.")
,recovery_leader(this,"recovery_leader",liveness::LiveUpdate,value_status::Used,utils::null_uuid(),"Host ID of the node restarted first while performing the Manual Raft-based Recovery Procedure. Warning: this option disables some guardrails for the needs of the Manual Raft-based Recovery Procedure. Make sure you unset it at the end of the procedure.")
,wasm_cache_memory_fraction(this,"wasm_cache_memory_fraction",value_status::Used,0.01,"Maximum total size of all WASM instances stored in the cache as fraction of total shard memory.")
,wasm_cache_timeout_in_ms(this,"wasm_cache_timeout_in_ms",value_status::Used,5000,"Time after which an instance is evicted from the cache.")
"Allows target tablet size to be configured. Defaults to 5G (in bytes). Maintaining tablets at reasonable sizes is important to be able to " \
"redistribute load. A higher value means tablet migration throughput can be reduced. A lower value may cause number of tablets to increase significantly, " \
"potentially resulting in performance drawbacks.")
"Maximum number of tablets which may be leaving a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
"Maximum number of tablets which may be pending on a shard at the same time. Effecting only on topology coordinator. Set to the same value on all nodes.")
,replication_strategy_warn_list(this,"replication_strategy_warn_list",liveness::LiveUpdate,value_status::Used,{locator::replication_strategy_type::simple},"Controls which replication strategies to warn about when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
,replication_strategy_fail_list(this,"replication_strategy_fail_list",liveness::LiveUpdate,value_status::Used,{},"Controls which replication strategies are disallowed to be used when creating/altering a keyspace. Doesn't affect the pre-existing keyspaces.")
,service_levels_interval(this,"service_levels_interval_ms",liveness::LiveUpdate,value_status::Used,10000,"Controls how often service levels module polls configuration table")
,audit(this,"audit",value_status::Used,"none",
,audit(this,"audit",value_status::Used,"table",
"Controls the audit feature:\n"
"\n"
"\tnone : No auditing enabled.\n"
"\tsyslog : Audit messages sent to Syslog.\n"
"\ttable : Audit messages written to column family named audit.audit_log.\n")
,audit_categories(this,"audit_categories",liveness::LiveUpdate,value_status::Used,"DCL,DDL,AUTH","Comma separated list of operation categories that should be audited.")
,audit_categories(this,"audit_categories",liveness::LiveUpdate,value_status::Used,"DCL,AUTH,ADMIN","Comma separated list of operation categories that should be audited.")
,audit_tables(this,"audit_tables",liveness::LiveUpdate,value_status::Used,"","Comma separated list of table names (<keyspace>.<table>) that will be audited.")
,audit_keyspaces(this,"audit_keyspaces",liveness::LiveUpdate,value_status::Used,"","Comma separated list of keyspaces that will be audited. All tables in those keyspaces will be audited")
,audit_unix_socket_path(this,"audit_unix_socket_path",value_status::Used,"/dev/log","The path to the unix socket used for writing to syslog. Only applicable when audit is set to syslog.")
"Sets the minimal tablet size for the load balancer. For any tablet smaller than this, the balancer will use this size instead of the actual tablet size.")
"max cpu usage ratio (between 0 and 1) for compaction process. Not intended for setting in normal operations. Setting it to 1 or higher will disable it, recommended operational setting is 0.5.")
"If set to higher than 0, ignore the controller's output and set the memtable shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to higher than 0, ignore the controller's output and set the compaction shares statically. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"Set the maximum shares of regular compaction to the specific value. Do not set this unless you know what you are doing and suspect a problem in the controller. This option will be retired when the controller reaches more maturity.")
"If set to true, enforce the min_threshold option for compactions strictly. If false (default), Scylla may decide to compact even if below min_threshold.")
"Set the minimum interval in seconds between flushing all tables before each major compaction (default is 86400)."
"This option is useful for maximizing tombstone garbage collection by releasing all active commitlog segments."
"Set to 0 to disable automatic flushing all tables before major compaction.")
,default_log_level(this,"default_log_level",value_status::Used,seastar::log_level::info,"Default log level for log messages")
,logger_log_level(this,"logger_log_level",value_status::Used,{},"Map of logger name to log level. Valid log levels are 'error', 'warn', 'info', 'debug' and 'trace'")
,log_to_stdout(this,"log_to_stdout",value_status::Used,true,"Send log output to stdout")
To support writes arriving out-of-order -- either due to natural delays or user-provided timestamps -- the repair mode has a propagation delay.
Out-of-order writes present a problem for repair mode tombstone gc. Consider the following example sequence of events:
1) Write ``DELETE FROM table WHERE key = K1`` arrives at the node.
2) Repair is run.
3) Compaction runs and garbage collects the tombstone for ``key = K1``.
4) Write ``INSERT INTO table (key, ...) VALUES (K1, ...)`` arrives at the node with a timestamp smaller than that of the delete. The tombstone for ``key = K1`` should apply to this write, but it has already been garbage collected, so this data is resurrected.
Propagation delay solves this problem by establishing a window before repair, where tombstones are not yet garbage collectible: a tombstone is garbage collectible if it was written before the last repair by at least the propagation delay.
The value of the propagation delay can be set via the ``propagation_delay_in_seconds`` parameter.
The default value of the propagation delay is 1 hour. This parameter should only be changed if your application uses user-provided timestamps and writes and deletes can arrive out-of-order by more than the default 1 hour.
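The rule above, expressed as a small standalone predicate (types and names illustrative):
```
#include <chrono>

// A tombstone may be garbage collected only if it was written at least
// `propagation_delay` before the most recent repair of the data.
bool tombstone_collectible(std::chrono::system_clock::time_point write_time,
                           std::chrono::system_clock::time_point last_repair,
                           std::chrono::seconds propagation_delay) {
    return write_time + propagation_delay <= last_repair;
}
```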
@@ -78,6 +78,7 @@ Permits are in one of the following states:
* `active/await` - a previously `active/need_cpu` permit which needs something other than CPU to proceed; it is waiting on I/O or a remote shard; other permits can be admitted while the permit is in this state, pending resource availability;
* `inactive` - the permit was marked inactive; it can be evicted to make room for admitting more permits if needed;
* `evicted` - a former inactive permit which was evicted; the permit has to undergo admission again for the read to resume;
* `preemptive_aborted` - the permit timed out or was rejected during admission, as it was detected the read might time out later during execution;
Note that some older releases will have different names for some of these states or lack some of the states altogether:
* On a regular system or VM (running Ubuntu, CentOS, or RedHat Enterprise): :code:`$ scylla --version`
Check the :doc:`Operating System Support Guide </getting-started/os-support>` for a list of supported operating systems and versions.
Check the `Operating System Support Guide <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_ for a list of supported operating systems and versions.
* On a docker node: :code:`$ docker exec -it Node_Z scylla --version`
@@ -12,37 +12,42 @@ the package manager (dnf and apt).
Prerequisites
---------------
Ensure your platform is supported by the ScyllaDB version you want to install.
See :doc:`OS Support </getting-started/os-support>` for information about supported Linux distributions and versions.
See `OS Support <https://docs.scylladb.com/stable/versioning/os-support-per-version.html>`_
for information about supported Linux distributions and versions.
Note that if you're on CentOS 7, only root offline installation is supported.
Download and Install
-----------------------
#. Download the latest tar.gz file for ScyllaDB version (x86 or ARM) from ``https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-<version>/``.
**Example** for version 2025.1:
- Go to https://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-2025.1/
- Download the ``scylla-unified`` file for the patch version you want to
install. For example, to install 2025.1.9 (x86), download
All releases are available as a Docker container, EC2 AMI, GCP, and Azure images.
.. _os-support-definition:
By *supported*, it is meant that:
- A binary installation package is available.
- The download and install procedures are tested as part of the ScyllaDB release process for each version.
- An automated install is included from :doc:`ScyllaDB Web Installer for Linux tool </getting-started/installation-common/scylla-web-installer>` (for the latest versions).
You can `build ScyllaDB from source <https://github.com/scylladb/scylladb#build-prerequisites>`_
on other x86_64 or aarch64 platforms, without any guarantees.
**Topic: Mitigate non-paged queries coming from connection authentications**
**Audience: ScyllaDB administrators**
Issue
-----
If you create lots of roles and give them lots of permissions, your nodes might see spikes of non-paged queries.
Root Cause
----------
``permissions_cache_max_entries`` is set to 1000 by default. This setting may not be high enough for bigger deployments with lots of tables, users, and roles with permissions.
Solution
--------
Open the scylla.yaml configuration for editing and adjust the following parameters:
``permissions_cache_max_entries`` - increase this value to suit your needs. See the example below.
``permissions_update_interval_in_ms``
``permissions_validity_in_ms``
.. note:: ``permissions_update_interval_in_ms`` and ``permissions_validity_in_ms`` can also be set to make authentication records come from the cache instead of lookups, which generate non-paged queries.
Example
-------
Note that ``permissions_cache_max_entries`` has no maximum value; it is limited only by your memory.
The cache consumes memory as it caches all records from the list of users and their associated roles (similar to a cartesian product).
Every combination of user, role, and permission (7 types) is cached on a per-table basis.
If, for example, you have 1 user with 1 role and 1 table, the table will have 7 permission types, giving 1 * 1 * 1 * 7 = 7 entries.
When expanded to 5 users, 5 roles, and 10 tables, this becomes 5 * 5 * 10 * 7 = 1750 entries, which is above the default cache value of 1000. The entries that go over the max value (750 entries) will result in non-paged queries for every new connection from the client (and clients tend to reconnect often).
In cases like this, you may want to consider trading your memory for not stressing the entire cluster with ``auth`` queries.
* :doc:`If a query does not reveal enough results </kb/cqlsh-results>`
* :doc:`How to Change gc_grace_seconds for a Table </kb/gc-grace-seconds>` - How to change the ``gc_grace_seconds`` parameter and prevent data resurrection.
* :doc:`How to flush old tombstones from a table </kb/tombstones-flush>` - How to remove old tombstones from SSTables.
* :doc:`Increase Cache to Avoid Non-paged Queries </kb/increase-permission-cache>` - How to increase the ``permissions_cache_max_entries`` setting.
* :doc:`How to Safely Increase the Replication Factor </kb/rf-increase>`
* :doc:`Facts about TTL, Compaction, and gc_grace_seconds <ttl-facts>`
* :doc:`Efficient Tombstone Garbage Collection in ICS <garbage-collection-ics>`
* **segregate** - Fixes partition/row/mutation-fragment out-of-order errors by segregating the output into as many SStables as required so that the content of each output SStable is properly ordered.
* **validate** - Validates the content of the SStable, reporting any corruptions found. Writes no output SStables. In this mode, scrub has the same outcome as the `validate operation <scylla-sstable-validate-operation_>`_ - and the validate operation is recommended over scrub.
Output SStables are written to the directory specified via ``--output-directory``. They will be written with the ``BIG`` format and the highest supported SStable format, with generations chosen by scylla-sstable. Generations are chosen such
that they are unique among the SStables written by the current scrub.
The output directory must be empty; otherwise, scylla-sstable will abort scrub. You can allow writing to a non-empty directory by setting the ``--unsafe-accept-nonempty-output-dir`` command line flag.
Note that scrub will be aborted if an SStable cannot be written because its generation clashes with a pre-existing SStable in the output directory.
Output SStables are written to the directory specified via ``--output-dir``. They will be written with the ``BIG`` format and the highest supported SStable format, with random generation.
validate-checksums
^^^^^^^^^^^^^^^^^^
@@ -870,7 +866,7 @@ The SSTable version to be used can be overridden with the ``--version`` flag, al
SSTables which are already on the designated version are skipped. To force rewriting *all* SSTables, use the ``--all`` flag.
Output SSTables are written to the path provided by the ``--output-dir`` flag, or to the current directory if not specified.
This directory is expected to exist and be empty. If not empty the tool will refuse to run. This can be overridden with the ``--unsafe-accept-nonempty-output-dir`` flag.
This directory is expected to exist.
It is strongly recommended to use the system schema tables as the schema source for this command, see the :ref:`schema options <scylla-sstable-schema>` for more details.
A schema which is good enough to read the SSTable and dump its content may not be good enough to write its content back verbatim.
@@ -882,6 +878,25 @@ But even an altered schema which changed only the table options can lead to data
The mapping of input SSTables to output SSTables is printed to ``stdout``.
filter
^^^^^^
Filter the SSTable(s), including/excluding specified partitions.
Similar to ``scylla sstable dump-data --partition|--partition-file``, with some notable differences:
* Instead of dumping the content to stdout, the filtered content is written back to SSTable(s) on disk.
* Also supports negative filters (keep all partitions except those specified).
The partition list can be provided either via the ``--partition`` command line argument, or via a file path passed to the ``--partitions-file`` argument. The file should contain one partition key per line.
Partition keys should be provided in the hex format, as produced by `scylla types serialize </operating-scylla/admin-tools/scylla-types/>`_.
With ``--include``, only the specified partitions are kept from the input SSTable(s). With ``--exclude``, the specified partitions are discarded and won't be written to the output SSTable(s).
It is possible that certain input SSTable(s) won't have any content left after the filtering. These input SSTable(s) will not have a matching output SSTable.
By default, each input sstable is filtered individually. Use ``--merge`` to filter the combined content of all input sstables, producing a single output SSTable.
Output sstables use the latest supported sstable format (can be changed with ``--sstable-version``).
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
.. note::
If table audit is enabled, the ``audit`` keyspace is automatically created with ``NetworkTopologyStrategy``.
You must also alter the ``audit`` keyspace to remove replicas from the decommissioned data-center. For example:
.. code-block:: shell
cqlsh> ALTER KEYSPACE audit WITH REPLICATION={'class' : 'NetworkTopologyStrategy', 'US-DC' : 3, 'ASIA-DC' : 0, 'EUROPE-DC' : 3};
Failure to do so will result in decommission errors such as "zero replica after the removal".
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.
* :doc:`Preventing Quorum Loss in Symmetrical Multi-DC Clusters <arbiter-dc>`
* :doc:`Cluster Platform Migration Using Node Cycling </operating-scylla/procedures/cluster-management/cluster-platform-migration>`
.. panel-box::
  :title: Topology Changes
  :id: "getting-started"