Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.
Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.
We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.
We also raise an internal error to prevent a segmentation fault in a few places.
Fixes#27987
Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.
Closesscylladb/scylladb#28558
* github.com:scylladb/scylladb:
raft topology: prevent accessing nullptr returned by topology::find
raft topology: make some assertions non-crashing
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).
Closesscylladb/scylladb#28376
* github.com:scylladb/scylladb:
docs: decommission: note audit ks may require ALTERing
docs: mention table audit enabled by default
audit: disable DDL by default
db/config: enable table audit by default
test/cluster: fix `test_table_desc_read_barrier` assertion
test/cluster: adjust audit in tests involving decommissioning its ks
audit_test: fix incorrect config in `test_audit_type_none`
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28540
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)
It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.
Improving tests, not backporting
Closesscylladb/scylladb#28569
* https://github.com/scylladb/scylladb:
test: Merge test_restore_primary_replica_different_... tests
test: Merge test_restore_primary_replica_same_... tests
test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
Fix the build of the test and the upload operation flow
No need to backport since it is only a test we barely use
Closesscylladb/scylladb#28595
* github.com:scylladb/scylladb:
s3_perf: fix upload operation flow
s3_perf: fix the CMake build
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.
Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.
To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.
Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.
Refs scylladb/siren#15317Closesscylladb/scylladb#28563
* github.com:scylladb/scylladb:
tablets, config: Reduce migration concurrency to 2
tablets: load_balancer: Always accept migration if the load is 0
config, tablets: Make tablet migration concurrency configurable
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.
We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).
Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1
Fixes https://github.com/scylladb/scylladb/issues/27941Closesscylladb/scylladb#28704
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.
Fix by using --pull=always to get a fresh image.
Ref https://scylladb.atlassian.net/browse/SCYLLADB-714Closesscylladb/scylladb#28680
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.
This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.
Refs: SCYLLADB-678
Closesscylladb/scylladb#28703
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.
Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.
Closesscylladb/scylladb#28710
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: SCYLLADB-105
Closesscylladb/scylladb#28080
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.
Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.
Closesscylladb/scylladb#28581
With audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
DDL audit category doesn't make sense if its enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, listed all possible audit categories for user's
convenience.
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.
Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.
BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.
Skip such repairs at the scheduler level for auto repair.
If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.
Fixes SCYLLADB-561
Closesscylladb/scylladb#28640
This commit introduces four changes:
- In the `table` example, singular forms (node, partition) are changed to
plural forms (nodes, partitions). Currently, the default `table`
audit configuration is RF=3 and writes use CL=ONE. Therefore,
a `table` audit log write failure should not be caused by a single
node unavailability, and plural forms are more adequate.
- In the `table` example, unreachability due to network issues is
mentioned because with RF=3, audit failure due to network problems
is more likely to happen than a simultaneous failure of three
nodes (such network failures happened in SCYLLADB-706).
- In the `syslog` example, a slash `/` is changed to `or`, so `table`
and `syslog` examples have similar structure.
- As the `syslog` line is already being changed, I also change `unix`
to `Unix`, as the capitalized form is the correct one.
Refs SCYLLADB-706
Closesscylladb/scylladb#28702
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fdClosesscylladb/scylladb#28530
* github.com:scylladb/scylladb:
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache.
Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh.
Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement.
Backport: no, it's a new feature.
Fixes https://github.com/scylladb/scylladb/issues/7397
Fixes https://github.com/scylladb/scylladb/issues/3693
Fixes https://github.com/scylladb/scylladb/issues/2589
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147Closesscylladb/scylladb#28078
* github.com:scylladb/scylladb:
test: boost: add auth cache tests
auth: add cache size metrics
docs: conf: update permissions cache documentation
auth: remove old permissions cache
auth: use unified cache for permissions
auth: ldap: add permissions reload to unified cache
auth: add permissions cache to auth/cache
auth: add service::revoke_all as main entry point
auth: explicitly life-extend resource in auth_migration_listener
The hostent::addr_list is deprecated in favor of address_entry::addr
field that contains the very same addresses.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28566
All users of it had been updated to get the streaming group elsewhere,
so this getter is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#28527
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.
The documentation is adjusted since cache will be
persistent for sccache without further work.
Closesscylladb/scylladb#28524
There are some places that get `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector.
Recently in seastar there appeared stream_range_as_array() helper that helps streaming any range without converting it into intermediate collection. Some of the hottest users of `map_to_key_value()` had been converted, this PR converts few remainders and removes the helper in question to encourage further usage of the stream_range_as_array().
Code cleanup, not backporting
Closesscylladb/scylladb#28491
* github.com:scylladb/scylladb:
api: Remove map_to_key_value() helpers
api: Streamify view_build_statuses handler
api: Streamify few more storage_service/ handlers
api: Add map_to_json() helper
api: Coroutinize view_build_statuses handler
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.
Closesscylladb/scylladb#28490
Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop.
Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated.
The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`.
Refs: https://github.com/scylladb/seastar/pull/3236
No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes).
Closesscylladb/scylladb#28573
* github.com:scylladb/scylladb:
init: fix infinite loop on npos wrap with updated Seastar
init: remove unnecessary object creation in emplace calls
test_node_ops_tasks.py::test_get_children fails due to timeout of
tasks_vt_get_children injection in debug mode. Compared to a successful
run, no clear root cause stands out.
Extend the message timeout of tasks_vt_get_children from 10s to 60s.
Fixes: #28295.
Closesscylladb/scylladb#28599
What changed
Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys
Why (Requirements Summary)
Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git.
This will create a new release in the SMI Jira project when a milestone is added to scylladb.git.
Fixes:PM-190
Closesscylladb/scylladb#28585
Right now the slowest test in the test/cqlpy directory is
cassandra_tests/validation/entities/collections_test.py::
testMapWithLargePartition
This test (translated from Cassandra's unit test), just wants to verify
that we can write and flush a partition with a single large map - with
200 items totalling around 2MB in size.
200 items totalling 2MB is large, but not huge, and is not the reason
why this test was so so slow (around 9 seconds). It turns out that most
of the test time was spent in Python code, preparing a 2MB random string
the slowest possible way. But there is no need for this string to be
random at all - we only care about the large size of the value, not the
specific characters in it!
Making the characters written in this text constant instead of random
made it 20 times fast - it now takes less than half a second.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#28271
File streaming only releases the file descriptors of a tablet being
streamed in the very streaming end. Which means that if the streaming
tablet has compaction on largest tier finished after streaming
started, there will be always ~2x space amplification for that
single tablet. Since there can be up to 4 tablets being migrated
away, it can add up to a significant amount, since nodes are pushed
to a substantial usage of available space (~90%).
We want to optimize this by dropping reference to a sstable after
it was fully streamed. This way, we reduce the chances of hitting
2x space amplification for a given tablet.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#28505
Fedora 45 tightened the default installation checks [1]. As a result
the cassandra-stress rpm we provide no longer installs.
Install it with --no-gpgchecks as a workaround. It's our own package
so we trust it. Later we'll sign it properly.
We install its dependencies via the normal methods so they're still
checked.
[1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_defaultClosesscylladb/scylladb#28687
Today S3 client has well established and well testes (hopefully) http request retry strategy, in the rest of clients it looks like we are trying to achieve the same writing the same code over and over again and of course missing corner cases that already been addressed in the S3 client.
This PR aims to extract the code that could assist other clients to detect the retryability of an error originating from the http client, reuse the built in seastar http client retryability and to minimize the boilerplate of http client exception handling
No backport needed since it is only refactoring of the existing code
Closesscylladb/scylladb#28250
* github.com:scylladb/scylladb:
exceptions: add helper to build a chain of error handlers
http: extract error classification code
aws_error: extract `retryable` from aws_error
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
Closesscylladb/scylladb#28592
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Improves performance of deserialization of vector data for calculating similarity functions.
Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float>
and then pass it to similarity functions as a std::span<const float>.
This avoids overhead of data_value allocations and conversions.
Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`:
client concurrency 1: before: ~135 QPS, after: ~1005 QPS
client concurrency 20: before: ~280 QPS, after: ~2097 QPS
Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search)
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471Closesscylladb/scylladb#28615
Fixes#28678
If replenish loop exits the sleep condition, with an empty queue,
when "_shutdown" is already set, a waiter might get stuck, unsignalled
waiting for segments, even though we are exiting.
Simply move queue abort to always be done on loop exit.
Closesscylladb/scylladb#28679
Fixes parsing of comma-separated seed lists in "init.cc" and
"cql_test_env.cc" to use the standard `split_comma_separated_list`
utility, avoiding manual `npos` arithmetic. The previous code relied on
`npos` being `uint32_t(-1)`, which would not overflow in `uint64_t`
target and exit the loop as expected. With Seastar's upcoming change
to make `npos` `size_t(-1)`, this would wrap around to zero and cause
an infinite loop.
Switch to `split_comma_separated_list` standardized way of tokenization
that is also used in other places in the code. Empty tokens are handled
as before. This prevents startup hangs and test failures when Seastar is
updated.
Refs: scylladb/seastar#3236
The cache is covered already with general auth
dtests but some cases are more tricky and easier
to express directly as calls to cache class.
For such tests boost test file was added.