Commit Graph

52062 Commits

Author SHA1 Message Date
Dario Mirovic
51e7c2f8d9 dtest: wait_for speedup
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.

This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.

Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.

This patch also improves performance of some alternator tests that
use the same wait_for dtest function.

Refs SCYLLADB-573
2026-02-25 03:17:46 +01:00
Patryk Jędrzejczak
e8efcae991 Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov
The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3

Cleaning tests, not backporting

Closes scylladb/scylladb#28600

* https://github.com/scylladb/scylladb:
  test/object_store: Remove create_ks_and_cf() helper
  test/object_store: Replace create_ks_and_cf() usage with standard methods
  test/object_store: Shift indentation right for test cases
2026-02-20 15:53:38 +01:00
Nadav Har'El
d01915131a test/cqlpy: make test_indexing_paging_and_aggregation much faster
Currently, test_secondary_index.py::test_indexing_paging_and_aggregation
is very slow, and the slowest test in the test/cqlpy framework: It takes
around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep),
it is much slower on debug builds. The reason for this slowness is that it
needs to set up and read over 10,000 rows which is the default
select_internal_page_size.

But after the patches in pull request (#25368), we can configure
select_internal_page_size, so in this patch we change the test to
temporarily reduce this option to just 50, and then the test can reach
the same code paths with just 142 rows instead of 20120 rows before this
patch.

As a result, the test should now be 140 times faster than it was before.
In practice, because of some fixed overheads (the test creates several
tables and indexes), in dev build mode the test run speedup is "only"
26-fold (to around half a second).

I verified that removing the code added in bb08af7 indeed makes the new
shorter test fail - and this is the only test in test_secondary_index.py
that starts to fail besides test_index_paging_group_by which is also
related (so my revert didn't just break secondary indexing completely).
So the shorter test is still a good regression test.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28268
2026-02-20 15:44:53 +02:00
Avi Kivity
92bc5568c5 tools: toolchain: build sanitizers for future toolchain
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.

Closes scylladb/scylladb#28733
2026-02-20 15:44:24 +02:00
Botond Dénes
6c04e02f66 Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov
The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off.

Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway.

Minor test fix, not backporting

Closes scylladb/scylladb#28607

* github.com:scylladb/scylladb:
  test: Fix the condition for streaming directions validation
  test: Split test_backup.py::check_data_is_back() into two
2026-02-20 15:42:10 +02:00
Botond Dénes
6f88c0dbd3 Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec
Currently, the test assumes that when
'topology_coordinator_pause_before_processing_backlog: waiting' is
logged, the task for decommission must be there. This was based on the
assumption that topology coordinator is idle and decommission request
wakes it up. But if the server is slow enough, it may still be running
the load balancer in reaction to table creation, and block on that
injection point before decommission request was added.

Fix by waiting for the task to appear rather than the injection.

Fixes SCYLLADB-715

Only 2026.1 vulnerable.

Closes scylladb/scylladb#28688

* github.com:scylladb/scylladb:
  test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance
  test: cluster: task_manager_client: Introduce wait_task_appears()
  tests: pylib: util: Add exponential backoff to wait_for
2026-02-20 15:05:36 +02:00
Pavel Emelyanov
c96420c015 tests: Re-use manager.get_server_exe()
There's a bunch of incremental repair tests that want to call scylla
sstable command. For that they try to find where scylla binary by
scanning /proc directory (see local_process_id and get_scylla_path
helpers).

There's shorter way -- just call manager.get_server_exe().

Same for backup-restore test.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28676
2026-02-20 14:59:30 +02:00
Pavel Emelyanov
a4a0d75eee test/object_store: Parametrize test_simple_backup_and_restore()
There are three tests and a function with a pair of boolean parameters
called by those. It's less code if the function becomes a test with
parameters.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28677
2026-02-20 14:57:30 +02:00
Pavel Emelyanov
a2e1293f86 test/object_store: Squash two simple-backup tests together
The test_backup_simple creates a ks/cf, takes a snapshot, backs it up,
then checks that the files were uploaded. The test_backup_move does the
same, but also plays with 'move_files' parameter to be true/false.

In fact, the "move" test was the copy of "simple" one that dropepd check
for scheduling group being "streaming" (backup with --move-files can
check the same, it's not bad), and check for destination bucket to
contain needed files (same here -- checking that files arrived to bucket
after --move-files is good).

In the end of the day, after the change backup test is run two times,
instead of three, and performs extra checks for --move-files case.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28606
2026-02-20 14:49:30 +02:00
Botond Dénes
7e90ed657c Merge 'Fix client_options docs' from Karol Baryła
https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message.
This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key.

Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension.

This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit.

Backport: none, because this fixes internal documentation.

Closes scylladb/scylladb#28126

* github.com:scylladb/scylladb:
  protocol-extensions.md: Fix client_options docs
  system_keyspace.md: Add client_options column
  system_keyspace.md: Fix order in system.clients
2026-02-20 14:23:34 +02:00
Pavel Emelyanov
525cb5b3eb table: Use fmt::to_string() to stringify compation group ID
Doing it with format("{}", foo) is correct, but to_string is
a bit more lightweight.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28630
2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak
d399a197f5 Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek
The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem.

Fixes SCYLLADB-665

Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions.

Closes scylladb/scylladb#28624

* https://github.com/scylladb/scylladb:
  test: raft: Add test_aborting_wait_for_state_change
  raft: Describe exception types for wait_for_state_change and wait_for_leader
  raft: Await instead of returning future in wait_for_state_change
2026-02-20 12:17:22 +01:00
Raphael S. Carvalho
f33f324f77 mutation_compactor: Fix tombstone GC metrics to account for only expired
There are 3 metrics (that goes in every compaction_history entry):
total_tombstone_purge_attempt
total_tombstone_purge_failure_due_to_overlapping_with_memtable
total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable

When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or
grace period), it can be currently accounted as failure due to
overlapping with either memtable or uncompacting sstable.
So those 2 last metrics have noise of *unexpired* tombstones.

What we should do is to only account for expired tombstones in all
those 3 metrics. We lose the info of knowing the amount of tombstones
processed by compaction, now we'll only know about the expired ones.
But those metrics were primarily added for explaining why expired
tombstones cannot be removed.
We could have alternatively added a new field
purge_failure_due_to_being_unexpired or something, but
it requires adding a new field to compaction_history.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28669
2026-02-20 10:43:58 +02:00
Botond Dénes
0bf4c68af5 Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa
Links were pointing to the `debian` subdirectory. However, there docker build was refactored to use `redhat`: 1abf981a73, see https://github.com/scylladb/scylladb/pull/22910

No backport, just a README link fixes.

Closes scylladb/scylladb#28699

* github.com:scylladb/scylladb:
  docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory
  docs: fix link to docker build README.MD
2026-02-20 08:21:46 +02:00
Avi Kivity
66bef0ed36 lua, tools: adjust for lua 5.5 lua_newstate seed parameter
Lua 5.5 adds a seed parameter to lua_newstate(), provide it
with a strong random seed.

Closes scylladb/scylladb#28734
2026-02-20 06:52:37 +02:00
Avi Kivity
27a5502f14 Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz
The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves
run_standalone function to common code in perf.hh with necessary templating.

We also add extensive testing so that it's more difficult to break the tooling in the future.

Fixes SCYLLADB-560
Backport: no, internal tooling improvement

Closes scylladb/scylladb#28541

* github.com:scylladb/scylladb:
  test: cluster: add tests for perf tools
  test: perf: fix port race condition on startup in connect workload
  test: perf: prepare benchmarks to bind to custom host
  test: perf: make perf-alterantor remote port configurable
  test: perf: fix ASAN leak warnings in perf-alternator
  Reapply "main: test: add future and abort_source to after_init_func"
2026-02-19 19:12:46 +02:00
Dawid Mędrek
c9d192c684 Merge 'raft ropology: prevent crashes of multiple nodes' from Patryk Jędrzejczak
Some assertions in the Raft-based topology are likely to cause crashes of
multiple nodes due to the consistent nature of the Raft-based code. If the
failing assertion is executed in the code run by each follower (e.g., the code
reloading the in-memory topology state machine), then all nodes can crash. If
the failing assertion is executed only by the leader (e.g., the topology
coordinator fiber), then multiple consecutive group0 leaders will chain-crash
until there is no group0 majority.

Crashing multiple nodes is much more severe than necessary. It's enough to
prevent the topology state machine from making more progress. This will
naturally happen after throwing a runtime error. The problematic fiber will be
killed or will keep failing in a loop. Note that it should be safe to block
the topology state machine, but not the whole group0, as the topology state
machine is mostly isolated from the rest of group0.

We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT`
with `on_internal_error`. These are not all occurrences, as some fatal
assertions make sense, for example, in the bootstrap procedure.

We also raise an internal error to prevent a segmentation fault in a few places.

Fixes #27987

Backporting this PR is not required, but we can consider it at least for 2026.1
because:
- it is LTS,
- the changes are low-risk,
- there shouldn't be many conflicts.

Closes scylladb/scylladb#28558

* github.com:scylladb/scylladb:
  raft topology: prevent accessing nullptr returned by topology::find
  raft topology: make some assertions non-crashing
2026-02-19 16:50:03 +01:00
Marcin Maliszkiewicz
22c3d8d609 Merge 'db/config: enable table audit by default' from Piotr Smaron
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222

No backport: table audit has been enabled in 2026.1 in `scylla.yaml`,
and should be always on starting from the next release,
which is the release we're currently merging to (2026.2).

Closes scylladb/scylladb#28376

* github.com:scylladb/scylladb:
  docs: decommission: note audit ks may require ALTERing
  docs: mention table audit enabled by default
  audit: disable DDL by default
  db/config: enable table audit by default
  test/cluster: fix `test_table_desc_read_barrier` assertion
  test/cluster: adjust audit in tests involving decommissioning its ks
  audit_test: fix incorrect config in `test_audit_type_none`
2026-02-19 16:30:11 +01:00
Pavel Emelyanov
b4b9b547ce replica: Remove unused sched groups from keyspace and table configs
Compaction and statement groups are carried over on those configs, but
are in fact unused. Drop both.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28540
2026-02-19 15:47:31 +01:00
Patryk Jędrzejczak
45115415fb Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov
There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines)

It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later.

Improving tests, not backporting

Closes scylladb/scylladb#28569

* https://github.com/scylladb/scylladb:
  test: Merge test_restore_primary_replica_different_... tests
  test: Merge test_restore_primary_replica_same_... tests
  test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all
  test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all
2026-02-19 15:42:55 +01:00
Pavel Emelyanov
26372e65df Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky
Fix the build of the test and the upload operation flow

No need to backport since it is only a test we barely use

Closes scylladb/scylladb#28595

* github.com:scylladb/scylladb:
  s3_perf: fix upload operation flow
  s3_perf: fix the CMake build
2026-02-19 15:31:43 +02:00
Avi Kivity
7ec710c250 Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317

Closes scylladb/scylladb#28563

* github.com:scylladb/scylladb:
  tablets, config: Reduce migration concurrency to 2
  tablets: load_balancer: Always accept migration if the load is 0
  config, tablets: Make tablet migration concurrency configurable
2026-02-19 15:31:43 +02:00
Dawid Mędrek
fae71f79c2 test: raft: Add test_aborting_wait_for_state_change 2026-02-19 14:21:01 +01:00
Dawid Mędrek
e4f2b62019 raft: Describe exception types for wait_for_state_change and wait_for_leader
The methods of `raft::server` are abortable and if the passed
`abort_source` is triggered, they throw `raft::request_aborted`.
We document that.

Although `raft::server` is an interface, this is consistent with
the descriptions of its other methods.
2026-02-19 12:47:14 +01:00
Dawid Mędrek
c36623baad raft: Await instead of returning future in wait_for_state_change
The `try-catch` expression is pretty much useless in its current form.
If we return the future, the awaiting will only be performed by the
caller, completely circumventing the exception handling.

As a result, instead of handling `raft::request_aborted` with a proper
error message, the user will face `seastar::abort_requested_exception`
whose message is cryptic at best. It doesn't even point to the root
of the problem.

Fixes SCYLLADB-665
2026-02-19 12:47:14 +01:00
Marcin Maliszkiewicz
de4e5e10af test: perf: fix prepared statements logic in perf-simple-query
Due to lack of checks present in process_execute_internal from
transport/server.cc needs_authorization bool was always set to true
doing some extra work (check_access()) for each request.

We mirror the logic in this patch in test env which perf-simple-query
uses. This can also potentially improve runtime of unittests (marginally).

Note that bug is only in perf tool not scylla itself, the fix
decreases insns/op by around 10%:
Before: 41065 insns/op
After: 37452 insns/op
Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1

Fixes https://github.com/scylladb/scylladb/issues/27941

Closes scylladb/scylladb#28704
2026-02-19 12:42:07 +02:00
Avi Kivity
58a662b9db dist: refresh container base image (ubi9-minimal)
Using an outdated image can cause problems when `microdnf update`
runs, if the distribution doesn't maintain good update hygiene.
Although, I suspect that when update failures happen they're really
caused by propagation delay of packages to mirrors.

Fix by using --pull=always to get a fresh image.

Ref https://scylladb.atlassian.net/browse/SCYLLADB-714

Closes scylladb/scylladb#28680
2026-02-19 12:42:43 +03:00
Ferenc Szili
f1bc17bd4c load_stats: fix race condition when computing sum_tablet_sizes
In storage_service::load_stats_for_tablet_based_tables(), we are passing
a reference to sum_tablet_sizes to the lambda which increments this value
on each shard via map_reduce0(). This means we could have a race
condition because this is executed on separate threads/CPUs.

This patch fixed the problem by collecting the sums by shard into a
vector, then summing those up.

Refs: SCYLLADB-678

Closes scylladb/scylladb#28703
2026-02-19 12:29:25 +03:00
Avi Kivity
dee868b71a interval: avoid clang 23 warning on throw statement in potentially noexcept function
interval_data's move constructor is conditionally noexcept. It
contains a throw statemnt for the case that the underlying type's
move constructor can throw; that throw statemnt is never executed
if we're in the noexept branch. Clang 23 however doesn't understand
that, and warns about throwing in a noexcept function.

Fix that by rewriting the logic using seastar::defer(). In the
noexcept case, the optimizer should eliminate it as dead code.

Closes scylladb/scylladb#28710
2026-02-19 12:24:20 +03:00
Ernest Zaslavsky
45d824e0fe s3_perf: fix upload operation flow
Correct the upload operation logic. The previous flow incorrectly
checked for the test file on S3 even when performing operations that do
not download the file, such as uploads.
2026-02-19 11:14:59 +02:00
Botond Dénes
b637e17b19 db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: SCYLLADB-105

Closes scylladb/scylladb#28080
2026-02-19 09:51:09 +01:00
Calle Wilund
8e71a6f52a gcp: Add handling of 429 (too many requests) to exponential backoff
Fixes: SCYLLADB-611

Adds http error code 429 to codes handled by exponential backoff.

Closes scylladb/scylladb#28588
2026-02-19 09:42:39 +01:00
Marcin Maliszkiewicz
3417d50add test: cluster: add tests for perf tools
It checks if all workloads can be properly
executed with succesfull startup and teardown.

Especially testing alternator in remote mode is important
because it's invoked like this during pgo training in pgo.py.

Test runtime:
Release - 24s
Debug - 1m 15s

Test time consists mostly of Scylla startup in various modes.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
c69534504c test: perf: fix port race condition on startup in connect workload
Other workloads at startup call prepopulate() which connects
with retry loop therefore it waits until cql port is open.

This commit adds a single place where we will wait for port
for all workloads.

Timeout is set to 5 minutes so that even slowest machines
are able to start.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
828f2fbdb1 test: perf: prepare benchmarks to bind to custom host
This is usefull for tests where we use local networks
like 127.5.5.5 to avoid port and host collisions.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
9f2b97bef4 test: perf: make perf-alterantor remote port configurable
It could be a usefull option to have.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
f5a212e91e test: perf: fix ASAN leak warnings in perf-alternator
Those were intentional as test process is short lived
but when we add automated tests in the following commits
we expect clean exit, with 0 exit code.
2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz
0c76c73e34 Reapply "main: test: add future and abort_source to after_init_func"
This reverts commit ceec703bb7.

The commit was fixed with abort source handling for alternator
standalone path so it's safe to reapply.
2026-02-19 09:33:10 +01:00
Piotr Dulikowski
7d6f734a51 dictionary compression: add missing co_awaits on get_units
There is a handful of places in the code related to dictionary
compression which calls get_units to acquire semaphore units but the
returned future is not awaited, seemingly by mistake. The result of
get_units is assigned to a variable - which is reasonable at a glance
because the semaphore units need to be assigned to a variable in order
to control their scope - but at the same time if co_await is mistakenly
omitted, like here, doing so will silence the nodiscard check of
seastar::future and, effectively, the get_units call will be nearly
useless. Unfortunately, this is an easy mistake to make.

Fix the places in the code that acquire semaphore units via get_units
but never await the future returned by it. I found them by manual code
inspection, so I hope that I didn't miss any.

Closes scylladb/scylladb#28581
2026-02-18 16:40:40 +01:00
Ernest Zaslavsky
4026b54a5e s3_perf: fix the CMake build
Fix the CMake build of the perf_s3_client by adding the
necessary linkage with the jsoncpp library.
2026-02-18 17:12:08 +02:00
Piotr Smaron
797c5cd401 docs: decommission: note audit ks may require ALTERing
With audit feature enabled, it's not immediately obvious that its
pseudo-system keyspace `audit` may require adjusting its RF across DCs
before decommissioning a node, and this should be documented.
2026-02-18 15:14:57 +01:00
Piotr Smaron
65eec6d8e7 docs: mention table audit enabled by default
Also align the documentation with the current audit settings.
2026-02-18 15:14:57 +01:00
Piotr Smaron
c30607d80b audit: disable DDL by default
DDL audit category doesn't make sense if its enabled by default on its
own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables
setting is empty. This may be counter-intuitive to our users, who may
expect to actually see these statements logged if we're enabling this by
default. Also, it doesn't make sense to enable a setting by default if
it has no effect.
Additionally, listed all possible audit categories for user's
convenience.
2026-02-18 15:14:57 +01:00
Piotr Smaron
08dc1008ba db/config: enable table audit by default
In https://github.com/scylladb/scylladb/pull/27262 table audit has been
re-enabled by default in `scylla.yaml`, logging certain categories to a table,
which should make new Scylla deployments have audit enabled.
Now, in the next release, we also want to enable audit in `db/config.cc`,
which should enable audit for all deployments, which don't explicitly configure
audit otherwise in `scylla.yaml` (or via cmd line).
BTW. Because this commit aligns audit's default config values in `db/config.cc`
to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which
is based on `db/config.cc` will start showing that table audit is the default.

Refs: https://github.com/scylladb/scylladb/issues/28355
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222
2026-02-18 15:14:57 +01:00
Piotr Smaron
95ee4a562c test/cluster: fix test_table_desc_read_barrier assertion
The test `assertion desc_schema[0] == desc_schema[1]` does a direct
list comparison, which is order-sensitive. Before enabling audit by default,
both nodes would return only the test keyspace/table, so the order
didn't matter. With audit enabled, there will be multiple keyspaces,
and they can be returned in different order by different nodes.
2026-02-18 15:14:57 +01:00
Piotr Smaron
2e12b83366 test/cluster: adjust audit in tests involving decommissioning its ks
When table audit is enabled, Scylla creates the "audit" ks with
NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail
for the audit ks with "zero replica after the removal" when all nodes from a DC
are removed, and so we have to ALTER audit ks to either zero the number of its
replicas, to allow for a clear decommission, or have them in the 2nd DC.

BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but
in dtests repository.
2026-02-18 15:14:55 +01:00
Piotr Smaron
0cf20fa15a audit_test: fix incorrect config in test_audit_type_none
Passing Python `None` to setup is incorrect, because config updates are sent
as a dict and `None` is treated as "unset" - meaning: use Scylla's default.
Using the explicit string "none" to guarantee that audit is disabled.
2026-02-18 15:12:26 +01:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Andrzej Jackowski
4221d9bbfd docs: improve examples in Handling Audit Failures section
This commit introduces four changes:
 - In the `table` example, singular forms (node, partition) are changed to
   plural forms (nodes, partitions). Currently, the default `table`
   audit configuration is RF=3 and writes use CL=ONE. Therefore,
   a `table` audit log write failure should not be caused by a single
   node unavailability, and plural forms are more adequate.
 - In the `table` example, unreachability due to network issues is
   mentioned because with RF=3, audit failure due to network problems
   is more likely to happen than a simultaneous failure of three
   nodes (such network failures happened in SCYLLADB-706).
 - In the `syslog` example, a slash `/` is changed to `or`, so `table`
   and `syslog` examples have similar structure.
 - As the `syslog` line is already being changed, I also change `unix`
   to `Unix`, as the capitalized form is the correct one.

Refs SCYLLADB-706

Closes scylladb/scylladb#28702
2026-02-18 13:10:01 +01:00
Botond Dénes
3bfd47da4b Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

Closes scylladb/scylladb#28530

* github.com:scylladb/scylladb:
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
2026-02-18 13:48:49 +02:00