- Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum
- Handle unhandled timeout exception in the wait-for-leader loop during group startup
When a strongly consistent table is dropped, `schedule_raft_group_deletion()` calls `g->close()` which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion.
Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state.
SCYLLADB-2080 fix:
- Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released).
- Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling.
- Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between).
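The reordering can be illustrated with a small asyncio model. This is only a sketch of the ordering argument, not the Seastar code: `Gate`, `FakeRaftServer`, `StoppedError` and `delete_group` are hypothetical stand-ins for the gate, the raft server, `raft::stopped_error` and `schedule_raft_group_deletion`.

```python
import asyncio


class StoppedError(Exception):
    """Hypothetical stand-in for raft::stopped_error."""


class Gate:
    """Tiny model of a gate: counts holders and can be closed."""

    def __init__(self):
        self._holders = 0
        self._closed = False
        self._drained = asyncio.Event()
        self._drained.set()

    def try_hold(self):
        if self._closed:
            return False
        self._holders += 1
        self._drained.clear()
        return True

    def release(self):
        self._holders -= 1
        if self._holders == 0:
            self._drained.set()

    def close(self):
        # New holders are refused immediately; the returned awaitable
        # completes once every existing holder has been released.
        self._closed = True
        return self._drained.wait()


class FakeRaftServer:
    def __init__(self):
        self._aborted = asyncio.Event()

    async def add_entry(self):
        # Quorum is unreachable in this model, so only abort() ends the call.
        await self._aborted.wait()
        raise StoppedError()

    def abort(self):
        self._aborted.set()


async def write(gate, server):
    assert gate.try_hold()
    try:
        await server.add_entry()
        return "ok"
    except StoppedError:
        return "stopped"
    finally:
        gate.release()


async def delete_group(gate, server):
    closed = gate.close()   # 1. no new operations may enter
    server.abort()          # 2. stuck writes fail with StoppedError
    await closed            # 3. holders are gone, so this resolves promptly


async def demo():
    gate, server = Gate(), FakeRaftServer()
    writer = asyncio.create_task(write(gate, server))
    await asyncio.sleep(0)  # let the write take its gate hold
    await asyncio.wait_for(delete_group(gate, server), timeout=1)
    return await writer, gate.try_hold()
```

With the old order (await the gate first, abort afterwards) step 3 would come before step 2 and `delete_group` would block for as long as the write stays stuck.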
Timeout handling fix:
- Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled.
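As a rough Python analogy of that control flow (the actual change uses Seastar's `coroutine::as_future`; the function and parameter names below are made up for illustration), the timeout is turned into a logged warning and a clean exit from the loop:

```python
import asyncio
import logging

log = logging.getLogger("groups_manager_sketch")


async def wait_for_leader(has_leader, state_changed, timeout):
    """Return True once has_leader() holds, or False if the deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while not has_leader():
        remaining = max(deadline - loop.time(), 0)
        try:
            await asyncio.wait_for(state_changed.wait(), timeout=remaining)
        except asyncio.TimeoutError:
            # Handle the expiry here instead of letting it propagate.
            log.warning("timed out waiting for a leader, giving up")
            return False
        state_changed.clear()
    return True


async def demo():
    changed = asyncio.Event()
    timed_out = await wait_for_leader(lambda: False, changed, timeout=0.05)
    found = await wait_for_leader(lambda: True, changed, timeout=0.05)
    return timed_out, found
```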
Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that:
1. Pauses a write on the leader before add_entry
2. Drops the table (follower destroys its group immediately)
3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds
backport: no need, strong consistency is not released yet
Fixes: SCYLLADB-2080
Closes scylladb/scylladb#30105
* github.com:scylladb/scylladb:
strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
strong consistency: abort raft server before gate close when dropping a table
test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
When a strongly consistent table is dropped, schedule_raft_group_deletion()
used to call g->close() first, which waits for all in-flight operations to
release their gate holders. But other nodes may have already destroyed their
raft servers for this group, so an in-flight write on the leader cannot
reach quorum and hangs until the client timeout expires, unnecessarily
delaying group deletion.
Fix: initiate gate close (prevents new operations from entering), then
abort the raft server (causes in-flight add_entry/read_barrier to throw
raft::stopped_error, releasing their gate holders), then await the gate
future (resolves immediately since holders are now released).
Handle raft::stopped_error in the coordinator's top-level catch blocks
(both write and read paths): if the table no longer exists, return
no_such_column_family (which the CQL layer converts to InvalidRequest
'unconfigured table'). Otherwise fall through to the default timeout
handling.
Also replace gate->hold() with try_hold() + on_internal_error in
acquire_server, and handle the timeout exception in the wait-for-leader
loop in update() gracefully (log + break instead of propagating).
Fixes: SCYLLADB-2080
Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes
the quorum-loss scenario: when a table is dropped, the follower destroys
its raft group immediately while the leader's in-flight operations are
still holding the gate.
The test pauses both a read and a write on the leader, drops the table,
then resumes them. Both are expected to fail with 'no such column family'
since the raft server is aborted as part of group deletion. A 15-second
timeout guard detects the old buggy behavior (write stuck forever).
Marked xfail until the fix is applied in the next commit.
The purpose of this test is to verify that the task manager's "wait" API
works correctly for vnodes-to-tablets migration virtual tasks. It starts
a `wait_task` HTTP request concurrently with a finalize (or rollback)
operation, and asserts that the wait returns the correct final state
("done" or "suspended").
The test uses `asyncio.create_task()` to wrap the wait request into a
task, and then immediately calls finalize. With asyncio's lazy task
scheduling, the wait coroutine does not start until the event loop
yields, so the finalization request reaches the server before wait, and
therefore may also complete before it. Once finalization completes, the
virtual migration task is no longer discoverable, causing a
"task not found" error.
Add a log message in Scylla's wait handler and a synchronization point
in the test to ensure that the wait request reaches the server before
finalization. This follows the same pattern used in
`test_tablet_tasks.py::check_and_abort_repair_task`.
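The lazy scheduling is easy to reproduce in isolation. The sketch below is not the test itself: `wait_task`, `finalize` and the `wait_started` event are illustrative names, and the real test synchronizes on a server log message rather than on an in-process event.

```python
import asyncio


async def demo(synchronize):
    order = []
    wait_started = asyncio.Event()

    async def wait_task():
        order.append("wait")
        wait_started.set()

    async def finalize():
        order.append("finalize")

    # create_task() only schedules the coroutine; it does not run yet.
    task = asyncio.create_task(wait_task())
    if synchronize:
        # Yield to the event loop until the wait request is in flight.
        await wait_started.wait()
    await finalize()
    await task
    return order
```

Without the synchronization point `demo(False)` records finalize first, which is the ordering that led to the "task not found" error.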
Fixes SCYLLADB-2077
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#29973
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.
Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.
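A toy model of the lifecycle, assuming simplified stand-ins for `task_manager` and `compaction_manager` (the class layout and the string module name are illustrative only):

```python
from enum import Enum, auto


class State(Enum):
    NONE = auto()      # constructed, enable() not called yet
    RUNNING = auto()
    STOPPED = auto()


class TaskManager:
    def __init__(self):
        self.modules = set()

    def stop(self):
        if self.modules:
            raise RuntimeError(
                "Tried to stop task manager while some modules were not unregistered")


class CompactionManager:
    def __init__(self, task_manager):
        self._tm = task_manager
        self._state = State.NONE
        self._tm.modules.add("compaction")   # registered in the constructor

    def enable(self):
        self._state = State.RUNNING

    def do_stop(self):
        if self._state is State.NONE:
            # Early shutdown: nothing is running, but the module must
            # still be unregistered.
            self._tm.modules.discard("compaction")
            self._state = State.STOPPED
        elif self._state is State.RUNNING:
            self._really_do_stop()

    def _really_do_stop(self):
        self._tm.modules.discard("compaction")
        self._state = State.STOPPED


def early_shutdown():
    tm = TaskManager()
    cm = CompactionManager(tm)
    cm.do_stop()   # enable() was never called
    tm.stop()      # would raise if the module were still registered
    return tm.modules
```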
Fixes: SCYLLADB-2106
Backport to all supported branches, as the bug is there and it has
already caused a failure in 2026.1 CI.
Closes scylladb/scylladb#30015
* github.com:scylladb/scylladb:
test: add test_stop_before_starting_compaction_manager
compaction_manager: unregister compaction module on early shutdown
The keyspace RF test starts zero-token nodes as part of its topology setup.
The python driver 3.29.9 can't schedule queries on zero-token nodes, so waiting for `CQL_ALTERNATOR_QUERIED` on those nodes is the wrong readiness gate.
This change makes the zero-token `server_add()` calls stop at `CQL_ALTERNATOR_CONNECTED`.
The test still exercises the keyspace replication assertions through a normal token-owning contact point.
Verified with running all 4 variations of `cluster.test_keyspace_rf::test_create_keyspace_with_default_replication_factor` on this branch.
Closes scylladb/scylladb#29779
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.
By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.
Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.
The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request
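The mapping above is small enough to write out. This is a Python sketch of the classification only; the real `parse_consistency_level()` is C++, and the string-based CL names, `ReadMode` and `InvalidRequest` are assumptions made for the example.

```python
from enum import Enum


class ReadMode(Enum):
    LINEARIZABLE = "linearizable"          # forwarded to the raft leader
    NON_LINEARIZABLE = "non_linearizable"  # served locally, may be stale


class InvalidRequest(Exception):
    pass


def parse_consistency_level(cl: str) -> ReadMode:
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return ReadMode.LINEARIZABLE
    if cl in ("ONE", "LOCAL_ONE"):
        return ReadMode.NON_LINEARIZABLE
    raise InvalidRequest(f"unsupported consistency level: {cl}")


def is_rejected(cl: str) -> bool:
    try:
        parse_consistency_level(cl)
        return False
    except InvalidRequest:
        return True
```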
Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.
Fixes: SCYLLADB-1157
Closes scylladb/scylladb#29575
* github.com:scylladb/scylladb:
strong_consistency: test read forwarding to leader
strong_consistency: skip read_barrier() for non-linearizable reads
strong_consistency: split coordinator-level read latency metrics
strong_consistency: forward linearizable reads to raft leader
strong_consistency: classify reads by consistency level
strong_consistency: add begin_read() to raft_server
In database::truncate_table_on_all_shards, disable logstor compaction
before the table data is truncated, similarly to how non-logstor
compaction is disabled, to avoid race conditions between logstor
compaction and segments discarding.
Fixes SCYLLADB-2186
Test the linearizable read forwarding behavior in a single test that
exercises all scenarios on one cluster:
- CL=QUORUM reads on leader, follower, and non-replica nodes
- CL=ONE reads (non-linearizable, no forwarding)
- Linearizability: write + CL=QUORUM read from follower (10 iterations)
- Coordinator latency histogram metrics for both read types
Refs: SCYLLADB-1157
Split the latency metrics for strongly consistent reads into two
categories: linearizable and non-linearizable. They replace the
existing metrics for both types combined - this shouldn't cause
issues because the feature is still experimental and both the
initial introduction of latency metrics and the split will be
a part of the same release.
Also fix a test that was using the old metric.
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
registered on the CV. The loop kept spinning, holding the HTTP
server's task gate open and preventing http_server::stop() from
completing. After ~15 minutes, systemd killed the process with
SIGABRT.
The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.
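The "register the waiter first, then run the async check" pattern can be modeled with plain asyncio futures. Everything here is a simplified stand-in: `ConditionVariable` imitates only the broadcast/broken behaviour described above, and `task_finished` is passed in as a coroutine function.

```python
import asyncio


class BrokenConditionVariable(Exception):
    pass


class ConditionVariable:
    def __init__(self):
        self._waiters = []

    def wait(self):
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return fut

    def _wake(self, exc=None):
        waiters, self._waiters = self._waiters, []
        for fut in waiters:
            if not fut.done():
                fut.set_exception(exc) if exc else fut.set_result(None)

    def broadcast(self):
        self._wake()

    def broken(self):
        self._wake(BrokenConditionVariable())


async def wait_until_finished(event, task_finished):
    while True:
        waiter = event.wait()        # register before the async check
        if await task_finished():    # may yield; broadcasts are not lost
            waiter.cancel()
            return True
        await waiter                 # raises BrokenConditionVariable on shutdown


async def demo():
    cv, state = ConditionVariable(), {"done": False, "checks": 0}

    async def task_finished():
        state["checks"] += 1
        snapshot = state["done"]
        await asyncio.sleep(0)       # a broadcast can land during the check
        return snapshot

    waiter = asyncio.create_task(wait_until_finished(cv, task_finished))
    await asyncio.sleep(0)           # the first check is now in progress
    state["done"] = True
    cv.broadcast()                   # arrives mid-check
    finished = await asyncio.wait_for(waiter, timeout=1)

    async def never_finished():
        return False

    stuck = asyncio.create_task(wait_until_finished(cv, never_finished))
    await asyncio.sleep(0)
    cv.broken()                      # shutdown path
    try:
        await asyncio.wait_for(stuck, timeout=1)
        unblocked = False
    except BrokenConditionVariable:
        unblocked = True
    return finished, state["checks"], unblocked
```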
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532
Closes scylladb/scylladb#29485
The test_split_and_incremental_repair_synchronization[True] test was
timing out waiting for 'Finalizing resize decision for table' in
debug mode.
The root cause is a timing race: the incremental_repair_prepare_wait
error injection has a hardcoded 60s auto-expiry timeout
(wait_for_message(60s)), but split compactions in debug mode take ~58s
per SSTable due to -O0 compilation and scheduler starvation (the
maintenance_compaction group gets ~10% of wall-clock time). When the
injection auto-expires before split finalization, the repair fails,
leaving tablets stuck in transition=repair state. This prevents the
topology coordinator from finalizing the split, causing the 600s test
timeout.
Fix both contributing factors:
- Increase the injection timeout from 60s to 10min, giving split
compactions ample time to complete before the injection auto-expires.
The test explicitly messages the injection to release it (line 2200),
so the longer timeout is just a safety net.
- Reduce data volume from 256 to 64 rows (and repair data from 256 to
64 rows), producing smaller SSTables that split much faster in debug
mode.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2123.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#30004
This is about ungraceful stop, where the node is killed.
Test cases typically need to wait for other nodes to notice that the
node is down before proceeding. By default, that takes about 20s. This can
be reduced via config by lowering the failure detector threshold, but that is
not the best solution:
- cannot set the threshold too low, or we'll introduce flakiness due to false
positives
- so it's still slow (a couple of seconds)
- developers forget about it and the test still works
This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter.
At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps:
- the series starts with defaulting to convict=False
- each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable
- finally, the "convict" parameter is made mandatory
There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost.
Tested on dev-mode
cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain:
Wall clock time reduced from 41s to 16s
No backport: enhancement
Closes scylladb/scylladb#28495
* https://github.com/scylladb/scylladb:
test: gossiper: Add test for natural failure detection
test: pylib: Make convict a required parameter in server_stop()
test: Annotate server_stop() calls where conviction is harmful
test: Annotate server_stop() calls where conviction is beneficial
test: Annotate server_stop() calls where conviction is useless
test: pylib: Add convict option to server_stop()
api: failure_detector: Introduce convict-node API
gms: gossiper: Make convict() public and safe to call from any scheduling group
api: Extract validate functions to common header
9646ee05bd changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace.
No backport, just a missing test case.
Closes scylladb/scylladb#29542
* github.com:scylladb/scylladb:
test: audit: pin empty-keyspace DDL audit behavior
test: audit: restart server when any non-live config key changes
test: audit: rename 'needed' to 'target_config' for clarity
Add test_natural_failure_detection which verifies that the failure
detector detects a killed node as DOWN without using the convict
mechanism. Uses the failure_detector_timeout fixture to keep the
FD timeout short (2s in release mode).
This ensures that natural failure detection continues to work
correctly even as other tests adopt the convict mechanism for speed.
Add explicit convict=False to server_stop() calls where convicting
the node would break or weaken the test.
In test_backoff_when_node_fails_task_rpc, the desired behavior is for
the node to not be marked as down immediately:
# The purpose of this is to simulate a situation when the gossiper
# doesn't mark a dead node as such immediately.
In raft tests, conviction could trigger voter reassignment while the
test wants to exercise the scenario where the voters are still down.
In test_tablet_mv_replica_pairing_during_replace, conviction triggers
SCYLLADB-1996 (replace fails with "Failed to add server").
Add explicit convict=True to server_stop() calls where the test
needs other nodes to detect the stopped node as DOWN in order to
proceed. These are cases before remove_node, replace, or explicit
waits for failure detection (server_not_sees_other_server,
wait_new_coordinator_elected).
Convicting immediately speeds up the test.
Pass convict=False explicitly to server_stop() calls where conviction
provides no benefit because there is no consumer of the failure
detection:
- single-node clusters (no other node to call the API on)
- all nodes being stopped concurrently (no live node remains)
- immediate restart (no test logic between stop and start
depends on other nodes detecting the stopped node as dead)
- node stopped for file manipulation or bootstrap abort
- majority killed with no quorum on surviving nodes to react
- no test logic depends on other nodes detecting the failure
This is a no-op change since the default is already convict=False,
but makes the intent explicit for each call site.
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).
Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.
The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed: the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True`, so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.
Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.
Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.
Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked
Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1
Closes scylladb/scylladb#29882
* https://github.com/scylladb/scylladb:
storage_service: cancel write handlers during drain to prevent shutdown deadlock
test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
test: scylla_cluster: allow stop() to bypass start_stop_lock
error_injection: add non-shared mode to wait_for_message
error_injection: release waiters when injection is disabled
There is a small window, just after the view building coordinator releases
the group0 guard and before it waits on view_building_state_machine's CV,
in which the coordinator may miss a CV broadcast triggered by finished remote
work.
To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV and is checked before awaiting on the CV.
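A minimal asyncio model of the flag, under the assumption that the coordinator consumes and resets it once it wakes up (class and method names are invented for the sketch, and `asyncio.Condition` needs a lock that the Seastar CV does not):

```python
import asyncio


class CoordinatorSketch:
    def __init__(self):
        self._cv = asyncio.Condition()
        self._remote_work_finished = False

    async def on_remote_work_finished(self):
        async with self._cv:
            self._remote_work_finished = True   # set before broadcasting
            self._cv.notify_all()

    async def await_event(self):
        async with self._cv:
            # wait_for() checks the predicate before sleeping, so a broadcast
            # that happened while nobody was waiting is not lost.
            await self._cv.wait_for(lambda: self._remote_work_finished)
            self._remote_work_finished = False


async def demo():
    coordinator = CoordinatorSketch()
    # The broadcast arrives inside the window, with no waiter registered.
    await coordinator.on_remote_work_finished()
    await asyncio.wait_for(coordinator.await_event(), timeout=1)
    return "woken"
```

Without the flag, the same sequence would leave `await_event()` blocked until the next, unrelated broadcast.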
Fixes SCYLLADB-2029
The problem is not critical, but the fix should be backported to 2025.4 and newer versions, all of which contain the view building coordinator.
Closes scylladb/scylladb#27313
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add reproducer
db/view/view_building_coordinator: add flag to mark if any remote work was finished
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.
Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.
The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.
Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such a loop
is broken as soon as the tablet maps are updated.
On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.
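The cache rules described so far (record the final node, validate at use time, evict entries for vanished groups) can be sketched as follows. `LeaderCache` and its method names are hypothetical; the real cache lives in groups_manager and is written in C++.

```python
class LeaderCache:
    def __init__(self):
        self._leaders = {}   # raft group_id -> node believed to be the leader

    def record(self, group_id, node):
        self._leaders[group_id] = node

    def pick_write_target(self, group_id, replicas, closest_replica):
        leader = self._leaders.get(group_id)
        if leader is not None:
            if leader in replicas:
                return leader
            # The cached node is no longer a replica: evict and fall back.
            del self._leaders[group_id]
        return closest_replica

    def on_token_metadata_update(self, existing_groups):
        # Drop entries for groups that no longer exist; keep the rest and
        # rely on use-time validation for staleness.
        self._leaders = {g: n for g, n in self._leaders.items()
                         if g in existing_groups}

    def __contains__(self, group_id):
        return group_id in self._leaders


def demo():
    cache = LeaderCache()
    cache.record("g1", "n3")
    cache.record("g2", "n1")
    hit = cache.pick_write_target("g1", {"n2", "n3"}, closest_replica="n2")
    stale = cache.pick_write_target("g1", {"n2", "n4"}, closest_replica="n2")
    evicted = "g1" not in cache
    cache.on_token_metadata_update(existing_groups={"g1"})
    return hit, stale, evicted, "g2" in cache
```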
An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.
We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.
Fixes: SCYLLADB-1064
Closes scylladb/scylladb#29392
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.
Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):
1. Queries system_schema.scylla_tables for rows where committed_by_group0
is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
cell (from the in-memory schema). The committed_by_group0 = true flag
is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.
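The control flow of steps 1-6 can be sketched as a retry loop. This only models the ordering of the steps: `find_unfixed`, `take_guard` and `announce` are caller-supplied placeholders, and `ConcurrentModification` stands in for group0_concurrent_modification.

```python
import random
import time


class ConcurrentModification(Exception):
    """Stand-in for group0_concurrent_modification."""


def fix_up_tables(find_unfixed, take_guard, announce,
                  max_delay=0.01, sleep=time.sleep):
    """Return the number of announce attempts (0 if nothing needed fixing)."""
    attempts = 0
    while True:
        if not find_unfixed():            # step 1: anything to fix?
            return attempts
        guard = take_guard()              # step 2
        tables = find_unfixed()           # step 3: re-check under the guard
        if not tables:
            return attempts
        attempts += 1
        try:
            announce(guard, tables)       # steps 4-5
            return attempts
        except ConcurrentModification:
            sleep(random.uniform(0, max_delay))   # step 6: back off, retry


def demo():
    pending = {"ks.t1"}
    calls = []

    def announce(guard, tables):
        calls.append(sorted(tables))
        if len(calls) == 1:
            raise ConcurrentModification()   # first attempt loses the race
        pending.clear()

    first = fix_up_tables(lambda: set(pending), object, announce,
                          sleep=lambda _: None)
    second = fix_up_tables(lambda: set(pending), object, announce)
    return first, second
```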
On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.
This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.
Closes scylladb/scylladb#29911
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.
Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted
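A generic polling helper of this shape, written here as a standalone sketch (`poll_until` is an illustrative name, not the helper the test uses; the convention that the predicate returns None to mean "not yet" is an assumption of the example):

```python
import asyncio
import time


async def poll_until(pred, timeout, period=0.01):
    """Call pred() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not reached before the deadline")
        await asyncio.sleep(period)


async def demo():
    log_size = {"value": 3}   # pretend raft log, truncated in the background

    async def truncate_later():
        await asyncio.sleep(0.02)
        log_size["value"] = 0

    async def log_is_empty():
        return True if log_size["value"] == 0 else None

    background = asyncio.create_task(truncate_later())
    ok = await poll_until(log_is_empty, timeout=2)
    await background
    return ok
```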
Fixes: SCYLLADB-2120
Closes scylladb/scylladb#29967
9646ee05bd changed behavior of empty keyspace handling
and this code path was never tested for CQL audit.
Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE
targeting both an existing keyspace and a nonexistent
one to verify both are audited with empty keyspace.
Before 9646ee05bd, an empty keyspace in audit_info
would be checked against audit_keyspaces like any other
value, silently skipping the statement when "" did not
match any configured keyspace. That commit introduced a
will_log() helper that treats an empty keyspace as
unfilterable, so these DDL statements are now always
logged when their category matches.
Refs SCYLLADB-1641
_check_restart_needed only compared NON_LIVE_AUDIT_KEYS against the
running server config, so extra keys like enable_user_defined_functions
were silently ignored and never applied. Generalize the check to
restart whenever any key outside LIVE_AUDIT_KEYS differs.
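The generalized check amounts to a single comparison over the target config. In this sketch the contents of `LIVE_AUDIT_KEYS` and the function name are placeholders chosen for the example, not the values used by the test suite.

```python
# Placeholder set: which keys are actually live-updatable is an assumption here.
LIVE_AUDIT_KEYS = {"audit_categories", "audit_keyspaces", "audit_tables"}


def restart_needed(running_config: dict, target_config: dict) -> bool:
    """True if any key outside LIVE_AUDIT_KEYS differs from the running config."""
    return any(
        running_config.get(key) != value
        for key, value in target_config.items()
        if key not in LIVE_AUDIT_KEYS
    )
```

For example, a target that only changes `audit_keyspaces` does not force a restart, while one that adds `enable_user_defined_functions` does.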
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record to index, we compare it with the existing
record, and insert it only if it has a newer timestamp.
Add a segment sequence number that is a global (per-shard) increasing
number that is allocated when getting a new segment for write, and is
written in buffer headers in the segment.
It is used to distinguish between buffers written to different generations
of a segment, and for recovery to break ties by keeping the record
from the newest segment.
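The two comparison rules can be sketched as below. The record layout and function names are invented for the example, and the sketch assumes that "breaking ties" means records with equal timestamps are ordered by segment sequence number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    timestamp: int     # taken from the row marker at write time
    segment_seq: int   # sequence number of the segment holding the record
    value: str


def insert_if_newer(index: dict, rec: Record) -> bool:
    """Normal write path: replace the indexed record only if rec is newer."""
    current = index.get(rec.key)
    if current is None or rec.timestamp > current.timestamp:
        index[rec.key] = rec
        return True
    return False


def recover(records) -> dict:
    """Recovery: newest timestamp wins, ties go to the newest segment."""
    index = {}
    for rec in records:
        current = index.get(rec.key)
        if current is None or (
            (rec.timestamp, rec.segment_seq)
            > (current.timestamp, current.segment_seq)
        ):
            index[rec.key] = rec
    return index


def demo():
    index = {}
    first = insert_if_newer(index, Record("k", 10, 1, "a"))
    older = insert_if_newer(index, Record("k", 5, 2, "b"))
    recovered = recover([Record("k", 10, 2, "new"), Record("k", 10, 1, "old")])
    return first, older, index["k"].value, recovered["k"].value
```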
Refs https://scylladb.atlassian.net/browse/SCYLLADB-770
no backport - logstor is a new feature
Closes scylladb/scylladb#29933
* github.com:scylladb/scylladb:
test: logstor: add basic delete test
logstor: rewrite segment seq num from streaming
logstor: add segment sequence number
logstor: get_segment helper
logstor: compare records by timestamp
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.
If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
(the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
deadlock (the barrier_and_drain RPC handler holds the gate open).
Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().
Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.
Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.
Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
The existing test only covers the case where the shutting-down node is
NOT the topology coordinator (deadlocks in uninit_messaging_service).
When the node IS the coordinator, the deadlock manifests differently:
the topology coordinator fiber calls barrier_and_drain on itself
(without messaging), and do_drain -> wait_for_group0_stop blocks
because the coordinator can't stop while stale_versions_in_use is
waiting on the uncancelled write handler.
Run the test twice on the same 2-node cluster (RF=2):
- Run 1: target is a non-coordinator
- Restore cluster state (restart target, decommission added node)
- Run 2: target is the topology coordinator
Use CL=ONE so the write completes from the local replica even with
the other server's response paused.
Mark as xfail since this reproduces bugs not yet fixed on this branch.
Refs: SCYLLADB-1842
The test was written for another case, and was not supposed to
reproduce the issue that was fixed in this PR.
Fix the test to reproduce the real scenario:
1. Use one_shot=False for pause_before_barrier_and_drain so the
injection fires on every barrier_and_drain RPC, not just the first.
2. Let the first barrier_and_drain through (at this point the write
handler's ERM version matches the current token_metadata version).
3. Wait for the second barrier_and_drain. Between the two calls,
topology_state_load installs a new token_metadata version. The
write handler still holds the old version's ERM — now stale.
4. After stop_transport completes, disable the injection (rather than
sending a single message) to release the paused handler and any
subsequent ones that arrived during stop_transport. The 'disabled'
flag in injection_shared_data ensures all waiters wake up.
With these changes the test reliably fails (shutdown deadlock within
15s) on the unfixed code and passes on the fixed version from
e0dc73f52a ('Cancel all write requests on storage_proxy shutdown').
Refs: scylladb/scylladb#23665
asyncio cancel() only affects the client-side coroutine. The
server-side addserver handler in the cluster manager continues
running. If it can't complete (e.g. no raft quorum because the
target node is shut down), the orphaned handler blocks _after_test
cleanup for 120s.
Await the task instead so it completes cleanly (we restart the
target node first to restore quorum).
Add a 15s timeout around the shutdown_task await. If the timeout
fires, the deadlock is reproduced (shutdown hung because
stale_versions_in_use blocks on a write handler holding a stale
token_metadata version).
When the timeout fires, explicitly kill the node via
server_stop() so that the manager's _after_test handler does not
wait 120s for the stuck stop_gracefully request. Then fail the
test with a clear message.
In SCYLLADB-2058 we observed a timeout exception while querying the base
table after restarting nodes 2 and 3.
Unfortunately, logs don't give us much useful information about the
root cause.
This patch adds basic checks that nodes see each other after the restart
and that the CQL connection sees the restarted node.
It doesn't guarantee that the error won't occur again: in the logs from
SCYLLADB-2058 we see that each node sees the others via gossip after part of
the cluster is restarted.
In case the error occurs again, this commit also increases the logging
level of `cql_server` and `storage_proxy`.
Refs SCYLLADB-2058
Closes scylladb/scylladb#29951
This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches.
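The matching semantics described above can be approximated in a few lines. This is a sketch, not the implementation: the `Rule` field names are hypothetical, and Python's `fnmatchcase` is used as a stand-in that does not support the extended glob syntax of POSIX `fnmatch`.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase

TABLE_SCOPED = {"DML", "DDL", "QUERY"}   # AUTH, ADMIN, DCL ignore the table filter


@dataclass
class Rule:
    sinks: set
    categories: set
    tables: list = field(default_factory=list)   # patterns for "keyspace.table"
    roles: list = field(default_factory=list)    # patterns for role names


def rule_matches(rule: Rule, category: str, role: str, table: str) -> bool:
    if category not in rule.categories:          # empty list matches nothing
        return False
    if not any(fnmatchcase(role, p) for p in rule.roles):
        return False
    if category in TABLE_SCOPED:
        return any(fnmatchcase(table, p) for p in rule.tables)
    return True


def sinks_for(rules, category: str, role: str, table: str = "") -> set:
    """Union of the sinks of every matching rule."""
    sinks = set()
    for rule in rules:
        if rule_matches(rule, category, role, table):
            sinks |= rule.sinks
    return sinks


EXAMPLE_RULES = [
    Rule(sinks={"syslog"}, categories={"DML"}, tables=["ks.*"], roles=["app_*"]),
    Rule(sinks={"table"}, categories={"AUTH", "DDL"}, tables=[], roles=["*"]),
]
```

In the full feature the result would additionally be unioned with whatever the legacy `audit_categories`, `audit_keyspaces` and `audit_tables` settings select; that part is left out here.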
To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node.
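A rough sketch of the preprocessed-cache idea, under simplifying assumptions: rules are reduced to `(role_patterns, table_patterns)` tuples, categories and sinks are ignored, and keeping the known sets in sync through notifications is out of scope. `RuleCache` and its methods are hypothetical names, not the classes added by this series.

```python
from fnmatch import fnmatchcase


class RuleCache:
    """Precomputed role -> rule indexes and table -> rule indexes."""

    def __init__(self, rules, known_roles, known_tables):
        # rules: list of (role_patterns, table_patterns) tuples (simplified).
        self.rules = rules
        self.by_role = {r: self._match(r, 0) for r in known_roles}
        self.by_table = {t: self._match(t, 1) for t in known_tables}

    def _match(self, name, field):
        # Direct evaluation: glob-match `name` against every rule.
        return frozenset(
            i for i, rule in enumerate(self.rules)
            if any(fnmatchcase(name, p) for p in rule[field])
        )

    def matching_rules(self, role, table):
        if not self.rules:
            return frozenset()  # empty audit_rules: no glob evaluation
        # Known entities hit the precomputed sets; unknown ones fall back
        # to direct rule evaluation.
        roles = self.by_role.get(role)
        if roles is None:
            roles = self._match(role, 0)
        tables = self.by_table.get(table)
        if tables is None:
            tables = self._match(table, 1)
        return roles & tables
```

The lookups use `is None` rather than truthiness so that a known entity that matches no rule (an empty set) is still served from the cache.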
**Performance**
Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:
```
===============================================================================================================================================================================
Configuration | Binary | throughput (tps) | insns/op | cpu_cycles/op | alloc/op | logal/op | task/op
===============================================================================================================================================================================
audit=none [1] | baseline | 206922.4 | 36591.6 | 15348.3 | 58.1 | 0.0 | 14.1
audit=none [1] | this PR | 207856.4 (+0.5%) | 36544.9 (-0.1%) | 15274.0 (-0.5%) | 58.1 | 0.0 | 14.1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2] | baseline | 94871.8 | 54163.0 | 27172.4 | 72.0 | 0.0 | 24.0
audit=syslog keyspaces=ks [2] | this PR | 96138.4 (+1.3%) | 54072.3 (-0.2%) | 26699.3 (-1.7%) | 72.0 | 0.0 | 24.0
audit=syslog audit-rules=ks [3] | this PR | 95142.1 (+0.3%) | 54457.8 (+0.5%) | 26953.8 (-0.8%) | 72.0 | 0.0 | 24.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4] | baseline | 213997.8 | 36735.6 | 14848.1 | 58.1 | 0.0 | 14.1
audit=syslog keyspaces=ks-non-existent [4] | this PR | 219297.2 (+2.5%) | 36667.3 (-0.2%) | 14500.1 (-2.3%) | 58.1 | 0.0 | 14.1
audit=syslog audit-rules=ks-non-existent [5] | this PR | 211038.7 (-1.4%) | 36999.7 (+0.7%) | 15048.6 (+1.4%) | 58.1 | 0.0 | 14.1
===============================================================================================================================================================================
[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```
Fixes: SCYLLADB-1430
No backport: new feature
Closes scylladb/scylladb#29267
* github.com:scylladb/scylladb:
test: alternator: audit: rules filtering and batch bypass
test: perf: add --audit-rules option to perf-simple-query
docs: add audit rules section to the auditing guide
test: audit: cover role and schema cache notifications
test: audit: cover audit rules cluster behavior
audit: rebuild rule caches on group0 snapshot and role changes
audit: refresh rule caches on schema, role, and config changes
audit: route matching rules to configured sinks
test: cover preprocessed audit rule cache
audit: add preprocessed rule matching cache
audit: pass sink targets to storage helpers
test: audit: cover rule matching semantics
audit: add rule matching and sink helpers
test: audit: cover audit_rules configuration
config: add live audit_rules option
test: cover audit rule parsing and validation
audit: define audit_rule type with parsing and validation
The test was starting Scylla with --write-request-timeout-in-ms=500 on the
command line. This tight timeout also applied to paxos state table creation,
which goes through raft and can take longer than 500ms on slow platforms
(e.g. aarch64/dev). When the first batch of CAS requests triggered paxos
state table creation under error injection, the raft schema change could
still be in-flight when the second batch fired, causing spurious WriteTimeout
failures unrelated to the semaphore bug being tested.
Fix by changing the write timeout at runtime via the REST API: lower it to
500ms only for the error-injection CAS phase (after table creation is done),
then restore it to 10000ms before the second batch that must succeed.
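The lower-then-restore pattern can be sketched as a small context manager. This is an illustration only: `temporary_setting` is a hypothetical helper and `set_value` stands in for whatever call the test uses to update the write timeout; the real test may be structured differently.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(set_value, lowered, restored):
    """Apply `lowered` for the duration of the block, then apply `restored`."""
    set_value(lowered)
    try:
        yield
    finally:
        # Restore even if the error-injection phase raises.
        set_value(restored)
```

The error-injection CAS phase would run inside `with temporary_setting(set_value, 500, 10000):`, so the second batch always runs with the restored value.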
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104
Closes scylladb/scylladb#29969
Verify on a multi-node cluster that role creation/alter/
drop and table/materialized-view create/drop trigger
updates to the preprocessed audit-rules cache on every
node, and that a matching DML on the newly created table
is audited via the cache.
Refs SCYLLADB-1430
Cluster-level tests should validate rule matching, live
updates, sink routing, role filtering, and error handling
without rerunning the broader audit suite.
Add audit_rules to LIVE_AUDIT_KEYS so the test framework
tracks it as a live-updatable config key. Test that rules
with empty categories or roles match nothing, that DML
rules coexist with legacy audit config, AUTH rules fire
on login events, CQL and REST API update paths reject
invalid JSON, per-rule sink routing works for table and
syslog, role-based filtering works across sessions, and
sink mismatch produces a warning in server logs.
Refs SCYLLADB-1430
Add a per-scheduling-group latency histogram on the transport level that measures the full CQL request lifetime: from fetching the request buffer until the response is written to the socket.
Today latencies are accounted only on the storage proxy level, leaving the time spent in the transport layer (response queue wait + actual I/O) unaccounted. Having both transport and storage proxy latencies allows operators to tell where latency accumulates.
The metric is exposed as scylla_transport_cql_request_latency_histogram with the scheduling_group_name label, following the cql_ prefix convention of all other per-SG transport metrics.
Fixes: SCYLLADB-1691
New feature, no backport.
Closes scylladb/scylladb#29878
* github.com:scylladb/scylladb:
test/cluster: add test for per-service-level transport request latency histogram
transport: add per-service-level transport request latency histogram
Between stopping a server and excluding it, wait for other nodes to see
the server as down, otherwise exclude may see the server as alive and
fail.
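A minimal polling sketch of that wait, assuming a hypothetical `is_seen_alive(observer)` callable that asks one node whether it still considers the stopped server alive. The test framework's own helpers are not shown here and may look different.

```python
import time


def wait_until_seen_down(observers, is_seen_alive, timeout=60.0, period=0.5):
    """Poll until no observer reports the stopped server as alive."""
    deadline = time.monotonic() + timeout
    while True:
        still_alive = [o for o in observers if is_seen_alive(o)]
        if not still_alive:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still seen alive by: {still_alive}")
        time.sleep(period)
```

Only after this returns would the test go on to exclude the server.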
Fixes SCYLLADB-2110
Closes scylladb/scylladb#29966
Verify that the new scylla_transport_cql_request_latency_histogram metric
correctly records transport-level request latencies per service level.
Uses error injection to pause a request mid-flight and verifies that the
histogram is not updated while the request is paused (since the response
has not been written yet), and is updated after the request completes.
Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
Add more racks to dc2 to verify that the default replication factor
covers all available racks (rather than e.g. limited to 3).
With tablets and rf_rack_valid_keyspaces, verify also the automatically
selected rack list.
Restrict the extension to non-debug build modes to prevent running out
of memory with --repeat=100.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#29931
Python tests require different handling of metrics gathering from
cgroup than C++ tests. pytest does not execute each Python test in
a separate process, so we can't put each test into its own cgroup and
get the metrics.
The idea is to put the whole pytest process into the cgroup and get the
metrics. This works because pytest runs the threads as completely
separate processes, and inside each thread it runs tests consecutively.
Additionally, to simplify things, the system resource monitor is moved
to the pytest main thread.
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.
Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py
Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
add_test_list(), build_test_list(), all_tests(), test_count(),
SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
Test attributes (args, core_args, valid_exit_codes, allure_dir,
is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
_prepare_pytest_params(), pattern, test_file_ext, xmlout,
server_log, scylla_env setup, and shlex import.
Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
architecture, log paths, and removed features.
Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
Closes scylladb/scylladb#29613
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record into the index, we compare it with the existing
record, and insert it only if it has a newer timestamp.
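The insert rule can be sketched as follows, with the index modeled as a plain dict from key to `(value, timestamp)`. `insert_if_newer` is a hypothetical name, and treating equal timestamps as "not newer" is an assumption based on the wording above.

```python
def insert_if_newer(index, key, value, timestamp):
    """Insert (value, timestamp) under key unless an equal-or-newer record exists."""
    existing = index.get(key)
    if existing is not None and existing[1] >= timestamp:
        return False  # keep the existing record
    index[key] = (value, timestamp)
    return True
```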
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.
Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.
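A sketch of such a retry loop, assuming the unmount is driven from Python via `subprocess.run`. `unmount_with_retry` and its `run` parameter (there to allow substituting a fake in tests) are hypothetical; the actual teardown code may differ.

```python
import subprocess
import time


def unmount_with_retry(mountpoint, attempts=10, delay=0.5, run=subprocess.run):
    """Run `fusermount3 -u`, retrying while it fails (for example 'device busy')."""
    stderr = ""
    for attempt in range(attempts):
        proc = run(["fusermount3", "-u", mountpoint], capture_output=True, text=True)
        if proc.returncode == 0:
            return
        stderr = proc.stderr  # keep the last error for diagnostics
        if attempt + 1 < attempts:
            time.sleep(delay)
    raise RuntimeError(f"failed to unmount {mountpoint}: {stderr.strip()}")
```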
Fixes: SCYLLADB-2049
Closes scylladb/scylladb#29920
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory
Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.
This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements
Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645
Fixes SCYLLADB-1645
CI stability improvement. Backport to versions that have this test.
Closes scylladb/scylladb#29759
* github.com:scylladb/scylladb:
test: prepare max cells inserts
test: reuse max cells schema
test: limits: skip empty max cells iterations