Commit Graph

1374 Commits

Author SHA1 Message Date
Piotr Dulikowski
8dfd455001 Merge 'strong consistency: fix drop table blocking on stuck writes and handle timeout in update()' from Petr Gusev
- Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum
- Handle unhandled timeout exception in the wait-for-leader loop during group startup

When a strongly consistent table is dropped, `schedule_raft_group_deletion`() calls `g->close()` which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion.

Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state.

SCYLLADB-2080 fix:
- Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released).
- Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling.
- Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between).

Timeout handling fix:
- Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled.

Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that:
1. Pauses a write on the leader before add_entry
2. Drops the table (follower destroys its group immediately)
3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds

backport: no need, strong consistency is not released yet

Fixes: SCYLLADB-2080

Closes scylladb/scylladb#30105

* github.com:scylladb/scylladb:
  strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
  strong consistency: abort raft server before gate close when dropping a table
  test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
2026-05-28 09:59:20 +02:00
Petr Gusev
d922c43358 strong consistency: abort raft server before gate close when dropping a table
When a strongly consistent table is dropped, schedule_raft_group_deletion()
used to call g->close() first, which waits for all in-flight operations to
release their gate holders. But other nodes may have already destroyed their
raft servers for this group, so an in-flight write on the leader cannot
reach quorum and hangs until the client timeout expires, unnecessarily
delaying group deletion.

Fix: initiate gate close (prevents new operations from entering), then
abort the raft server (causes in-flight add_entry/read_barrier to throw
raft::stopped_error, releasing their gate holders), then await the gate
future (resolves immediately since holders are now released).

Handle raft::stopped_error in the coordinator's top-level catch blocks
(both write and read paths): if the table no longer exists, return
no_such_column_family (which the CQL layer converts to InvalidRequest
'unconfigured table'). Otherwise fall through to the default timeout
handling.

Also replace gate->hold() with try_hold() + on_internal_error in
acquire_server, and handle the timeout exception in the wait-for-leader
loop in update() gracefully (log + break instead of propagating).

Fixes: SCYLLADB-2080
2026-05-27 12:06:46 +02:00
Petr Gusev
89307064b5 test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes
the quorum-loss scenario: when a table is dropped, the follower destroys
its raft group immediately while the leader's in-flight operations are
still holding the gate.

The test pauses both a read and a write on the leader, drops the table,
then resumes them. Both are expected to fail with 'no such column family'
since the raft server is aborted as part of group deletion. A 15-second
timeout guard detects the old buggy behavior (write stuck forever).

Marked xfail until the fix is applied in the next commit.
2026-05-27 12:06:46 +02:00
Nikos Dragazis
54cb6d4608 test: Order task-wait before finalization in test_migration_wait_task
The purpose of this test is to verify that the task manager's "wait" API
works correctly for vnodes-to-tablets migration virtual tasks. It starts
a `wait_task` HTTP request concurrently with a finalize (or rollback)
operation, and asserts that the wait returns the correct final state
("done" or "suspended").

The test `uses asyncio.create_task()` to wrap the wait request into a
task, and then immediately calls finalize. With asyncio's lazy task
scheduling, the wait coroutine does not start until the event loop
yields, so the finalization request reaches the server before wait, and
therefore may also complete before it. Once finalization completes, the
virtual migration task is no longer discoverable, causing a
"task not found" error.

Add a log message in Scylla's wait handler and a synchronization point
in the test to ensure that the wait request lands the server before
finalization. This follows the same pattern used in
`test_tablet_tasks.py::check_and_abort_repair_task`.

Fixes SCYLLADB-2077

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#29973
2026-05-26 10:43:22 +03:00
Botond Dénes
db89f3f095 Merge 'compaction_manager: unregister compaction module on early shutdown' from Patryk Jędrzejczak
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.

Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.

Fixes: SCYLLADB-2106

Backport to all supported branches, as the bug is there and it has
already caused a failure in 2026.1 CI.

Closes scylladb/scylladb#30015

* github.com:scylladb/scylladb:
  test: add test_stop_before_starting_compaction_manager
  compaction_manager: unregister compaction module on early shutdown
2026-05-25 16:08:20 +03:00
Dmitry Kropachev
06eeaf48ff tests: avoid CQL_ALTERNATOR_QUERIED on zero-token nodes
The keyspace RF test starts zero-token nodes as part of its topology setup.

The python driver 3.29.9 can't schedule queries on zero-token nodes, so waiting for `CQL_ALTERNATOR_QUERIED` on those nodes is the wrong readiness gate.
This change makes the zero-token `server_add()` calls stop at `CQL_ALTERNATOR_CONNECTED`.
The test still exercises the keyspace replication assertions through a normal token-owning contact point.

Verified with running all 4 variations of `cluster.test_keyspace_rf::test_create_keyspace_with_default_replication_factor` on this branch.

Closes scylladb/scylladb#29779
2026-05-25 14:22:04 +03:00
Piotr Dulikowski
3a5dd2e5be Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.

By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.

Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.

The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request

Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.

Fixes: SCYLLADB-1157

Closes scylladb/scylladb#29575

* github.com:scylladb/scylladb:
  strong_consistency: test read forwarding to leader
  strong_consistency: skip read_barrier() for non-linearizable reads
  strong_consistency: split coordinator-level read latency metrics
  strong_consistency: forward linearizable reads to raft leader
  strong_consistency: classify reads by consistency level
  strong_consistency: add begin_read() to raft_server
2026-05-25 10:55:00 +02:00
Michael Litvak
73470150a0 logstor: disable logstor compaction in table truncate
in database::truncate_table_on_all_shards disable logstor compaction
before the table data is truncated, similarly to how non-logstor
compaction is disabled, to avoid race conditions between logstor
compaction and segments discarding.

Fixes SCYLLADB-2186
2026-05-24 10:25:08 +02:00
Wojciech Mitros
45f5df14e5 strong_consistency: test read forwarding to leader
Test the linearizable read forwarding behavior in a single test that
exercises all scenarios on one cluster:
- CL=QUORUM reads on leader, follower, and non-replica nodes
- CL=ONE reads (non-linearizable, no forwarding)
- Linearizability: write + CL=QUORUM read from follower (10 iterations)
- Coordinator latency histogram metrics for both read types

Refs: SCYLLADB-1157
2026-05-23 11:35:37 +02:00
Wojciech Mitros
d07692a7ff strong_consistency: split coordinator-level read latency metrics
Split the latency metrics for strongly consistent reads into two
categories: linearizable and non-linearizable. They replace the
existing metrics for both types combined - this shouldn't cause
issues because the feature is still experimental and both the
initial introduction of latency metrics and the split will be
a part of the same release.

Also fix a test that was using the old metric.
2026-05-23 11:35:37 +02:00
Łukasz Paszkowski
96a992002c tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
   topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
   registered on the CV. The loop kept spinning, holding the HTTP
   server's task gate open and preventing http_server::stop() from
   completing. After ~15 minutes, systemd killed the process with
   SIGABRT.

The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532

Closes scylladb/scylladb#29485
2026-05-22 16:47:48 +03:00
Raphael S. Carvalho
3ba6184462 repair, test: fix split-repair synchronization test timeout in debug mode
The test_split_and_incremental_repair_synchronization[True] test was
timing out waiting for 'Finalizing resize decision for table' in
debug mode.

The root cause is a timing race: the incremental_repair_prepare_wait
error injection has a hardcoded 60s auto-expiry timeout
(wait_for_message(60s)), but split compactions in debug mode take ~58s
per SSTable due to -O0 compilation and scheduler starvation (the
maintenance_compaction group gets ~10% of wall-clock time). When the
injection auto-expires before split finalization, the repair fails,
leaving tablets stuck in transition=repair state. This prevents the
topology coordinator from finalizing the split, causing the 600s test
timeout.

Fix both contributing factors:

- Increase the injection timeout from 60s to 10min, giving split
  compactions ample time to complete before the injection auto-expires.
  The test explicitly messages the injection to release it (line 2200),
  so the longer timeout is just a safety net.

- Reduce data volume from 256 to 64 rows (and repair data from 256 to
  64 rows), producing smaller SSTables that split much faster in debug
  mode.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2123.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#30004
2026-05-22 15:03:47 +03:00
Patryk Jędrzejczak
082936ce43 Merge 'test: pylib: Convict the node on server_stop()' from Tomasz Grabiec
This is about ungraceful stop, where the node is killed.

Test cases typically need to wait for other nodes to notice that the
node is down before proceeding. By default, that takes about 20s. Can
be reduced via config by reducing failure detector threshold, but it's
not the best solution:

 - cannot set the threshold too low, or we'll introduce falkiness due to false
   positives
 - so it's still slow (a couple of seconds)
 - developers forget about it and the test still works

This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter.

At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps:

- the series starts with defaulting to convict=False
- each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable
- finally, the "convict" parameter is made mandatory

There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost.

Tested on dev-mode
cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain:
Wall clock time reduced from 41s to 16s

No backport: enhancement

Closes scylladb/scylladb#28495

* https://github.com/scylladb/scylladb:
  test: gossiper: Add test for natural failure detection
  test: pylib: Make convict a required parameter in server_stop()
  test: Annotate server_stop() calls where conviction is harmful
  test: Annotate server_stop() calls where conviction is beneficial
  test: Annotate server_stop() calls where conviction is useless
  test: pylib: Add convict option to server_stop()
  api: failure_detector: Introduce convict-node API
  gms: gossiper: Make convict() public and safe to call from any scheduling group
  api: Extract validate functions to common header
2026-05-22 13:39:50 +02:00
Patryk Jędrzejczak
b7400d20dd test: add test_stop_before_starting_compaction_manager 2026-05-22 11:58:37 +02:00
Marcin Maliszkiewicz
18dd281e72 Merge 'test: audit: pin empty-keyspace DDL audit behavior' from Andrzej Jackowski
9646ee05bd changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace.

No backport, just a missing test case.

Closes scylladb/scylladb#29542

* github.com:scylladb/scylladb:
  test: audit: pin empty-keyspace DDL audit behavior
  test: audit: restart server when any non-live config key changes
  test: audit: rename 'needed' to 'target_config' for clarity
2026-05-22 09:42:34 +02:00
Tomasz Grabiec
445a8b9a3e test: gossiper: Add test for natural failure detection
Add test_natural_failure_detection which verifies that the failure
detector detects a killed node as DOWN without using the convict
mechanism. Uses the failure_detector_timeout fixture to keep the
FD timeout short (2s in release mode).

This ensures that natural failure detection continues to work
correctly even as other tests adopt the convict mechanism for speed.
2026-05-21 21:33:24 +02:00
Tomasz Grabiec
9b40cf89fe test: Annotate server_stop() calls where conviction is harmful
Add explicit convict=False to server_stop() calls where convicting
the node would break or weaken the test.

In test_backoff_when_node_fails_task_rpc, the desired behavior is for
the node to not be marked as down immediately:

    # The purpose of this is to simulate a situation when the gossiper
    # doesn't mark a dead node as such immediately.

In raft tests, conviction could trigger voter reassignment while the
test wants to test the scenario with voters being still down.

In test_tablet_mv_replica_pairing_during_replace, conviction triggers
SCYLLADB-1996 (replace fails with "Failed to add server").
2026-05-21 21:33:19 +02:00
Tomasz Grabiec
92416d850a test: Annotate server_stop() calls where conviction is beneficial
Add explicit convict=True to server_stop() calls where the test
needs other nodes to detect the stopped node as DOWN in order to
proceed. These are cases before remove_node, replace, or explicit
waits for failure detection (server_not_sees_other_server,
wait_new_coordinator_elected).

Convicting immediately speeds up the test.
2026-05-21 21:31:22 +02:00
Tomasz Grabiec
624fe11178 test: Annotate server_stop() calls where conviction is useless
Pass convict=False explicitly to server_stop() calls where conviction
provides no benefit because there is no consumer of the failure
detection:

 - single-node clusters (no other node to call the API on)
 - all nodes being stopped concurrently (no live node remains)
 - immediate restart (no test logic between stop and start
   depends on other nodes detecting the stopped node as dead)
 - node stopped for file manipulation or bootstrap abort
 - majority killed with no quorum on surviving nodes to react
 - no test logic depends on other nodes detecting the failure

This is a no-op change since the default is already convict=False,
but makes the intent explicit for each call site.
2026-05-21 21:13:55 +02:00
Patryk Jędrzejczak
1ed3f5c4af Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).

Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.

The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.

Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.

Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.

Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked

Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665

backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1

Closes scylladb/scylladb#29882

* https://github.com/scylladb/scylladb:
  storage_service: cancel write handlers during drain to prevent shutdown deadlock
  test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
  test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
  test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
  test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
  test: scylla_cluster: allow stop() to bypass start_stop_lock
  error_injection: add non-shared mode to wait_for_message
  error_injection: release waiters when injection is disabled
2026-05-21 15:43:36 +02:00
Piotr Dulikowski
6148316f66 Merge 'db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak
There is small windows just after view building coordinator releases
group0 guard and before it waits on view_building_state_machine's CV,
when the coordinator may miss CV broadcast triggered by finished remote
work.

To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV and is checked before awaiting on the CV.

Fixes SCYLLADB-2029

The problem is not critical but it should be backported to 2025.4 and newer version, all of them contains view building coordinator.

Closes scylladb/scylladb#27313

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer
  db/view/view_building_coordinator: add flag to mark if any remote work was finished
2026-05-21 15:11:58 +02:00
Wojciech Mitros
13c043903d strong_consistency: cache leader location for non-replica nodes
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.

Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.

The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.

Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because both think that
the other is the leader but actually both are non-replicas - such loop
is broken as soon as the tablet maps are updated.

On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.

An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.

We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.

Fixes: SCYLLADB-1064

Closes scylladb/scylladb#29392
2026-05-21 10:32:56 +02:00
Gleb Natapov
cc034f84c5 schema: ensure committed_by_group0 is set for all non-system tables on boot
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.

Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):

1. Queries system_schema.scylla_tables for rows where committed_by_group0
   is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
   cell (from the in-memory schema). The committed_by_group0 = true flag
   is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.

On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.

This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.

Closes scylladb/scylladb#29911
2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak
cbadc3d675 test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.

Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted

Fixes: SCYLLADB-2120

Closes scylladb/scylladb#29967
2026-05-21 10:50:00 +03:00
Artsiom Mishuta
2259307c2e test.py: remove redundant pytest.mark.asyncio decorators
Fixes: SCYLLADB-1935
2026-05-21 10:36:47 +03:00
Andrzej Jackowski
d2bb72438e test: audit: pin empty-keyspace DDL audit behavior
9646ee05bd changed behavior of empty keyspace handling
and this code path was never tested for CQL audit.
Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE
targeting both an existing keyspace and a nonexistent
one to verify both are audited with empty keyspace.

Before 9646ee05bd, an empty keyspace in audit_info
would be checked against audit_keyspaces like any other
value, silently skipping the statement when "" did not
match any configured keyspace. That commit introduced a
will_log() helper that treats an empty keyspace as
unfilterable, so these DDL statements are now always
logged when their category matches.

Refs SCYLLADB-1641
2026-05-21 08:49:44 +02:00
Andrzej Jackowski
2c15277d02 test: audit: restart server when any non-live config key changes
_check_restart_needed only compared NON_LIVE_AUDIT_KEYS against the
running server config, so extra keys like enable_user_defined_functions
were silently ignored and never applied. Generalize the check to
restart whenever any key outside LIVE_AUDIT_KEYS differs.
2026-05-21 08:49:44 +02:00
Botond Dénes
f8ac8540bd Merge 'logstor: compare records by timestamp and segment sequence number' from Michael Litvak
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.
When inserting a record to index, we compare it with the existing
record, and insert it only if it has newer timestamp.

Add a segment sequence number that is a global (per-shard) increasing
number that is allocated when getting a new segment for write, and is
written in buffer headers in the segment.
It is used to distinguish between buffers written to different generations
of a segment, and for recovery to break ties by keeping the record
from the newest segment.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-770

no backport - logstor is a new feature

Closes scylladb/scylladb#29933

* github.com:scylladb/scylladb:
  test: logstor: add basic delete test
  logstor: rewrite segment seq num from streaming
  logstor: add segment sequence number
  logstor: get_segment helper
  logstor: compare records by timestamp
2026-05-21 08:44:18 +03:00
Andrzej Jackowski
29b7bef15d test: audit: rename 'needed' to 'target_config' for clarity 2026-05-21 07:41:51 +02:00
Petr Gusev
2927f0dd21 storage_service: cancel write handlers during drain to prevent shutdown deadlock
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.

If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
  (the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
  deadlock (the barrier_and_drain RPC handler holds the gate open).

Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().

Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.

Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.

Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
2026-05-20 22:21:45 +02:00
Petr Gusev
5bc3e84d1e test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
The existing test only covers the case where the shutting-down node is
NOT the topology coordinator (deadlocks in uninit_messaging_service).
When the node IS the coordinator, the deadlock manifests differently:
the topology coordinator fiber calls barrier_and_drain on itself
(without messaging), and do_drain -> wait_for_group0_stop blocks
because the coordinator can't stop while stale_versions_in_use is
waiting on the uncancelled write handler.

Run the test twice on the same 2-node cluster (RF=2):
- Run 1: target is a non-coordinator
- Restore cluster state (restart target, decommission added node)
- Run 2: target is the topology coordinator

Use CL=ONE so the write completes from the local replica even with
the other server's response paused.

Mark as xfail since this reproduces bugs not yet fixed on this branch.

Refs: SCYLLADB-1842
2026-05-20 17:22:23 +02:00
Petr Gusev
a093be9ca9 test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
The test was written for another case, and was not supposed to
reproduce the issue that was fixed in this PR.

Fix the test to reproduce the real scenario:

1. Use one_shot=False for pause_before_barrier_and_drain so the
   injection fires on every barrier_and_drain RPC, not just the first.

2. Let the first barrier_and_drain through (at this point the write
   handler's ERM version matches the current token_metadata version).

3. Wait for the second barrier_and_drain. Between the two calls,
   topology_state_load installs a new token_metadata version. The
   write handler still holds the old version's ERM — now stale.

4. After stop_transport completes, disable the injection (rather than
   sending a single message) to release the paused handler and any
   subsequent ones that arrived during stop_transport. The 'disabled'
   flag in injection_shared_data ensures all waiters wake up.

With these changes the test reliably fails (shutdown deadlock within
15s) on the unfixed code and passes on the fixed version from
e0dc73f52a ('Cancel all write requests on storage_proxy shutdown').

Refs: scylladb/scylladb#23665
2026-05-20 17:22:23 +02:00
Petr Gusev
32002f6443 test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
asyncio cancel() only affects the client-side coroutine. The
server-side addserver handler in the cluster manager continues
running. If it can't complete (e.g. no raft quorum because the
target node is shut down), the orphaned handler blocks _after_test
cleanup for 120s.

Await the task instead so it completes cleanly (we restart the
target node first to restore quorum).
2026-05-20 17:22:16 +02:00
Petr Gusev
fa01f74ae6 test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
Add a 15s timeout around the shutdown_task await. If the timeout
fires, the deadlock is reproduced (shutdown hung because
stale_versions_in_use blocks on a write handler holding a stale
token_metadata version).

When the timeout fires, explicitly kill the node via
server_stop() so that the manager's _after_test handler does not
wait 120s for the stuck stop_gracefully request. Then fail the
test with a clear message.
2026-05-20 17:21:56 +02:00
Michał Jadwiszczak
eac9449967 test/test_mv_building: ensure nodes see each other after restart
In SCYLLADB-2058 we observed a timeout exception while querying the base
table after restarting nodes 2 and 3.
Unfortunately, logs don't give us much useful information about the
root cause.

This patch adds basic checks that nodes see each other after the restart
and that the cql connection sees restarted node.
It doesn't guarantee that the error won't occur again - in logs from
SCYLLADB-2058 we see that each node sees other via gossip after part of
the cluster is restarted.

In case the error will occur again, this commit also increases logging
level of `cql_server` and `storage_proxy`.

Refs SCYLLADB-2058

Closes scylladb/scylladb#29951
2026-05-20 14:11:41 +02:00
Marcin Maliszkiewicz
83823149e9 Merge 'audit: implement audit_rules config' from Andrzej Jackowski
This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches.

To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node.

**Performance**
Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:

```
===============================================================================================================================================================================
Configuration                                     | Binary     |         throughput (tps) | insns/op                 | cpu_cycles/op            | alloc/op | logal/op | task/op
===============================================================================================================================================================================
audit=none [1]                                    | baseline   |                 206922.4 |                  36591.6 |                  15348.3 |     58.1 |      0.0 |    14.1
audit=none [1]                                    | this PR    |        207856.4  (+0.5%) |         36544.9  (-0.1%) |         15274.0  (-0.5%) |     58.1 |      0.0 |    14.1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2]                     | baseline   |                  94871.8 |                  54163.0 |                  27172.4 |     72.0 |      0.0 |    24.0
audit=syslog keyspaces=ks [2]                     | this PR    |         96138.4  (+1.3%) |         54072.3  (-0.2%) |         26699.3  (-1.7%) |     72.0 |      0.0 |    24.0
audit=syslog audit-rules=ks [3]                   | this PR    |         95142.1  (+0.3%) |         54457.8  (+0.5%) |         26953.8  (-0.8%) |     72.0 |      0.0 |    24.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4]        | baseline   |                 213997.8 |                  36735.6 |                  14848.1 |     58.1 |      0.0 |    14.1
audit=syslog keyspaces=ks-non-existent [4]        | this PR    |        219297.2  (+2.5%) |         36667.3  (-0.2%) |         14500.1  (-2.3%) |     58.1 |      0.0 |    14.1
audit=syslog audit-rules=ks-non-existent [5]      | this PR    |        211038.7  (-1.4%) |         36999.7  (+0.7%) |         15048.6  (+1.4%) |     58.1 |      0.0 |    14.1
===============================================================================================================================================================================

[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"

audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```

Fixes: SCYLLADB-1430
No backport: new feature

Closes scylladb/scylladb#29267

* github.com:scylladb/scylladb:
  test: alternator: audit: rules filtering and batch bypass
  test: perf: add --audit-rules option to perf-simple-query
  docs: add audit rules section to the auditing guide
  test: audit: cover role and schema cache notifications
  test: audit: cover audit rules cluster behavior
  audit: rebuild rule caches on group0 snapshot and role changes
  audit: refresh rule caches on schema, role, and config changes
  audit: route matching rules to configured sinks
  test: cover preprocessed audit rule cache
  audit: add preprocessed rule matching cache
  audit: pass sink targets to storage helpers
  test: audit: cover rule matching semantics
  audit: add rule matching and sink helpers
  test: audit: cover audit_rules configuration
  config: add live audit_rules option
  test: cover audit rule parsing and validation
  audit: define audit_rule type with parsing and validation
2026-05-20 14:10:45 +02:00
Gleb Natapov
c2cc7ebf39 test: fix test_cas_semaphore flakiness due to paxos state table creation timeout
The test was starting Scylla with --write-request-timeout-in-ms=500 on the
command line. This tight timeout also applied to paxos state table creation,
which goes through raft and can take longer than 500ms on slow platforms
(e.g. aarch64/dev). When the first batch of CAS requests triggered paxos
state table creation under error injection, the raft schema change could
still be in-flight when the second batch fired, causing spurious WriteTimeout
failures unrelated to the semaphore bug being tested.

Fix by changing the write timeout at runtime via the REST API: lower it to
500ms only for the error-injection CAS phase (after table creation is done),
then restore it to 10000ms before the second batch that must succeed.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104

Closes scylladb/scylladb#29969
2026-05-20 13:06:17 +02:00
Andrzej Jackowski
f03398fdba test: audit: cover role and schema cache notifications
Verify on a multi-node cluster that role creation/alter/
drop and table/materialized-view create/drop trigger
updates to the preprocessed audit-rules cache on every
node, and that a matching DML on the newly created table
is audited via the cache.

Refs SCYLLADB-1430
2026-05-20 06:55:15 +02:00
Andrzej Jackowski
7f61d7662d test: audit: cover audit rules cluster behavior
Cluster-level tests should validate rule matching, live
updates, sink routing, role filtering, and error handling
without rerunning the broader audit suite.

Add audit_rules to LIVE_AUDIT_KEYS so the test framework
tracks it as a live-updatable config key. Test that rules
with empty categories or roles match nothing, that DML
rules coexist with legacy audit config, AUTH rules fire
on login events, CQL and REST API update paths reject
invalid JSON, per-rule sink routing works for table and
syslog, role-based filtering works across sessions, and
sink mismatch produces a warning in server logs.

Refs SCYLLADB-1430
2026-05-20 06:55:15 +02:00
Tomasz Grabiec
f0dea67a87 Merge 'transport: add per-service-level transport request latency histogram' from Piotr Smaron and Marcin Maliszkiewicz
Add a per-scheduling-group latency histogram on the transport level that measures the full CQL request lifetime: from fetching the request buffer until the response is written to the socket.

Today latencies are accounted only on the storage proxy level, leaving the time spent in the transport layer (response queue wait + actual I/O) unaccounted. Having both transport and storage proxy latencies allows operators to tell where latency accumulates.

The metric is exposed as scylla_transport_cql_request_latency_histogram with the scheduling_group_name label, following the cql_ prefix convention of all other per-SG transport metrics.

Fixes: SCYLLADB-1691

New feature, no backport.

Closes scylladb/scylladb#29878

* github.com:scylladb/scylladb:
  test/cluster: add test for per-service-level transport request latency histogram
  transport: add per-service-level transport request latency histogram
2026-05-20 01:12:14 +02:00
Michael Litvak
eecbead541 test: wait for others_not_see_server before exclude
Between stopping a server and excluding it, wait for other nodes to see
the server as down, otherwise exclude may see the server as alive and
fail.

Fixes SCYLLADB-2110

Closes scylladb/scylladb#29966
2026-05-19 19:36:54 +02:00
Piotr Smaron
810ed6eedc test/cluster: add test for per-service-level transport request latency histogram
Verify that the new scylla_transport_cql_request_latency_histogram metric
correctly records transport-level request latencies per service level.

Uses error injection to pause a request mid-flight and verifies that the
histogram is not updated while the request is paused (since the response
has not been written yet), and is updated after the request completes.

Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>
2026-05-19 16:07:33 +02:00
Benny Halevy
97e03762c5 test/cluster/test_keyspace_rf: extend test_create_keyspace_with_default_replication_factor for tablets rack lists
Add more racks to dc2 to verify that the default replication factor
covers all available racks (rather than e.g. limited to 3).

With tablets and rf_rack_valid_keyspaces, verify also the automatically
selected rack list.

Restrict the extension to non-debug build modes to prevent running out
of memory with --repeat=100.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29931
2026-05-19 10:44:24 +03:00
Andrei Chekun
6414c48fc2 test.py: rewrite resource gather
Python tests requires different handling of metrics gathering from
cgroup than C++ tests. pytest do not execute each python tests in
a separate process, so we can't put it there and get the metrics.
The idea is to put the whole pytest process to the cgroup and get the
metrics. This will work because pytest runs the threads as as completely
separate processes and inside the thread it will run tests consequently.
Additionally, to simplify system resource monitor moved to pytest main
thread.
2026-05-18 12:23:40 +02:00
Michał Jadwiszczak
c767ac7ef3 test/cluster/test_view_building_coordinator: add reproducer
Add test which reproduces scylladb/scylladb#27298
2026-05-18 09:18:56 +02:00
Evgeniy Naydanov
39a10d6d67 test: remove dead suite subclasses and legacy execution pipeline
After all test suites migrated to test_config.yaml with type: Python,
the specialized suite classes (Topology, CQLApproval, Run, Tool) and
the legacy execution pipeline (find_tests, run_test, TestSuite.run,
Test.run) became unreachable. Remove all this dead code.

Deleted files:
- suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py

Simplified:
- base.py: remove run_test(), read_log(), TestSuite.run(),
  add_test_list(), build_test_list(), all_tests(), test_count(),
  SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead
  Test attributes (args, core_args, valid_exit_codes, allure_dir,
  is_flaky, is_cancelled, etc.)
- python.py: remove PythonTestSuite.run(), PythonTest.run(),
  _prepare_pytest_params(), pattern, test_file_ext, xmlout,
  server_log, scylla_env setup, and shlex import.
  Simplify run_ctx() to take no parameters.
- runner.py: remove --scylla-log-filename option,
  print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import,
  and suite.yaml probe in TestSuiteConfig.from_pytest_node().
- __init__.py: remove re-exports of deleted classes.
- test_config.yaml: Topology -> Python, Approval -> Python.
- conftest files: run_ctx(options=...) -> run_ctx().
- docs/dev/testing.md: update to reflect current pytest-based
  architecture, log paths, and removed features.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>

Closes scylladb/scylladb#29613
2026-05-17 22:16:31 +03:00
Michael Litvak
c3ab341234 test: logstor: add basic delete test
extend the test with basic DELETE queries and verify the results are as
expected.
2026-05-17 17:22:43 +02:00
Michael Litvak
c18d616f64 logstor: compare records by timestamp
Add the record timestamp. The timestamp is extracted from the row marker
of the mutation when we write it.

When inserting a record to index, we compare it with the existing
record, and insert it only if it has newer timestamp.
2026-05-17 17:22:26 +02:00
Andrzej Jackowski
61e5ec9888 test: storage: retry fusermount3 unmount on teardown
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.

Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.

Fixes: SCYLLADB-2049

Closes scylladb/scylladb#29920
2026-05-16 19:36:48 +03:00
Piotr Dulikowski
460cb1656e Merge 'test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Dario Mirovic
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory

Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation.

This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements

Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645

Fixes SCYLLADB-1645

CI stability improvement. Backport to versions that have this test.

Closes scylladb/scylladb#29759

* github.com:scylladb/scylladb:
  test: prepare max cells inserts
  test: reuse max cells schema
  test: limits: skip empty max cells iterations
2026-05-15 18:12:48 +02:00