scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-30 19:46:48 +00:00

Author	SHA1	Message	Date
Piotr Dulikowski	8dfd455001	Merge 'strong consistency: fix drop table blocking on stuck writes and handle timeout in update()' from Petr Gusev - Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum - Handle unhandled timeout exception in the wait-for-leader loop during group startup When a strongly consistent table is dropped, `schedule_raft_group_deletion()` calls `g->close()` which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion. Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state. SCYLLADB-2080 fix: - Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released). - Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling. - Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between). Timeout handling fix: - Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully: log a warning and break out instead of propagating unhandled. 
Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that: 1. Pauses a write on the leader before add_entry 2. Drops the table (follower destroys its group immediately) 3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds backport: no need, strong consistency is not released yet Fixes: SCYLLADB-2080 Closes scylladb/scylladb#30105 * github.com:scylladb/scylladb: strong consistency/groups_manager: handle timeout in update() wait-for-leader loop strong consistency: abort raft server before gate close when dropping a table test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080	2026-05-28 09:59:20 +02:00
Petr Gusev	d922c43358	strong consistency: abort raft server before gate close when dropping a table When a strongly consistent table is dropped, schedule_raft_group_deletion() used to call g->close() first, which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires, unnecessarily delaying group deletion. Fix: initiate gate close (prevents new operations from entering), then abort the raft server (causes in-flight add_entry/read_barrier to throw raft::stopped_error, releasing their gate holders), then await the gate future (resolves immediately since holders are now released). Handle raft::stopped_error in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return no_such_column_family (which the CQL layer converts to InvalidRequest 'unconfigured table'). Otherwise fall through to the default timeout handling. Also replace gate->hold() with try_hold() + on_internal_error in acquire_server, and handle the timeout exception in the wait-for-leader loop in update() gracefully (log + break instead of propagating). Fixes: SCYLLADB-2080	2026-05-27 12:06:46 +02:00
Petr Gusev	89307064b5	test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080 Rewrite the test to use 2 nodes (RF=2) instead of 1 (RF=1), which exposes the quorum-loss scenario: when a table is dropped, the follower destroys its raft group immediately while the leader's in-flight operations are still holding the gate. The test pauses both a read and a write on the leader, drops the table, then resumes them. Both are expected to fail with 'no such column family' since the raft server is aborted as part of group deletion. A 15-second timeout guard detects the old buggy behavior (write stuck forever). Marked xfail until the fix is applied in the next commit.	2026-05-27 12:06:46 +02:00
Nikos Dragazis	54cb6d4608	test: Order task-wait before finalization in test_migration_wait_task The purpose of this test is to verify that the task manager's "wait" API works correctly for vnodes-to-tablets migration virtual tasks. It starts a `wait_task` HTTP request concurrently with a finalize (or rollback) operation, and asserts that the wait returns the correct final state ("done" or "suspended"). The test uses `asyncio.create_task()` to wrap the wait request into a task, and then immediately calls finalize. With asyncio's lazy task scheduling, the wait coroutine does not start until the event loop yields, so the finalization request reaches the server before the wait request, and may therefore also complete before it. Once finalization completes, the virtual migration task is no longer discoverable, causing a "task not found" error. Add a log message in Scylla's wait handler and a synchronization point in the test to ensure that the wait request reaches the server before finalization. This follows the same pattern used in `test_tablet_tasks.py::check_and_abort_repair_task`. Fixes SCYLLADB-2077 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#29973	2026-05-26 10:43:22 +03:00
Botond Dénes	db89f3f095	Merge 'compaction_manager: unregister compaction module on early shutdown' from Patryk Jędrzejczak The compaction module is registered with task_manager in the compaction_manager constructor, and unregistered in compaction_manager::really_do_stop(), which was gated behind `_state != state::none` in compaction_manager::do_stop(). Since enable() -- which transitions _state from none to running -- is called later during startup (from database::start() or the disk space monitor callback) than the compaction_manager constructor, an early shutdown could leave the compaction module registered after compaction_manager::do_stop() returned. task_manager::stop() then aborted with 'Tried to stop task manager while some modules were not unregistered'. Fix compaction_manager::do_stop() to call _task_manager_module->stop() even when `_state == state::none`, so that the compaction module is always properly unregistered. Fixes: SCYLLADB-2106 Backport to all supported branches, as the bug is there and it has already caused a failure in 2026.1 CI. Closes scylladb/scylladb#30015 * github.com:scylladb/scylladb: test: add test_stop_before_starting_compaction_manager compaction_manager: unregister compaction module on early shutdown	2026-05-25 16:08:20 +03:00
Dmitry Kropachev	06eeaf48ff	tests: avoid CQL_ALTERNATOR_QUERIED on zero-token nodes The keyspace RF test starts zero-token nodes as part of its topology setup. The python driver 3.29.9 can't schedule queries on zero-token nodes, so waiting for `CQL_ALTERNATOR_QUERIED` on those nodes is the wrong readiness gate. This change makes the zero-token `server_add()` calls stop at `CQL_ALTERNATOR_CONNECTED`. The test still exercises the keyspace replication assertions through a normal token-owning contact point. Verified with running all 4 variations of `cluster.test_keyspace_rf::test_create_keyspace_with_default_replication_factor` on this branch. Closes scylladb/scylladb#29779	2026-05-25 14:22:04 +03:00
Piotr Dulikowski	3a5dd2e5be	Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros Strongly consistent reads currently call read_barrier() on whichever replica happens to process the request. When a follower runs read_barrier(), it sends an RPC to the leader to get the current read index, then waits for its local apply index to catch up. If the follower is behind, this wait can be significant. By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally. Note that read_barrier() is still required on the leader to confirm it is still the leader and guarantee linearizability. A future optimization would be to implement leases in the raft library, which could eliminate read_barrier() on the leader entirely. The CL-to-behavior mapping is isolated in a single parse_consistency_level() function: - CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader - CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results) - All other CLs -> invalid request Read forwarding reuses the same CQL-layer bounce_to_node() mechanism that write forwarding already uses. The transport layer's existing requests_forwarded_* metrics automatically count forwarded reads. Coordinator-level metrics (linearizable_reads, non_linearizable_reads, writes) are added for visibility into the strong consistency workload. Fixes: SCYLLADB-1157 Closes scylladb/scylladb#29575 * github.com:scylladb/scylladb: strong_consistency: test read forwarding to leader strong_consistency: skip read_barrier() for non-linearizable reads strong_consistency: split coordinator-level read latency metrics strong_consistency: forward linearizable reads to raft leader strong_consistency: classify reads by consistency level strong_consistency: add begin_read() to raft_server	2026-05-25 10:55:00 +02:00
Michael Litvak	73470150a0	logstor: disable logstor compaction in table truncate in database::truncate_table_on_all_shards disable logstor compaction before the table data is truncated, similarly to how non-logstor compaction is disabled, to avoid race conditions between logstor compaction and segments discarding. Fixes SCYLLADB-2186	2026-05-24 10:25:08 +02:00
Wojciech Mitros	45f5df14e5	strong_consistency: test read forwarding to leader Test the linearizable read forwarding behavior in a single test that exercises all scenarios on one cluster: - CL=QUORUM reads on leader, follower, and non-replica nodes - CL=ONE reads (non-linearizable, no forwarding) - Linearizability: write + CL=QUORUM read from follower (10 iterations) - Coordinator latency histogram metrics for both read types Refs: SCYLLADB-1157	2026-05-23 11:35:37 +02:00
Wojciech Mitros	d07692a7ff	strong_consistency: split coordinator-level read latency metrics Split the latency metrics for strongly consistent reads into two categories: linearizable and non-linearizable. They replace the existing metrics for both types combined - this shouldn't cause issues because the feature is still experimental and both the initial introduction of latency metrics and the split will be a part of the same release. Also fix a test that was using the old metric.	2026-05-23 11:35:37 +02:00
Łukasz Paszkowski	96a992002c	tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks The condition variable predicate for repair tasks unconditionally returned true (introduced in `e5928497ce`), which meant event.wait(pred) never actually suspended: do_until checks the predicate first, and if it's already satisfied, returns immediately without calling the inner wait(). This caused two problems: 1. The while(true) loop busy-spun, polling without blocking between topology changes. 2. During shutdown, event.broken() had no effect because no waiter was registered on the CV. The loop kept spinning, holding the HTTP server's task gate open and preventing http_server::stop() from completing. After ~15 minutes, systemd killed the process with SIGABRT. The fix replaces the synchronous predicate with an async task_finished() helper that dispatches on the task type. Since the repair check is async (for_each_tablet scans every tablet), we cannot use event.wait(Pred). Instead, we register a waiter via event.wait() before running the async check, ensuring no broadcast is missed during the check. event.broken() during shutdown propagates broken_condition_variable to the registered waiter and unblocks the loop promptly. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532 Closes scylladb/scylladb#29485	2026-05-22 16:47:48 +03:00
Raphael S. Carvalho	3ba6184462	repair, test: fix split-repair synchronization test timeout in debug mode The test_split_and_incremental_repair_synchronization[True] test was timing out waiting for 'Finalizing resize decision for table' in debug mode. The root cause is a timing race: the incremental_repair_prepare_wait error injection has a hardcoded 60s auto-expiry timeout (wait_for_message(60s)), but split compactions in debug mode take ~58s per SSTable due to -O0 compilation and scheduler starvation (the maintenance_compaction group gets ~10% of wall-clock time). When the injection auto-expires before split finalization, the repair fails, leaving tablets stuck in transition=repair state. This prevents the topology coordinator from finalizing the split, causing the 600s test timeout. Fix both contributing factors: - Increase the injection timeout from 60s to 10min, giving split compactions ample time to complete before the injection auto-expires. The test explicitly messages the injection to release it (line 2200), so the longer timeout is just a safety net. - Reduce data volume from 256 to 64 rows (and repair data from 256 to 64 rows), producing smaller SSTables that split much faster in debug mode. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2123. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#30004	2026-05-22 15:03:47 +03:00
Patryk Jędrzejczak	082936ce43	Merge 'test: pylib: Convict the node on server_stop()' from Tomasz Grabiec This is about ungraceful stop, where the node is killed. Test cases typically need to wait for other nodes to notice that the node is down before proceeding. By default, that takes about 20s. Can be reduced via config by reducing failure detector threshold, but it's not the best solution: - cannot set the threshold too low, or we'll introduce flakiness due to false positives - so it's still slow (a couple of seconds) - developers forget about it and the test still works This patch speeds this up by adding a way to convict the node immediately after stopping the node, controlled by the "convict" parameter. At the end of the series the "convict" parameter is required, and each test decides what it wants. Commits are split into steps: - the series starts with defaulting to convict=False - each test case sets "convict" explicitly, and changes are split into 3 commits depending on whether convict=True is: useless, beneficial, undesirable - finally, the "convict" parameter is made mandatory There is also a dedicated test for natural failure detection (test_natural_failure_detection in test_gossiper.py) to ensure FD coverage is not lost. 
Tested on dev-mode cluster/test_tablets_parallel_decommission.py::test_node_lost_during_decommission_drain: Wall clock time reduced from 41s to 16s No backport: enhancement Closes scylladb/scylladb#28495 * https://github.com/scylladb/scylladb: test: gossiper: Add test for natural failure detection test: pylib: Make convict a required parameter in server_stop() test: Annotate server_stop() calls where conviction is harmful test: Annotate server_stop() calls where conviction is beneficial test: Annotate server_stop() calls where conviction is useless test: pylib: Add convict option to server_stop() api: failure_detector: Introduce convict-node API gms: gossiper: Make convict() public and safe to call from any scheduling group api: Extract validate functions to common header	2026-05-22 13:39:50 +02:00
Patryk Jędrzejczak	b7400d20dd	test: add test_stop_before_starting_compaction_manager	2026-05-22 11:58:37 +02:00
Marcin Maliszkiewicz	18dd281e72	Merge 'test: audit: pin empty-keyspace DDL audit behavior' from Andrzej Jackowski `9646ee05bd` changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace. No backport, just a missing test case. Closes scylladb/scylladb#29542 * github.com:scylladb/scylladb: test: audit: pin empty-keyspace DDL audit behavior test: audit: restart server when any non-live config key changes test: audit: rename 'needed' to 'target_config' for clarity	2026-05-22 09:42:34 +02:00
Tomasz Grabiec	445a8b9a3e	test: gossiper: Add test for natural failure detection Add test_natural_failure_detection which verifies that the failure detector detects a killed node as DOWN without using the convict mechanism. Uses the failure_detector_timeout fixture to keep the FD timeout short (2s in release mode). This ensures that natural failure detection continues to work correctly even as other tests adopt the convict mechanism for speed.	2026-05-21 21:33:24 +02:00
Tomasz Grabiec	9b40cf89fe	test: Annotate server_stop() calls where conviction is harmful Add explicit convict=False to server_stop() calls where convicting the node would break or weaken the test. In test_backoff_when_node_fails_task_rpc, the desired behavior is for the node to not be marked as down immediately: # The purpose of this is to simulate a situation when the gossiper # doesn't mark a dead node as such immediately. In raft tests, conviction could trigger voter reassignment while the test wants to test the scenario with voters being still down. In test_tablet_mv_replica_pairing_during_replace, conviction triggers SCYLLADB-1996 (replace fails with "Failed to add server").	2026-05-21 21:33:19 +02:00
Tomasz Grabiec	92416d850a	test: Annotate server_stop() calls where conviction is beneficial Add explicit convict=True to server_stop() calls where the test needs other nodes to detect the stopped node as DOWN in order to proceed. These are cases before remove_node, replace, or explicit waits for failure detection (server_not_sees_other_server, wait_new_coordinator_elected). Convicting immediately speeds up the test.	2026-05-21 21:31:22 +02:00
Tomasz Grabiec	624fe11178	test: Annotate server_stop() calls where conviction is useless Pass convict=False explicitly to server_stop() calls where conviction provides no benefit because there is no consumer of the failure detection: - single-node clusters (no other node to call the API on) - all nodes being stopped concurrently (no live node remains) - immediate restart (no test logic between stop and start depends on other nodes detecting the stopped node as dead) - node stopped for file manipulation or bootstrap abort - majority killed with no quorum on surviving nodes to react - no test logic depends on other nodes detecting the failure This is a no-op change since the default is already convict=False, but makes the intent explicit for each call site.	2026-05-21 21:13:55 +02:00
Patryk Jędrzejczak	1ed3f5c4af	Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped). Two manifestations depending on whether the shutting-down node is the topology coordinator: - Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use(). - Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open. The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed: the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True`, so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all. Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`. Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster. 
Also includes supporting changes: - error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers - error_injection: add non-shared mode to wait_for_message for per-invocation message semantics - scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked Fixes: SCYLLADB-1842 Refs: scylladb/scylladb#23665 backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1 Closes scylladb/scylladb#29882 * https://github.com/scylladb/scylladb: storage_service: cancel write handlers during drain to prevent shutdown deadlock test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task test: scylla_cluster: allow stop() to bypass start_stop_lock error_injection: add non-shared mode to wait_for_message error_injection: release waiters when injection is disabled	2026-05-21 15:43:36 +02:00
Piotr Dulikowski	6148316f66	Merge 'db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak There is a small window, just after the view building coordinator releases the group0 guard and before it waits on view_building_state_machine's CV, in which the coordinator may miss a CV broadcast triggered by finished remote work. To fix it, this patch adds a boolean flag, which is set to true before broadcasting the CV and is checked before awaiting on the CV. Fixes SCYLLADB-2029 The problem is not critical but it should be backported to 2025.4 and newer versions, all of which contain the view building coordinator. Closes scylladb/scylladb#27313 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer db/view/view_building_coordinator: add flag to mark if any remote work was finished	2026-05-21 15:11:58 +02:00
Wojciech Mitros	13c043903d	strong_consistency: cache leader location for non-replica nodes When a non-replica node handles a strongly consistent write, it must forward the request to a replica. If the closest replica is not the leader, the request gets redirected again, causing an extra roundtrip. Add a leader location cache in groups_manager, keyed by raft group_id. After a write request is forwarded, the CQL transport layer records the final node as the leader in the cache. Subsequent write requests from the same node for the same group are forwarded directly to the cached leader, eliminating the extra roundtrip. The cache is only used for writes. Reads can be served by any replica, so they skip the cache and use proximity-based routing instead. Cache entries are validated at use time: if the cached leader is no longer a replica (e.g. after tablet migration), the entry is evicted and the normal closest-replica path is taken. This prevents a scenario where two nodes keep redirecting to each other because both think that the other is the leader but actually both are non-replicas; such a loop is broken as soon as the tablet maps are updated. On token_metadata updates, entries for groups that no longer exist (e.g. table dropped, tablet merged) are evicted. Entries for groups that still exist are kept; use-time validation handles staleness. An on_node_resolved callback is propagated through the redirect/bounce path so the transport layer can update the cache generically without coupling to the strong-consistency coordinator. The coordinator creates the callback only for writes (capturing the groups_manager and group_id) and attaches it to the bounce message; the transport layer invokes it once the final node is known, keeping the forwarding infrastructure subsystem-agnostic. We also add a test which verifies that after the initial redirect, following requests to the same node avoid the extra redirect and forward directly to the leader. 
Fixes: SCYLLADB-1064 Closes scylladb/scylladb#29392	2026-05-21 10:32:56 +02:00
Gleb Natapov	cc034f84c5	schema: ensure committed_by_group0 is set for all non-system tables on boot Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled have committed_by_group0 = null in system_schema.scylla_tables. This causes maybe_delete_schema_version() to delete their version cell, forcing the legacy hash-based schema version computation path. Add ensure_committed_by_group0() which runs on boot and fixes up any non-system tables where committed_by_group0 is not true (null or false): 1. Queries system_schema.scylla_tables for rows where committed_by_group0 is null or false, skipping system keyspaces (system, system_schema). 2. Takes a group0 guard 3. Re-checks after the raft barrier in case another node already fixed it. 4. For each table needing fixup, creates a mutation writing the version cell (from the in-memory schema). The committed_by_group0 = true flag is stamped by add_committed_by_group0_flag() inside announce(). 5. Announces via raft group0. 6. Retries with a small random delay on group0_concurrent_modification. On other nodes, schema_applier will detect these as "altered" tables (scylla_tables mutation changed), but since the actual table definition is unchanged, update_column_family is effectively a no-op. This is a prerequisite for eventually removing the legacy hash-based schema versioning code path. Closes scylladb/scylladb#29911	2026-05-21 10:22:07 +02:00
Patryk Jędrzejczak	cbadc3d675	test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation Snapshot creation and raft log truncation happen asynchronously in the IO fiber after a schema change completes. The test was querying system.raft immediately after the schema change returned, racing with the IO fiber's store_snapshot_descriptor call. Replace immediate assertions with wait_for polling loops: - log_size == 0: wait for log truncation after drop keyspace - new_snap_id != original_snap_id: wait for new snapshot to be persisted Fixes: SCYLLADB-2120 Closes scylladb/scylladb#29967	2026-05-21 10:50:00 +03:00
Artsiom Mishuta	2259307c2e	test.py: remove redundant pytest.mark.asyncio decorators Fixes: SCYLLADB-1935	2026-05-21 10:36:47 +03:00
Andrzej Jackowski	d2bb72438e	test: audit: pin empty-keyspace DDL audit behavior `9646ee05bd` changed behavior of empty keyspace handling and this code path was never tested for CQL audit. Test CREATE/DROP FUNCTION and CREATE/DROP AGGREGATE targeting both an existing keyspace and a nonexistent one to verify both are audited with empty keyspace. Before `9646ee05bd`, an empty keyspace in audit_info would be checked against audit_keyspaces like any other value, silently skipping the statement when "" did not match any configured keyspace. That commit introduced a will_log() helper that treats an empty keyspace as unfilterable, so these DDL statements are now always logged when their category matches. Refs SCYLLADB-1641	2026-05-21 08:49:44 +02:00
Andrzej Jackowski	2c15277d02	test: audit: restart server when any non-live config key changes _check_restart_needed only compared NON_LIVE_AUDIT_KEYS against the running server config, so extra keys like enable_user_defined_functions were silently ignored and never applied. Generalize the check to restart whenever any key outside LIVE_AUDIT_KEYS differs.	2026-05-21 08:49:44 +02:00
Botond Dénes	f8ac8540bd	Merge 'logstor: compare records by timestamp and segment sequence number' from Michael Litvak Add the record timestamp. The timestamp is extracted from the row marker of the mutation when we write it. When inserting a record into the index, we compare it with the existing record, and insert it only if it has a newer timestamp. Add a segment sequence number that is a global (per-shard) increasing number that is allocated when getting a new segment for write, and is written in buffer headers in the segment. It is used to distinguish between buffers written to different generations of a segment, and for recovery to break ties by keeping the record from the newest segment. Refs https://scylladb.atlassian.net/browse/SCYLLADB-770 no backport - logstor is a new feature Closes scylladb/scylladb#29933 * github.com:scylladb/scylladb: test: logstor: add basic delete test logstor: rewrite segment seq num from streaming logstor: add segment sequence number logstor: get_segment helper logstor: compare records by timestamp	2026-05-21 08:44:18 +03:00
Andrzej Jackowski	29b7bef15d	test: audit: rename 'needed' to 'target_config' for clarity	2026-05-21 07:41:51 +02:00
Petr Gusev	2927f0dd21	storage_service: cancel write handlers during drain to prevent shutdown deadlock When a node shuts down, do_drain() calls stop_transport() which tears down the messaging service. After this point, MUTATION_DONE responses from replicas can no longer reach the coordinator, so any in-flight write_response_handlers will never complete naturally. These handlers hold ERMs referencing stale token_metadata versions. If the topology coordinator calls barrier_and_drain (either on itself or via RPC), it blocks in stale_versions_in_use() waiting for these stale versions to be released. This causes: - On the coordinator node: do_drain -> wait_for_group0_stop deadlock (the topology coordinator fiber is stuck in barrier_and_drain). - On non-coordinator nodes: ss::stop -> uninit_messaging_service deadlock (the barrier_and_drain RPC handler holds the gate open). Fix: cancel all write response handlers on all shards right after stop_transport() in do_drain(). This releases their ERMs and the associated stale token_metadata versions, unblocking stale_versions_in_use(). Heap-allocate _write_handlers_gate and add an allow_new parameter to cancel_all_write_response_handlers(). When allow_new=true (used by do_drain), the gate is closed and swapped with a fresh one — existing handlers are waited on while new handlers can still be created. This avoids blocking internal writes (paxos learn, compaction history updates) that still need to create handlers during the remainder of the drain sequence. When allow_new=false (used by drain_on_shutdown), the gate is closed permanently — no new handlers can be created after final shutdown. Update test_lwt_shutdown to wait for 'Stop transport: done' instead of 'Shutting down storage proxy RPC verbs'. 
The latter message is now only logged after do_drain() completes, but do_drain() blocks in cancel_all_write_response_handlers() waiting for the background paxos learn handler — which is exactly what the test needs to release before shutdown can proceed. Fixes: SCYLLADB-1842 Refs: scylladb/scylladb#23665	2026-05-20 22:21:45 +02:00
Petr Gusev	5bc3e84d1e	test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown The existing test only covers the case where the shutting-down node is NOT the topology coordinator (deadlocks in uninit_messaging_service). When the node IS the coordinator, the deadlock manifests differently: the topology coordinator fiber calls barrier_and_drain on itself (without messaging), and do_drain -> wait_for_group0_stop blocks because the coordinator can't stop while stale_versions_in_use is waiting on the uncancelled write handler. Run the test twice on the same 2-node cluster (RF=2): - Run 1: target is a non-coordinator - Restore cluster state (restart target, decommission added node) - Run 2: target is the topology coordinator Use CL=ONE so the write completes from the local replica even with the other server's response paused. Mark as xfail since this reproduces bugs not yet fixed on this branch. Refs: SCYLLADB-1842	2026-05-20 17:22:23 +02:00
Petr Gusev	a093be9ca9	test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock The test was written for another case, and was not supposed to reproduce the issue that was fixed in this PR. Fix the test to reproduce the real scenario: 1. Use one_shot=False for pause_before_barrier_and_drain so the injection fires on every barrier_and_drain RPC, not just the first. 2. Let the first barrier_and_drain through (at this point the write handler's ERM version matches the current token_metadata version). 3. Wait for the second barrier_and_drain. Between the two calls, topology_state_load installs a new token_metadata version. The write handler still holds the old version's ERM — now stale. 4. After stop_transport completes, disable the injection (rather than sending a single message) to release the paused handler and any subsequent ones that arrived during stop_transport. The 'disabled' flag in injection_shared_data ensures all waiters wake up. With these changes the test reliably fails (shutdown deadlock within 15s) on the unfixed code and passes on the fixed version from `e0dc73f52a` ('Cancel all write requests on storage_proxy shutdown'). Refs: scylladb/scylladb#23665	2026-05-20 17:22:23 +02:00
Petr Gusev	32002f6443	test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it asyncio cancel() only affects the client-side coroutine. The server-side addserver handler in the cluster manager continues running. If it can't complete (e.g. no raft quorum because the target node is shut down), the orphaned handler blocks _after_test cleanup for 120s. Await the task instead so it completes cleanly (we restart the target node first to restore quorum).	2026-05-20 17:22:16 +02:00
Petr Gusev	fa01f74ae6	test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task Add a 15s timeout around the shutdown_task await. If the timeout fires, the deadlock is reproduced (shutdown hung because stale_versions_in_use blocks on a write handler holding a stale token_metadata version). When the timeout fires, explicitly kill the node via server_stop() so that the manager's _after_test handler does not wait 120s for the stuck stop_gracefully request. Then fail the test with a clear message.	2026-05-20 17:21:56 +02:00
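The timeout-and-kill pattern from this commit might look roughly like the sketch below. It assumes a hypothetical `kill_node` coroutine standing in for the manager's `server_stop()` call; the function name `await_shutdown` and the failure message are illustrative, not taken from the test.

```python
import asyncio


async def await_shutdown(shutdown_task, kill_node, timeout=15.0):
    """Wait for a graceful shutdown; on timeout, kill the node and fail."""
    try:
        # wait_for cancels shutdown_task if the timeout expires.
        await asyncio.wait_for(shutdown_task, timeout=timeout)
    except asyncio.TimeoutError:
        # Kill the node explicitly so later cleanup does not wait on the
        # stuck graceful-stop request (kill_node is a hypothetical callback).
        await kill_node()
        raise AssertionError(f"shutdown did not finish within {timeout}s (deadlock?)")


async def demo():
    killed = []

    async def kill_node():
        killed.append(True)

    # A task that never finishes stands in for the hung shutdown.
    stuck = asyncio.ensure_future(asyncio.sleep(3600))
    try:
        await await_shutdown(stuck, kill_node, timeout=0.01)
    except AssertionError:
        return killed == [True]
    return False
```

Failing with an explicit assertion keeps the deadlock visible in the test report instead of surfacing as an unrelated cleanup timeout.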
Michał Jadwiszczak	eac9449967	test/test_mv_building: ensure nodes see each other after restart In SCYLLADB-2058 we observed a timeout exception while querying the base table after restarting nodes 2 and 3. Unfortunately, the logs don't give us much useful information about the root cause. This patch adds basic checks that nodes see each other after the restart and that the CQL connection sees the restarted node. It doesn't guarantee that the error won't occur again: in the logs from SCYLLADB-2058 we see that each node sees the others via gossip after part of the cluster is restarted. In case the error occurs again, this commit also increases the logging level of `cql_server` and `storage_proxy`. Refs SCYLLADB-2058 Closes scylladb/scylladb#29951	2026-05-20 14:11:41 +02:00
Marcin Maliszkiewicz	83823149e9	Merge 'audit: implement audit_rules config' from Andrzej Jackowski This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches. To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node. Performance Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. 
Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:
```
==============================================================================================================================================
Configuration                                | Binary   | throughput (tps) | insns/op        | cpu_cycles/op   | alloc/op | logal/op | task/op
==============================================================================================================================================
audit=none [1]                               | baseline | 206922.4         | 36591.6         | 15348.3         | 58.1     | 0.0      | 14.1
audit=none [1]                               | this PR  | 207856.4 (+0.5%) | 36544.9 (-0.1%) | 15274.0 (-0.5%) | 58.1     | 0.0      | 14.1
----------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2]                | baseline | 94871.8          | 54163.0         | 27172.4         | 72.0     | 0.0      | 24.0
audit=syslog keyspaces=ks [2]                | this PR  | 96138.4 (+1.3%)  | 54072.3 (-0.2%) | 26699.3 (-1.7%) | 72.0     | 0.0      | 24.0
audit=syslog audit-rules=ks [3]              | this PR  | 95142.1 (+0.3%)  | 54457.8 (+0.5%) | 26953.8 (-0.8%) | 72.0     | 0.0      | 24.0
----------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4]   | baseline | 213997.8         | 36735.6         | 14848.1         | 58.1     | 0.0      | 14.1
audit=syslog keyspaces=ks-non-existent [4]   | this PR  | 219297.2 (+2.5%) | 36667.3 (-0.2%) | 14500.1 (-2.3%) | 58.1     | 0.0      | 14.1
audit=syslog audit-rules=ks-non-existent [5] | this PR  | 211038.7 (-1.4%) | 36999.7 (+0.7%) | 15048.6 (+1.4%) | 58.1     | 0.0      | 14.1
==============================================================================================================================================
[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks."],"roles":[""]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent."],"roles":[""]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```
Fixes: SCYLLADB-1430 No backport: new feature Closes scylladb/scylladb#29267 * github.com:scylladb/scylladb: test: alternator: audit: rules filtering and batch bypass test: perf: add --audit-rules option to perf-simple-query docs: add audit rules section to the auditing guide test: audit: cover role and schema cache notifications test: audit: cover audit rules cluster behavior audit: rebuild rule caches on group0 snapshot and role changes audit: refresh rule caches on schema, role, and config changes audit: route matching rules to configured sinks test: cover preprocessed audit rule cache audit: add preprocessed rule matching cache audit: pass sink targets to storage helpers test: audit: cover rule matching semantics audit: add rule matching 
and sink helpers test: audit: cover audit_rules configuration config: add live audit_rules option test: cover audit rule parsing and validation audit: define audit_rule type with parsing and validation	2026-05-20 14:10:45 +02:00
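The rule-matching semantics described in this merge can be approximated in a few lines of Python. This is a sketch under assumptions, not the Scylla implementation: the rule keys (`sinks`, `categories`, `qualified_table_names`, `roles`) follow the JSON shown in the commit message, but the example patterns are made up, and Python's `fnmatchcase` does not support the extended glob syntax the commit mentions. The preprocessed cache is not modelled.

```python
from fnmatch import fnmatchcase

# Categories for which the table filter applies, per the commit message.
TABLE_SCOPED = {"DML", "DDL", "QUERY"}


def rule_matches(rule, category, role, table):
    """Return True if one rule matches the audit event."""
    if category not in rule["categories"]:
        return False  # an empty category list matches nothing
    if not any(fnmatchcase(role, p) for p in rule["roles"]):
        return False  # an empty role list matches nothing
    if category in TABLE_SCOPED:
        # An empty table list matches nothing for table-scoped categories.
        return any(fnmatchcase(table, p) for p in rule["qualified_table_names"])
    return True  # AUTH, ADMIN, DCL ignore the table filter


def sinks_for(rules, category, role, table=""):
    """Union of the sinks of every matching rule."""
    sinks = set()
    for rule in rules:
        if rule_matches(rule, category, role, table):
            sinks.update(rule["sinks"])
    return sinks


# Illustrative rules; the patterns are assumptions, not taken from the source.
RULES = [
    {"sinks": ["syslog"], "categories": ["DML", "AUTH"],
     "qualified_table_names": ["ks.*"], "roles": ["admin*"]},
    {"sinks": ["table"], "categories": ["DML"],
     "qualified_table_names": [], "roles": ["*"]},
]
```

For example, `sinks_for(RULES, "DML", "admin1", "ks.t")` yields only the `syslog` sink, because the second rule has an empty table list and DML is table-scoped.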
Gleb Natapov	c2cc7ebf39	test: fix test_cas_semaphore flakiness due to paxos state table creation timeout The test was starting Scylla with --write-request-timeout-in-ms=500 on the command line. This tight timeout also applied to paxos state table creation, which goes through raft and can take longer than 500ms on slow platforms (e.g. aarch64/dev). When the first batch of CAS requests triggered paxos state table creation under error injection, the raft schema change could still be in-flight when the second batch fired, causing spurious WriteTimeout failures unrelated to the semaphore bug being tested. Fix by changing the write timeout at runtime via the REST API: lower it to 500ms only for the error-injection CAS phase (after table creation is done), then restore it to 10000ms before the second batch that must succeed. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2104 Closes scylladb/scylladb#29969	2026-05-20 13:06:17 +02:00
Andrzej Jackowski	f03398fdba	test: audit: cover role and schema cache notifications Verify on a multi-node cluster that role creation/alter/drop and table/materialized-view create/drop trigger updates to the preprocessed audit-rules cache on every node, and that a matching DML on the newly created table is audited via the cache. Refs SCYLLADB-1430	2026-05-20 06:55:15 +02:00
Andrzej Jackowski	7f61d7662d	test: audit: cover audit rules cluster behavior Cluster-level tests should validate rule matching, live updates, sink routing, role filtering, and error handling without rerunning the broader audit suite. Add audit_rules to LIVE_AUDIT_KEYS so the test framework tracks it as a live-updatable config key. Test that rules with empty categories or roles match nothing, that DML rules coexist with legacy audit config, AUTH rules fire on login events, CQL and REST API update paths reject invalid JSON, per-rule sink routing works for table and syslog, role-based filtering works across sessions, and sink mismatch produces a warning in server logs. Refs SCYLLADB-1430	2026-05-20 06:55:15 +02:00
Tomasz Grabiec	f0dea67a87	Merge 'transport: add per-service-level transport request latency histogram' from Piotr Smaron and Marcin Maliszkiewicz Add a per-scheduling-group latency histogram on the transport level that measures the full CQL request lifetime: from fetching the request buffer until the response is written to the socket. Today latencies are accounted only on the storage proxy level, leaving the time spent in the transport layer (response queue wait + actual I/O) unaccounted. Having both transport and storage proxy latencies allows operators to tell where latency accumulates. The metric is exposed as scylla_transport_cql_request_latency_histogram with the scheduling_group_name label, following the cql_ prefix convention of all other per-SG transport metrics. Fixes: SCYLLADB-1691 New feature, no backport. Closes scylladb/scylladb#29878 * github.com:scylladb/scylladb: test/cluster: add test for per-service-level transport request latency histogram transport: add per-service-level transport request latency histogram	2026-05-20 01:12:14 +02:00
Michael Litvak	eecbead541	test: wait for others_not_see_server before exclude Between stopping a server and excluding it, wait for other nodes to see the server as down, otherwise exclude may see the server as alive and fail. Fixes SCYLLADB-2110 Closes scylladb/scylladb#29966	2026-05-19 19:36:54 +02:00
Piotr Smaron	810ed6eedc	test/cluster: add test for per-service-level transport request latency histogram Verify that the new scylla_transport_cql_request_latency_histogram metric correctly records transport-level request latencies per service level. Uses error injection to pause a request mid-flight and verifies that the histogram is not updated while the request is paused (since the response has not been written yet), and is updated after the request completes. Co-authored-by: Marcin Maliszkiewicz <marcinmal@scylladb.com>	2026-05-19 16:07:33 +02:00
Benny Halevy	97e03762c5	test/cluster/test_keyspace_rf: extend test_create_keyspace_with_default_replication_factor for tablets rack lists Add more racks to dc2 to verify that the default replication factor covers all available racks (rather than e.g. limited to 3). With tablets and rf_rack_valid_keyspaces, verify also the automatically selected rack list. Restrict the extension to non-debug build modes to prevent running out of memory with --repeat=100. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#29931	2026-05-19 10:44:24 +03:00
Andrei Chekun	6414c48fc2	test.py: rewrite resource gather Python tests require different handling of metrics gathering from cgroups than C++ tests. pytest does not execute each Python test in a separate process, so we can't put each test into its own cgroup and get its metrics there. The idea is to put the whole pytest process into the cgroup and get the metrics from it. This works because pytest runs the threads as completely separate processes, and inside each thread it runs the tests sequentially. Additionally, to simplify, the system resource monitor was moved to the pytest main thread.	2026-05-18 12:23:40 +02:00
Michał Jadwiszczak	c767ac7ef3	test/cluster/test_view_building_coordinator: add reproducer Add test which reproduces scylladb/scylladb#27298	2026-05-18 09:18:56 +02:00
Evgeniy Naydanov	39a10d6d67	test: remove dead suite subclasses and legacy execution pipeline After all test suites migrated to test_config.yaml with type: Python, the specialized suite classes (Topology, CQLApproval, Run, Tool) and the legacy execution pipeline (find_tests, run_test, TestSuite.run, Test.run) became unreachable. Remove all this dead code. Deleted files: - suite/topology.py, suite/cql_approval.py, suite/run.py, suite/tool.py Simplified: - base.py: remove run_test(), read_log(), TestSuite.run(), add_test_list(), build_test_list(), all_tests(), test_count(), SUITE_CONFIG_FILENAME, disabled/flaky test tracking, and dead Test attributes (args, core_args, valid_exit_codes, allure_dir, is_flaky, is_cancelled, etc.) - python.py: remove PythonTestSuite.run(), PythonTest.run(), _prepare_pytest_params(), pattern, test_file_ext, xmlout, server_log, scylla_env setup, and shlex import. Simplify run_ctx() to take no parameters. - runner.py: remove --scylla-log-filename option, print_scylla_log_filename fixture, SUITE_CONFIG_FILENAME import, and suite.yaml probe in TestSuiteConfig.from_pytest_node(). - __init__.py: remove re-exports of deleted classes. - test_config.yaml: Topology -> Python, Approval -> Python. - conftest files: run_ctx(options=...) -> run_ctx(). - docs/dev/testing.md: update to reflect current pytest-based architecture, log paths, and removed features. Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com> Closes scylladb/scylladb#29613	2026-05-17 22:16:31 +03:00
Michael Litvak	c3ab341234	test: logstor: add basic delete test Extend the test with basic DELETE queries and verify the results are as expected.	2026-05-17 17:22:43 +02:00
Michael Litvak	c18d616f64	logstor: compare records by timestamp Add the record timestamp. The timestamp is extracted from the row marker of the mutation when we write it. When inserting a record into the index, we compare it with the existing record and insert it only if it has a newer timestamp.	2026-05-17 17:22:26 +02:00
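The timestamp comparison described above amounts to a last-write-wins insert. A minimal Python sketch, assuming a dict-based index and records carrying a `timestamp` field (both hypothetical; the commit does not describe the index layout, nor how equal timestamps are handled, so ties are rejected here as an assumption):

```python
def insert_if_newer(index, key, record):
    """Insert record under key only if it is newer than the existing one."""
    existing = index.get(key)
    # Assumption: a record with an equal timestamp does not replace the old one.
    if existing is None or record["timestamp"] > existing["timestamp"]:
        index[key] = record
        return True
    return False


index = {}
first = insert_if_newer(index, "k", {"timestamp": 10, "value": "a"})
stale = insert_if_newer(index, "k", {"timestamp": 5, "value": "old"})
newer = insert_if_newer(index, "k", {"timestamp": 20, "value": "b"})
```

After these three calls the index holds the record with timestamp 20, and the stale write with timestamp 5 has been ignored.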
Andrzej Jackowski	61e5ec9888	test: storage: retry fusermount3 unmount on teardown After stopping scylla server processes, the FUSE daemon (fuse2fs) may still be processing file handle closures. An immediate fusermount3 -u can fail with 'device busy', causing spurious test failures on teardown. Retry the unmount up to 10 times with 0.5s delay between attempts, and capture stderr for diagnostics. Fixes: SCYLLADB-2049 Closes scylladb/scylladb#29920	2026-05-16 19:36:48 +03:00
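The retry loop from this commit could be sketched as follows. The function name, the injectable `run` parameter, and the error message are assumptions made for illustration; only the `fusermount3 -u` command, the 10 attempts, the 0.5s delay, and the stderr capture come from the commit message.

```python
import subprocess
import time


def unmount_with_retry(mountpoint, attempts=10, delay=0.5, run=subprocess.run):
    """Retry `fusermount3 -u` until it succeeds or the attempts run out."""
    last_stderr = ""
    for attempt in range(attempts):
        result = run(["fusermount3", "-u", mountpoint],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt + 1  # number of attempts that were needed
        last_stderr = result.stderr  # keep stderr for diagnostics
        if attempt + 1 < attempts:
            time.sleep(delay)  # give the FUSE daemon time to close handles
    raise RuntimeError(
        f"failed to unmount {mountpoint} after {attempts} attempts: {last_stderr}"
    )
```

Passing `run` as a parameter is only a convenience for exercising the loop without a real FUSE mount; a caller would normally rely on the default.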
Piotr Dulikowski	460cb1656e	Merge 'test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Dario Mirovic The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are: 1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator 2. The test built the whole batch as a single string that takes contiguous memory Also, some iterations inserted zero rows, but still did CREATE/DROP table which also contributed to the fragmentation. This patch series: - Skips iterations that insert zero rows - Creates the table once, truncates it after each test iteration - Switches to prepared statements Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645 Fixes SCYLLADB-1645 CI stability improvement. Backport to versions that have this test. Closes scylladb/scylladb#29759 * github.com:scylladb/scylladb: test: prepare max cells inserts test: reuse max cells schema test: limits: skip empty max cells iterations	2026-05-15 18:12:48 +02:00

1374 Commits