Files
scylladb/test
Patryk Jędrzejczak 6fa12c6419 Merge '[Backport 2026.2] storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Scylladb[bot]
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).

Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.

The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed — the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True,` so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.

Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.

Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.

Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked

Fixes: SCYLLADB-2163
Refs: scylladb/scylladb#23665

backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1

- (cherry picked from commit bc4dc13e94)
- (cherry picked from commit 324a08295d)
- (cherry picked from commit c88120abca)
- (cherry picked from commit fa01f74ae6)
- (cherry picked from commit 32002f6443)
- (cherry picked from commit a093be9ca9)
- (cherry picked from commit 5bc3e84d1e)
- (cherry picked from commit 2927f0dd21)

Parent PR: #29882

Closes scylladb/scylladb#30008

* https://github.com/scylladb/scylladb:
  storage_proxy: only cancel write handlers with pending remote targets during drain
  storage_service: cancel write handlers during drain to prevent shutdown deadlock
  test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
  test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
  test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
  test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
  test: scylla_cluster: allow stop() to bypass start_stop_lock
  error_injection: add non-shared mode to wait_for_message
  error_injection: release waiters when injection is disabled
2026-05-28 12:49:15 +02:00
..
2026-04-12 19:46:33 +03:00
2026-04-12 19:46:33 +03:00

Scylla in-source tests.

For details on how to run the tests, see docs/dev/testing.md

Shared C++ utils, libraries are in lib/, for Python - pylib/

alternator - Python tests which connect to a single server and use the DynamoDB API unit, boost, raft - unit tests in C++ cqlpy - Python tests which connect to a single server and use CQL topology* - tests that set up clusters and add/remove nodes cql - approval tests that use CQL and pre-recorded output rest_api - tests for Scylla REST API Port 9000 scylla-gdb - tests for scylla-gdb.py helper script nodetool - tests for C++ implementation of nodetool

If you can use an existing folder, consider adding your test to it. New folders should be used for new large categories/subsystems, or when the test environment is significantly different from some existing suite, e.g. you plan to start scylladb with different configuration, and you intend to add many tests and would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).

To add a new folder, create a new directory, and then copy & edit its suite.ini.