scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-03 13:37:04 +00:00

Files

Petr Gusev a2b2a42936 storage_service: cancel write handlers during drain to prevent shutdown deadlock

When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.

If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
  (the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
  deadlock (the barrier_and_drain RPC handler holds the gate open).

Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().
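
The blocking behaviour described above can be modelled in a few lines. This is only an illustrative sketch in Python asyncio, not ScyllaDB code: `VersionTracker`, `write_handler`, and `demo` are hypothetical names, and the per-version reference count merely stands in for the ERMs that write handlers hold. It shows that a barrier waiting on stale versions completes only once the stuck handlers are cancelled.

```python
import asyncio


class VersionTracker:
    """Hypothetical stand-in for token_metadata version tracking."""

    def __init__(self):
        self.current = 1
        self._holders = {}  # version -> number of handlers holding it
        self._changed = asyncio.Condition()

    def hold(self):
        version = self.current
        self._holders[version] = self._holders.get(version, 0) + 1
        return version

    async def release(self, version):
        async with self._changed:
            self._holders[version] -= 1
            if self._holders[version] == 0:
                del self._holders[version]
            self._changed.notify_all()

    async def stale_versions_in_use(self):
        # Returns once no holder references a version older than the current one.
        async with self._changed:
            await self._changed.wait_for(
                lambda: all(v >= self.current for v in self._holders))


async def write_handler(tracker, response):
    version = tracker.hold()
    try:
        await response.wait()  # never arrives once the transport is stopped
    finally:
        await tracker.release(version)


async def demo(cancel_handlers):
    tracker = VersionTracker()
    never = asyncio.Event()
    handlers = [asyncio.create_task(write_handler(tracker, never))
                for _ in range(3)]
    await asyncio.sleep(0)  # let every handler start and take its reference
    tracker.current += 1    # topology change: version 1 is now stale
    if cancel_handlers:
        for h in handlers:
            h.cancel()
        await asyncio.gather(*handlers, return_exceptions=True)
    try:
        await asyncio.wait_for(tracker.stale_versions_in_use(), timeout=0.1)
        result = "barrier completed"
    except asyncio.TimeoutError:
        result = "barrier stuck"
    for h in handlers:  # clean up whatever is still pending
        h.cancel()
    await asyncio.gather(*handlers, return_exceptions=True)
    return result
```

Calling `asyncio.run(demo(False))` models the pre-fix situation, where the barrier times out; `asyncio.run(demo(True))` models the fix, where cancelling the handlers first lets the barrier finish.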

Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.
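
The close-and-swap idea can be sketched as follows. This is a simplified Python asyncio model under stated assumptions, not the actual Seastar gate or storage proxy code: `Gate`, `GateClosed`, `Proxy`, `start_handler`, and `cancel_all` are hypothetical names chosen for illustration. The replaceable `_gate` attribute plays the role of the heap-allocated gate.

```python
import asyncio


class GateClosed(Exception):
    pass


class Gate:
    """Minimal gate: counts entries, and close() waits until all have left."""

    def __init__(self):
        self._count = 0
        self._closed = False
        self._idle = asyncio.Event()
        self._idle.set()

    def enter(self):
        if self._closed:
            raise GateClosed()
        self._count += 1
        self._idle.clear()

    def leave(self):
        self._count -= 1
        if self._count == 0:
            self._idle.set()

    async def close(self):
        self._closed = True
        await self._idle.wait()


class Proxy:
    def __init__(self):
        self._gate = Gate()  # held by reference so it can be swapped
        self._tasks = set()

    def start_handler(self, make_coro):
        gate = self._gate
        gate.enter()  # raises GateClosed after a permanent close
        task = asyncio.create_task(make_coro())
        self._tasks.add(task)

        def finished(t):
            self._tasks.discard(t)
            gate.leave()  # always leaves the gate it entered, even if swapped

        task.add_done_callback(finished)
        return task

    async def cancel_all(self, allow_new):
        old = self._gate
        if allow_new:
            self._gate = Gate()  # new handlers enter the fresh gate
        for t in list(self._tasks):
            t.cancel()
        await old.close()  # wait for the handlers that entered the old gate


async def demo():
    proxy = Proxy()
    forever = asyncio.Event()
    proxy.start_handler(forever.wait)
    await proxy.cancel_all(allow_new=True)   # drain path
    t = proxy.start_handler(lambda: asyncio.sleep(0))
    await t                                  # an internal write is still admitted
    admitted = t.done() and not t.cancelled()
    await proxy.cancel_all(allow_new=False)  # final shutdown path
    try:
        proxy.start_handler(forever.wait)
        rejected = False
    except GateClosed:
        rejected = True
    return admitted, rejected
```

In this model each handler remembers the gate it entered, so swapping the gate never strands a handler on the wrong counter; that is the property the swap relies on.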

Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.
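
Waiting for a specific log line, as the test does, amounts to polling the log until a pattern appears or a deadline passes. The helper below is a generic, self-contained sketch; `wait_for_log` is a hypothetical function written for illustration and is not the API of the ScyllaDB test framework.

```python
import re
import time


def wait_for_log(path, pattern, timeout=5.0, poll=0.05):
    """Return the first line in `path` matching `pattern`, or raise TimeoutError."""
    regex = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Re-reading the whole file keeps the sketch simple; a real helper
            # would likely remember its offset between polls.
            with open(path, "r", errors="replace") as log:
                for line in log:
                    if regex.search(line):
                        return line.rstrip("\n")
        except FileNotFoundError:
            pass  # the log may not exist yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{pattern!r} not found in {path}")
        time.sleep(poll)
```

With such a helper, a test would wait on `'Stop transport: done'`, which is logged before the drain blocks, rather than on a message that only appears after the drain has completed.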

Fixes: SCYLLADB-2163
Refs: scylladb/scylladb#23665
(cherry picked from commit 2927f0dd21)

2026-05-21 18:58:06 +00:00

auth_cluster

test/auth_cluster: simulate v1 state in self-heal test

2026-05-14 15:33:39 +03:00

dtest

test: prepare max cells inserts

2026-05-15 19:04:12 +00:00

lwt

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test: run test_mv_admission_control_exception on one shard

2026-05-13 08:56:27 +03:00

object_store

storage_service: Disable snapshots after raft decommission

2026-05-12 11:42:14 +03:00

random_failures

test/random_failures: remove gossip shadow round injection

2026-04-15 16:30:55 +02:00

storage

test: storage: retry fusermount3 unmount on teardown

2026-05-18 12:25:22 +03:00

tasks

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

__init__.py

…

conftest.py

Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz

2026-04-24 09:10:43 +03:00

test_aggregation.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_alternator_proxy_protocol.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_alternator.py

Merge 'Barrier and drain logging' from Gleb Natapov

2026-05-06 10:26:44 +02:00

test_audit.py

Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski

2026-04-22 18:56:28 +02:00

test_automatic_cleanup.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_bad_initial_token.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_batchlog_manager.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_blocked_bootstrap.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented

2026-04-19 11:06:30 +02:00

test_boot_nodes.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_bootstrap_with_quick_group0_join.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_bti_index.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_cdc_generation_clearing.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_cdc_generation_data.py

test: filter benign errors in tests that grep logs during shutdown

2026-04-13 18:33:41 +02:00

test_cdc_generation_publishing.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_cdc_with_alter.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_cdc_with_tablets.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_change_ip.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_change_replication_factor_1_to_0.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_change_rpc_address.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_client_routes.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_cluster_features.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs

2026-04-19 11:06:30 +02:00

test_commitlog_segment_data_resurrection.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_commitlog.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_compaction_backpressure.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_concurrent_schema.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_config_live_updates.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_config.py

test_config: improve logging for wait_for_config API

2026-04-15 14:28:31 +02:00

test_config.yaml

test: cluster: enable migrated audit tests and make them work

2026-03-19 16:07:28 +01:00

test_conflicting_keys_read_repair.py

…

test_coordinator_queue_management.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_counter_write_timeout_metric.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_counters_with_tablets.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_crash_coordinator_before_streaming.py

test: fix flaky test_kill_coordinator_during_op

2026-05-02 16:27:16 +03:00

test_create_table_during_node_shutdown.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_data_resurrection_after_cleanup.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_data_resurrection_in_memtable.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_decommission_kill_then_replace.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_decommission.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_deprecating_cluster_features.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_describe.py

test: auth_cluster: use safe_driver_shutdown() for Cluster teardown

2026-04-21 17:45:11 +02:00

test_different_group0_ids.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_encryption.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_error_becoming_voter.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_failure_after_group0_server_registration.py

raft/group0: fix destroy assertion on startup failure

2026-05-05 10:48:13 +02:00

test_fencing.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_global_ignore_nodes.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_gossiper_empty_self_id_on_shadow_round.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_gossiper_orphan_remover.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_gossiper_race.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_gossiper.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_group0_recovers_after_partial_command_application.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_guardrails.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_hints.py

test/cluster: fix flaky test_hints_consistency_during_replace

2026-04-23 17:03:48 +02:00

test_incremental_repair.py

Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity

2026-05-08 12:24:23 +02:00

test_initial_token.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_internode_compression.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_ip_mappings.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_keyspace_rf.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_left_node_notification.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_logstor.py

db: Remove redundant enable_logstor config option

2026-04-15 14:40:15 +03:00

test_long_join.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_long_query_timeout_erm.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_lwt_semaphore.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_maintenance_mode.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_major_compaction.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_metadata_id.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_multidc.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_mutation_schema_change.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_mv.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented

2026-04-19 11:06:30 +02:00

test_no_dc_rack_change.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_no_removed_node_event_on_ip_change.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_node_isolation.py

Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz

2026-04-24 09:10:43 +03:00

test_node_ops_metrics.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_node_shutdown_waits_for_pending_requests.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_nodetool.py

test.py: remove deprecated skip_mode decorator

2026-01-25 18:17:27 +02:00

test_not_enough_token_owners.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_prepare_race.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_proxy_protocol.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_query_rebounce.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_raft_cluster_features.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_raft_ignore_nodes.py

test/cluster: scale failure_detector_timeout_in_ms by build mode

2026-04-20 15:28:34 +02:00

test_raft_no_quorum.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_raft_recovery_during_join.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_raft_recovery_entry_loss.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_raft_recovery_user_data.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_raft_snapshot_request.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_raft_snapshot_truncation.py

test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation

2026-05-21 16:06:24 +02:00

test_raft_voters.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_random_tables.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_read_repair.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_refresh.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_remove_alive_node.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_remove_rpc_client_with_pending_requests.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_repair.py

repair: Reject repair requests where start and end tokens are equal

2026-05-12 11:58:04 +03:00

test_replace_alive_node.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_replace_with_encryption.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_replace_with_same_ip_twice.py

test/cluster: scale failure_detector_timeout_in_ms by build mode

2026-04-20 15:28:34 +02:00

test_replace.py

test/cluster: scale failure_detector_timeout_in_ms by build mode

2026-04-20 15:28:34 +02:00

test_replica_exceptions.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_rest_api_on_startup.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_restart_cluster.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_resurrection.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_reversed_queries_during_simulated_upgrade_process.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_rpc_compression.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_select_from_mutation_fragments.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_shutdown_hang.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_size_based_load_balancing.py

test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity

2026-04-16 08:38:33 +02:00

test_snapshot_with_tablets.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_snapshot.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_sstable_cleanup_stop.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_sstable_compression_config.py

cql/statement: Create keyspace_metadata with correct initial_tablets count

2026-04-20 17:57:38 +03:00

test_sstable_compression_dictionaries_autotrain.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_sstable_compression_dictionaries_basic.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_sstable_compression_dictionaries_upgrade.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_sstable_set.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_start_bootstrapped_with_invalid_seed.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs

2026-04-19 11:06:30 +02:00

test_streaming_deadlock.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_strong_consistency.py

Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta

2026-04-22 15:48:27 +03:00

test_table_desc_read_barrier.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_table_drop.py

sstables_loader: prevent use-after-free on table drop during streaming

2026-04-20 07:39:51 +03:00

test_tablet_repair_scheduler.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tablet_stats.py

Merge 'topology_coordinator: join tablet load stats refresh in stop()' from Andrzej Jackowski

2026-05-10 13:56:42 +03:00

test_tablets2.py

test: test drop table during streaming

2026-04-15 19:23:00 +02:00

test_tablets_colocation.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented

2026-04-19 11:06:30 +02:00

test_tablets_cql.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tablets_intranode.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tablets_lwt.py

storage_service: cancel write handlers during drain to prevent shutdown deadlock

2026-05-21 18:58:06 +00:00

test_tablets_merge.py

test: fix flaky test_tablets_split_merge_with_many_tables

2026-05-13 09:18:30 +03:00

test_tablets_migration.py

Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz

2026-04-24 09:10:43 +03:00

test_tablets_parallel_decommission.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tablets_removenode.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tablets.py

service: skip load_sketch unload for excluded nodes on RF shrink

2026-05-16 19:18:18 +03:00

test_tls.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tombstone_gc.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_tools_perf.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_failure_recovery.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_ops_encrypted.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_ops_with_rf_rack_valid.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_ops.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_rejoin.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_remove_decom.py

test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs

2026-04-19 11:06:30 +02:00

test_topology_schema.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_topology_smp.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_truncate_concurrent_writes.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_truncate_with_drop.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_truncate_with_tablets.py

test: add test_split_emitted_during_truncate

2026-04-13 11:05:03 +02:00

test_ttl_row.py

test: wait for TTL scheduling sanity metric

2026-05-13 08:59:23 +03:00

test_unfinished_writes_during_shutdown.py

storage_service: cancel write handlers during drain to prevent shutdown deadlock

2026-05-21 18:58:06 +00:00

test_uninitialized_conns_semaphore.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_vector_store.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_view_build_status.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_view_building_coordinator.py

test/cluster/test_view_building_coordinator: fix view_updates_drained predicate

2026-04-23 17:52:22 +03:00

test_vnodes_to_tablets_migration.py

test: cluster: Verify vnodes-to-tablets migration virtual task

2026-04-17 21:13:52 +03:00

test_write_query_during_cql_server_shutdown.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_writes_to_previous_cdc_generations.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_zero_token_nodes_multidc.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_zero_token_nodes_no_replication.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

test_zero_token_nodes_topology_ops.py

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

util.py

test: fix flaky test_kill_coordinator_during_op

2026-05-02 16:27:16 +03:00