scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-30 05:07:05 +00:00

Files

Wojciech Mitros ebaf536449 replica/database: fix cross-shard deadlock in lock_tables_metadata()

lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard.  It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.

When two fibers call lock_tables_metadata() concurrently, this can
deadlock.  parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues.  The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue.  This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).
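
The circular wait described above can be made concrete with a small standard-library sketch. It is not Seastar code: plain threads stand in for fibers, two std::mutex objects stand in for two shards' locks, and every name in it (circular_wait_observed, shard_x, shard_y) is hypothetical. Each thread takes one lock and then uses try_lock() on the other, so the sketch reports the conflict instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Illustration only: threads play the role of fibers A and B.
// Returns true when each thread holds one lock and finds the other one taken.
bool circular_wait_observed() {
    std::mutex shard_x, shard_y;            // stand-ins for two shards' locks
    std::atomic<int> holding{0}, tried{0};
    bool a_got_second = true, b_got_second = true;

    auto fiber = [&](std::mutex& first, std::mutex& second, bool& got_second) {
        std::lock_guard<std::mutex> guard(first);
        // Wait until both threads hold their first lock.
        ++holding;
        while (holding.load() < 2) std::this_thread::yield();
        // A blocking lock() here would never return; try_lock() just reports it.
        got_second = second.try_lock();
        if (got_second) second.unlock();
        // Keep holding the first lock until the other thread has tried as well.
        ++tried;
        while (tried.load() < 2) std::this_thread::yield();
    };

    std::thread a([&] { fiber(shard_x, shard_y, a_got_second); });  // wins X first
    std::thread b([&] { fiber(shard_y, shard_x, b_got_second); });  // wins Y first
    a.join();
    b.join();
    assert(holding.load() == 2 && tried.load() == 2);
    return !a_got_second && !b_got_second;
}
```

Calling circular_wait_observed() returns true: neither thread can obtain its second lock while the other still holds it, which is the state the two lock_tables_metadata() callers end up in.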

In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO.  Even so, the SMP queues can still reorder messages
in release builds, so the deadlock remains possible there, only much
less likely.  In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.

This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata().  When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:

  1. activate_large_data_virtual_tables() — calls
     legacy_drop_table_on_all_shards() which calls
     lock_tables_metadata() synchronously via .get()

  2. reload_schema_in_bg() — fires as a background fiber from
     TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
     schema_applier::commit() which also calls lock_tables_metadata()

If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions.  The topology coordinator then waits for
the node to reach the expected state, but the node is stuck, so
eventually the read_barrier times out after 300 seconds.

Fix by acquiring the shard 0 lock before attempting to acquire
any other lock.  Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
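
The ordering rule can be sketched with the standard library alone. This is an assumption-laden illustration, not the actual patch: std::mutex objects stand in for the per-shard tables_metadata._cf_lock, threads stand in for fibers, and the names shard_locks, lock_all_shards and run_two_callers are hypothetical. The remaining shards are taken in reverse order on purpose, to show that only the "shard 0 first" rule matters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the per-shard locks.
constexpr std::size_t shard_count = 4;
std::array<std::mutex, shard_count> shard_locks;

// Take shard 0 first; only its holder goes on to the remaining shards.
std::vector<std::unique_lock<std::mutex>> lock_all_shards() {
    std::vector<std::unique_lock<std::mutex>> held;
    held.reserve(shard_count);
    held.emplace_back(shard_locks[0]);              // the gate
    for (std::size_t s = shard_count - 1; s >= 1; --s) {
        held.emplace_back(shard_locks[s]);          // order among these is free
    }
    assert(held.size() == shard_count);
    return held;                                    // locks released on destruction
}

// Two competing callers, each locking all shards `rounds` times.
// Returns how many critical sections completed in total.
int run_two_callers(int rounds) {
    int completed = 0;                              // guarded by the shard locks
    auto worker = [&] {
        for (int i = 0; i < rounds; ++i) {
            auto held = lock_all_shards();
            ++completed;
        }
    };
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return completed;
}
```

With this ordering, run_two_callers(1000) finishes and returns 2000. A caller blocked on shard 0 holds no other lock, so it can never be part of a cycle.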

Tested manually with two approaches:

  1. Forcing different lock_tables_metadata() calls to acquire
     different shard locks, by adding sleeps that depend on the
     lock_tables_metadata() call and the target shard.  This
     reproduced the issue consistently.

  2. Aligning the time at which both fibers reach
     lock_tables_metadata() by adding a single sleep to one of the
     fibers.  This depends heavily on the machine, so it cannot serve
     as a universal reproducer, but it did produce the observed
     failure on my machine once the right sleep time was found.

Also added a unit test for concurrent lock_tables_metadata() calls.

Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684

Closes scylladb/scylladb#29678

2026-04-29 21:13:53 +02:00

alternator

Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta

2026-04-22 15:48:27 +03:00

boost

replica/database: fix cross-shard deadlock in lock_tables_metadata()

2026-04-29 21:13:53 +02:00

broadcast_tables

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

cluster

Merge 'audit: fix maintenance socket startup/shutdown ordering' from Andrzej Jackowski

2026-04-29 10:37:38 +02:00

cql

Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk

2026-04-22 01:46:11 +02:00

cqlpy

Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta

2026-04-22 15:48:27 +03:00

ldap

test: ldap: add test for pruner crash during shutdown

2026-04-24 13:34:09 +02:00

lib

compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode

2026-04-20 16:59:09 -03:00

manual

test: auth_cluster: use safe_driver_shutdown() for Cluster teardown

2026-04-21 17:45:11 +02:00

nodetool

test: migrate runtime pytest.skip() to typed skip_env()

2026-04-19 11:09:29 +02:00

perf

audit: split startup into construction and storage phases

2026-04-28 18:58:42 +02:00

pylib

Merge 'test.py: fix test collection bug' from Andrei Chekun

2026-04-28 11:52:35 +03:00

pylib_test

test.py: fix framework test

2026-04-25 18:04:55 +02:00

raft

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

resource

test/ldap: add LDAP filter-injection reproducers

2026-04-08 13:53:49 +02:00

rest_api

test: migrate runtime pytest.skip() to typed skip_env()

2026-04-19 11:09:29 +02:00

scylla_gdb

test: migrate runtime pytest.skip() to typed skip_bug()

2026-04-19 11:10:42 +02:00

unit

LICENSE: Update to version 1.1

2026-04-12 19:46:33 +03:00

vector_search

cql3: statement_restrictions: prepare statement_restrictions for capturing this

2026-04-19 20:57:03 +03:00

__init__.py

test.py: delete dead code in test.py

2026-04-16 22:08:31 +02:00

CMakeLists.txt

test/cmake: add missing tests to boost test suite

2026-03-29 16:17:45 +03:00

conftest.py

test.py: remove testpy_test_fixture_scope

2026-04-16 22:08:33 +02:00

pytest.ini

test: exclude pylib_test from default test runs

2026-04-22 11:38:40 +02:00

README.md

…

README.md

Scylla in-source tests.

For details on how to run the tests, see docs/dev/testing.md

Shared C++ utilities and libraries are in lib/; shared Python code is in pylib/.

alternator - Python tests which connect to a single server and use the DynamoDB API
unit, boost, raft - unit tests in C++
cqlpy - Python tests which connect to a single server and use CQL
topology* - tests that set up clusters and add/remove nodes
cql - approval tests that use CQL and pre-recorded output
rest_api - tests for Scylla REST API Port 9000
scylla-gdb - tests for scylla-gdb.py helper script
nodetool - tests for C++ implementation of nodetool

If you can use an existing folder, consider adding your test to it. New folders should be used for new large categories/subsystems, or when the test environment is significantly different from some existing suite, e.g. you plan to start scylladb with different configuration, and you intend to add many tests and would like them to reuse an existing Scylla cluster (clusters can be reused for tests within the same folder).

To add a new folder, create a new directory, and then copy & edit its suite.ini.