lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock on every shard. It used invoke_on_all(), which dispatches the lock acquisitions to all shards in parallel via parallel_for_each + smp::submit_to. When two fibers call lock_tables_metadata() concurrently, this can deadlock.

parallel_for_each starts all iterations unconditionally: even when the local shard's lock attempt blocks (because the other fiber already holds it), SMP messages are still sent to the remote shards. Both fibers' lock-acquisition messages land in the per-shard SMP queues. The SMP queue itself is FIFO, but process_incoming() drains it and schedules each item as a reactor task via add_task(), which, in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE, shuffles each newly added task against all pending tasks in the same scheduling group's reactor task queue. This means fiber A's lock acquisition can be reordered past fiber B's (and past unrelated tasks) on a given shard. If fiber A wins the lock on shard X while fiber B wins on shard Y, we get a classic cross-shard lock-ordering deadlock (circular wait).

In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor task queue is FIFO. Even in release builds, however, the SMP queues themselves can reorder messages, so the deadlock remains possible, just much less likely. In debug and sanitize builds, the task-queue shuffle makes the deadlock very likely whenever both fibers' lock-acquisition tasks are pending simultaneously in the reactor task queue on any shard.

This deadlock was exposed by ce00d61917 ("db: implement large_data virtual tables with feature flag gating", merged as 88a8324e68), which introduced legacy_drop_table_on_all_shards as a second caller of lock_tables_metadata(). When LARGE_DATA_VIRTUAL_TABLES is enabled during topology_state_load (via feature_service::enable), two fibers can race:

1. activate_large_data_virtual_tables() - calls legacy_drop_table_on_all_shards(), which calls lock_tables_metadata() synchronously via .get()
2. reload_schema_in_bg() - fires as a background fiber from TABLE_DIGEST_INSENSITIVE_TO_EXPIRY and eventually reaches schema_applier::commit(), which also calls lock_tables_metadata()

If both reach lock_tables_metadata() while the lock is free on all shards, the parallel acquisition creates the deadlock opportunity. The deadlock blocks topology_state_load() from completing, which prevents the bootstrapping node from finishing its topology state transitions. The topology coordinator then waits for the node to reach the expected state, but the node is stuck, so the read_barrier eventually times out after 300 seconds.

Fix by acquiring the shard 0 lock before attempting to acquire any other lock. Whichever fiber wins shard 0 is guaranteed to acquire all remaining shards before the other fiber can proceed past shard 0, eliminating the circular-wait condition.

Tested manually with two approaches:

1. Forcing different lock_tables_metadata() calls to acquire different shard locks first, by adding sleeps that vary with the caller and target shard - this reproduced the issue consistently.
2. Aligning the time at which both fibers reach lock_tables_metadata() by adding a single sleep to one of them - this depends heavily on the machine, so it cannot serve as a universal reproducer, but it did produce the observed failure on my machine once the right sleep time was found.

Also added a unit test for concurrent lock_tables_metadata() calls.

Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684

Closes scylladb/scylladb#29678
Scylla
What is Scylla?
Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.
For more information, please see the ScyllaDB web site.
Build Prerequisites
Scylla is fairly fussy about its build environment, requiring very recent versions of the C++23 compiler and of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).
Building Scylla
Building Scylla with the frozen toolchain dbuild is as easy as:
$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla
For further information, please see:
- Developer documentation for more information on building Scylla.
- Build documentation on how to build Scylla binaries, tests, and packages.
- Docker image build documentation for information on how to build Docker images.
Running Scylla
To start a Scylla server, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1
This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory.
The --developer-mode flag is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations).
Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.
For more run options, run:
$ ./tools/toolchain/dbuild ./build/release/scylla --help
Testing
See test.py manual.
Scylla APIs and compatibility
By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.
Documentation
Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.
Training
Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.
Contributing to Scylla
If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.
If you are a developer working on Scylla, please read the developer guidelines.
Contact
- The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
- The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.