mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 00:20:47 +00:00

Raphael S. Carvalho 16e387d5f9 repair/replica: Fix race window where post-repair data is wrongly promoted to repaired

During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED.  The repair lifecycle is:

  1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
  2. Row-level repair streams missing rows between replicas.
  3. mark_sstable_as_repaired() runs on all replicas, rewriting the
     SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
  4. The coordinator atomically commits sstables_repaired_at=N+1 and
     the end_repair stage to Raft, then broadcasts
     repair_update_compaction_ctrl which calls clear_being_repaired().

The bug lives in the window between steps 3 and 4.  After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N.  The classifier therefore sees:

  is_repaired(N, sst{repaired_at=N+1}) == false
  sst->being_repaired == null   (lost on restart, or not yet set)

and puts them in the UNREPAIRED view.  If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables.  The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.

Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan.  Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED.  This
divergence is never healed by future repairs because the repaired set is
considered authoritative.  The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.
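
The sequence above can be replayed with a small Python model.  This is a
sketch only, not ScyllaDB code: the SSTable dataclass, classify_old() and
compact() are hypothetical stand-ins, and the is_repaired() rule
(0 < repaired_at <= sstables_repaired_at) is an assumption that is merely
consistent with the values quoted in this message.

```python
from dataclasses import dataclass
from enum import Enum


class View(Enum):
    UNREPAIRED = "unrepaired"
    REPAIRING = "repairing"
    REPAIRED = "repaired"


@dataclass
class SSTable:
    # Hypothetical model of the two fields that matter for the race.
    repaired_at: int = 0          # persisted in SSTable metadata
    being_repaired: bool = False  # in-memory only, lost on restart


def is_repaired(sstables_repaired_at: int, sst: SSTable) -> bool:
    # Assumed rule: repaired_at == 0 means "never repaired".
    return 0 < sst.repaired_at <= sstables_repaired_at


def classify_old(sstables_repaired_at: int, sst: SSTable) -> View:
    # Pre-fix behaviour as described above: only the in-memory flag
    # can keep an SSTable in the REPAIRING view.
    if sst.being_repaired:
        return View.REPAIRING
    if is_repaired(sstables_repaired_at, sst):
        return View.REPAIRED
    return View.UNREPAIRED


def compact(inputs: list[SSTable]) -> SSTable:
    # Compaction preserves the maximum repaired_at of its inputs.
    return SSTable(repaired_at=max(s.repaired_at for s in inputs))


def replay_race(n: int) -> tuple[View, int, View]:
    # After step 3: rewritten with repaired_at = n + 1, flag not set.
    repaired = SSTable(repaired_at=n + 1)
    # A post-repair write is flushed with repaired_at = 0.
    fresh = SSTable(repaired_at=0)
    # Raft still says n, so the rewritten SSTable looks unrepaired.
    before = classify_old(n, repaired)
    # Minor compaction merges both SSTables from the UNREPAIRED view.
    merged = compact([repaired, fresh])
    # Step 4 commits n + 1: the merged output is promoted.
    after = classify_old(n + 1, merged)
    return before, merged.repaired_at, after
```

Calling replay_race(1) walks through the window with N=1: the rewritten
SSTable is first seen as unrepaired, the merge yields repaired_at=2, and
the merged output is then classified as repaired although it contains the
post-repair write.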

The fix has two layers:

Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes.  This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.

Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory.  It checks:

  sst.repaired_at == sstables_repaired_at + 1
  AND tablet transition kind == tablet_transition_kind::repair

Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft.  Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.

The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.
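
A minimal Python sketch of the two-layer check follows.  It is a model,
not the C++ implementation: the names SSTable, TransitionKind and
classify() are hypothetical, the transition argument stands in for the
tablet transition read from the tablet map, and it is an assumption that
the helper also honours the in-memory flag before falling back to the
durable condition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class View(Enum):
    UNREPAIRED = "unrepaired"
    REPAIRING = "repairing"
    REPAIRED = "repaired"


class TransitionKind(Enum):
    # Hypothetical subset of tablet transition kinds.
    REPAIR = "repair"
    MIGRATION = "migration"


@dataclass
class SSTable:
    repaired_at: int = 0          # persisted in SSTable metadata
    being_repaired: bool = False  # in-memory only, lost on restart


def is_repaired(sstables_repaired_at: int, sst: SSTable) -> bool:
    # Assumed rule: repaired_at == 0 means "never repaired".
    return 0 < sst.repaired_at <= sstables_repaired_at


def is_being_repaired(sst: SSTable, sstables_repaired_at: int,
                      transition: Optional[TransitionKind]) -> bool:
    # Layer 1: in-memory flag set when the repaired SSTable is written.
    if sst.being_repaired:
        return True
    # Layer 2: durable condition that survives a restart.
    return (sst.repaired_at == sstables_repaired_at + 1
            and transition is TransitionKind.REPAIR)


def classify(sst: SSTable, sstables_repaired_at: int,
             transition: Optional[TransitionKind]) -> View:
    if is_being_repaired(sst, sstables_repaired_at, transition):
        return View.REPAIRING
    if is_repaired(sstables_repaired_at, sst):
        return View.REPAIRED
    return View.UNREPAIRED
```

With sstables_repaired_at=1 and a repair transition in progress, an
SSTable with repaired_at=2 stays in the REPAIRING view even when
being_repaired is false, so it cannot be compacted together with a fresh
flush.  Once sstables_repaired_at=2 is committed and the transition ends,
the same SSTable is classified as REPAIRED.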

A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
  - Running repair round 1 to establish sstables_repaired_at=1.
  - Injecting delay_end_repair_update to hold the race window open.
  - Running repair round 2 so all replicas complete mark_sstable_as_repaired
    (repaired_at=2) but the coordinator has not yet committed step 4.
  - Writing post-repair keys to all replicas and flushing servers[1] to
    create an SSTable with repaired_at=0 on disk.
  - Restarting servers[1] so being_repaired is lost from memory.
  - Waiting for autocompaction to merge the two SSTables on servers[1].
  - Asserting that the merged SSTable contains post-repair keys (the bug)
    and that servers[0] and servers[2] do not see those keys as repaired.

NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check.  I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29244

2026-04-09 11:42:28 +03:00

.github

.github/workflows/call_validate_pr_author_email.yml: add missing workflow permissions

2026-04-08 12:19:55 +03:00

abseil @ 255c84dadd

abseil: update to lts_2026_01_07

2026-04-08 12:19:54 +03:00

alternator

Merge 'alternator: fix batch write item squashing cdc entries' from Radosław Cybulski

2026-04-07 17:49:23 +03:00

api

Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron

2026-04-09 08:37:51 +03:00

audit

audit: replace batch dynamic_cast with static_cast

2026-01-26 18:14:38 +01:00

auth

Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz

2026-03-19 14:37:22 +01:00

bin

…

cdc

cdc: drop usage of cdc_local table and v1 generation definition

2026-03-10 10:39:59 +02:00

cmake

build: drop -fexperimental-assignment-tracking clang option

2025-12-22 14:33:48 +02:00

compaction

replica: Pick any compaction group for resharding

2026-03-24 11:06:38 +02:00

conf

docs: add missing rack value for internode_compression parameter

2026-04-08 12:19:54 +03:00

cql3

vector_search: fix SELECT on local vector index

2026-03-30 16:46:48 +02:00

data_dictionary

strong consistency: replace local consistency with global

2026-04-08 12:52:32 +02:00

Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron

2026-04-09 08:37:51 +03:00

debug

…

dht

dht: Introduce raw_token

2026-03-18 16:25:20 +01:00

dist

scylla_swap_setup: Remove Before=swap.target dependency from swap unit

2026-04-05 15:07:50 +03:00

docs

docs: add missing rack value for internode_compression parameter

2026-04-08 12:19:54 +03:00

ent

Merge 'ldap: fix double-free of LDAPMessage in poll_results()' from Andrzej Jackowski

2026-04-07 17:27:43 +02:00

exceptions

…

gms

feature_service: Add vnodes_to_tablets_migrations feature

2026-03-24 11:06:38 +02:00

idl

logstor: change index to btree by token per table

2026-03-18 19:24:28 +01:00

index

index: fix DESC INDEX for vector index

2026-03-30 16:46:48 +02:00

keys

…

lang

cql: vector: fix vector dimension type

2026-02-26 14:46:53 +02:00

licenses

utils: license: import crypt_sha512.c from musl to the project

2025-12-10 15:36:18 +01:00

locator

token_metadata: Clear _topology_change_info gently

2026-04-08 12:19:54 +03:00

message

transport: add remote statement preparation for CQL forwarding

2026-03-12 19:43:35 +01:00

mutation

mutation/collection_mutation: don't copy the serialized collection

2026-03-12 13:57:40 +02:00

mutation_writer

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

node_ops

tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children

2026-03-18 15:37:24 +01:00

pgo

Update pgo profiles - aarch64

2026-04-05 16:58:02 +03:00

query

…

raft

Merge 'raft: include demoted voters in read barrier during joint config' from Qian Cheng

2026-04-08 12:37:27 +02:00

readers

compaction: resharding_compaction: add vnodes_resharding option

2026-03-24 11:06:38 +02:00

reloc

…

repair

repair/replica: Fix race window where post-repair data is wrongly promoted to repaired

2026-04-09 11:42:28 +03:00

replica

repair/replica: Fix race window where post-repair data is wrongly promoted to repaired

2026-04-09 11:42:28 +03:00

rust

cmake: fix precompiled header (PCH) creation

2026-03-24 15:53:40 +02:00

schema

Merge 'cql3: fix DESCRIBE INDEX WITH INTERNALS name' from Piotr Smaron

2026-04-09 08:37:51 +03:00

scripts

scripts/base36-uuid: dump date in UTC

2026-04-08 12:19:55 +03:00

seastar @ 4d268e0ef5

Update seastar submodule

2026-03-10 22:06:58 +02:00

service

strong consistency: replace local consistency with global

2026-04-08 12:52:32 +02:00

sstables

Merge 'storage: implement sstable clone for object storage' from Ernest Zaslavsky

2026-04-08 09:35:10 +03:00

streaming

Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk

2026-03-13 09:55:55 +02:00

swagger-ui @ 12f1da1082

…

tasks

tasks: fix indentation

2026-03-18 15:37:24 +01:00

test

repair/replica: Fix race window where post-repair data is wrongly promoted to repaired

2026-04-09 11:42:28 +03:00

tools

scylla-nodetool: Add migrate-to-tablets subcommand

2026-03-25 19:11:29 +02:00

tracing

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

transport

Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros

2026-03-13 15:03:10 +01:00

types

cql3: fix null handling in data_value formatting

2026-04-01 14:15:18 +02:00

unified

dist: tune tcp_mem to 3% of total memory in scylla-kernel-conf package

2026-03-05 12:51:04 +03:00

utils

s3_client: pass through abort_source in copy_object

2026-04-07 18:16:52 +03:00

vector_search

vector_search: fix race condition on connection timeout

2026-03-13 16:28:22 +01:00

.clang-format

…

.dockerignore

…

.gitattributes

…

.gitignore

…

.gitmodules

…

.gitorderfile

…

.mailmap

…

absl-flat_hash_map.cc

…

absl-flat_hash_map.hh

…

amplify.yml

…

backlog_controller_fwd.hh

db/config: introduce new config parameter compaction_max_shares

2025-11-24 12:52:29 -03:00

backlog_controller.hh

db/config: introduce new config parameter compaction_max_shares

2025-11-24 12:52:29 -03:00

build_mode.hh

…

bytes_fwd.hh

…

bytes_ostream.hh

…

bytes.cc

…

bytes.hh

…

cartesian_product.hh

…

client_data.cc

…

client_data.hh

service/client_state and alternator/server: use cached values for driver_name and driver_version fields

2025-12-20 12:26:22 -05:00

clocks-impl.cc

…

clocks-impl.hh

…

CMakeLists.txt

node_ops: remove topology over node ops code

2026-02-25 10:08:32 +02:00

configure.py

abseil: update to lts_2026_01_07

2026-04-08 12:19:54 +03:00

CONTRIBUTING.md

…

coverage_excludes.txt

…

coverage_sources.list

…

db_clock.hh

…

debug.cc

storage_service: Check raft rpc scheduling group from debug namespace

2026-02-03 06:34:03 +02:00

debug.hh

storage_service: Check raft rpc scheduling group from debug namespace

2026-02-03 06:34:03 +02:00

default.nix

…

Doxyfile

…

encoding_stats.hh

…

enum_set.hh

auth: add possibilty to check for any permission in set

2025-10-03 16:55:57 +02:00

exported_templates.cc

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

exported_templates.hh

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

fix_system_distributed_tables.py

…

flake.lock

…

flake.nix

…

gc_clock.hh

…

gdbinit

…

gen_segmented_compress_params.py

…

HACKING.md

…

hashing_partition_visitor.hh

…

idl-compiler.py

idl-compiler.py: raise TypeError instead of raw str

2026-01-13 08:33:17 +02:00

inet_address_vectors.hh

…

init.cc

db: add logstor experimental feature flag

2026-03-18 19:24:26 +01:00

init.hh

Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov"

2026-01-05 08:53:41 +02:00

install-dependencies.sh

build: add slirp4netns to dependencies

2026-03-05 17:44:17 +02:00

install.sh

install.sh: fix REST API paths for nonroot installations

2026-02-27 15:32:54 +02:00

LICENSE-ScyllaDB-Source-Available.md

…

main.cc

Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak

2026-03-20 00:18:09 +02:00

marshal_exception.hh

…

mutation_query.cc

…

mutation_query.hh

…

NOTICE.txt

…

ORIGIN

…

partition_builder.hh

…

partition_range_compat.hh

…

partition_slice_builder.cc

…

partition_slice_builder.hh

…

query_ranges_to_vnodes.cc

…

query_ranges_to_vnodes.hh

…

reader_concurrency_semaphore_group.cc

reader_concurrency_semaphore: Add preemptive_abort_factor to constructors

2026-01-28 14:20:01 +01:00

reader_concurrency_semaphore_group.hh

reader_concurrency_semaphore: Add preemptive_abort_factor to constructors

2026-01-28 14:20:01 +01:00

reader_concurrency_semaphore.cc

reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory

2026-03-13 09:50:05 +01:00

reader_concurrency_semaphore.hh

test: verify signal() detects resource negative leak in rcs

2026-03-20 09:21:20 +03:00

reader_permit.hh

permit_reader: Add a new state: preemptive_aborted

2026-01-28 14:20:01 +01:00

README.md

docs: fix link to docker build README.MD

2026-02-18 12:12:46 +01:00

real_dirty_memory_accounter.hh

…

release.cc

…

release.hh

…

reversibly_mergeable.hh

…

schema_upgrader.hh

…

scylla_post_install.sh

…

scylla-gdb.py

abseil: update to lts_2026_01_07

2026-04-08 12:19:54 +03:00

SCYLLA-VERSION-GEN

Update ScyllaDB version to: 2026.2.0-dev

2026-01-25 11:09:17 +02:00

seastarx.hh

…

serialization_visitors.hh

…

serializer_impl.hh

…

serializer.cc

…

serializer.hh

…

service_permit.hh

…

shell.nix

…

sstable_dict_autotrainer.cc

dictionary compression: add missing co_awaits on get_units

2026-02-18 16:40:40 +01:00

sstable_dict_autotrainer.hh

…

sstables_loader.cc

sstables_loader: use new sstable add path

2026-03-23 10:33:04 +03:00

sstables_loader.hh

streaming: refactor get_sstables_for_tablets to make it accessible

2025-12-08 12:30:23 +02:00

stdafx.cc

Add precompiled headers to CMakeLists.txt

2025-11-21 12:27:41 +02:00

stdafx.hh

utils: add rolling max tracker

2026-03-12 08:56:41 +01:00

supervisor.hh

…

table_helper.cc

transport/messages: hold pinned prepared entry in PREPARE result

2026-03-10 14:17:57 +02:00

table_helper.hh

…

test.py

test.py: introduce new scheduler for choosing job count

2026-04-01 11:11:15 +03:00

timeout_config.cc

…

timeout_config.hh

…

tombstone_gc_extension.hh

…

tombstone_gc_options.cc

…

tombstone_gc_options.hh

tombstone_gc_options: add C++ friendly constructor

2026-03-03 14:09:28 +02:00

tombstone_gc-internals.hh

…

tombstone_gc.cc

tombstone_gc: don't use real-db for validation and determining default

2026-04-07 13:56:24 +03:00

tombstone_gc.hh

tombstone_gc: tombstone_gc_state::for_tests(): remove unused param

2026-03-12 10:01:42 +01:00

ubsan-suppressions.supp

…

unimplemented.cc

…

unimplemented.hh

…

validation.cc

…

validation.hh

…

version.hh

…

view_info.hh

…

vint-serialization.cc

vint: Use std::countl_zero()

2026-03-18 16:25:21 +01:00

vint-serialization.hh

vint: Use std::countl_zero()

2026-03-18 16:25:21 +01:00

README.md

Scylla

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment: it requires a very recent C++23 compiler and recent versions of many libraries. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything on your build machine to meet Scylla's requirements; you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Developer documentation for more information on building Scylla.
Build documentation on how to build Scylla binaries, tests, and packages.
Docker image build documentation for information on how to build Docker images.

Running Scylla

To start the Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode flag is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (these checks are not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.

Languages

C++ 72.8%

Python 25.9%

CMake 0.4%

GAP 0.3%

Shell 0.3%