Raphael S. Carvalho 16e387d5f9 repair/replica: Fix race window where post-repair data is wrongly promoted to repaired
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED.  The repair lifecycle is:

  1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
  2. Row-level repair streams missing rows between replicas.
  3. mark_sstable_as_repaired() runs on all replicas, rewriting the
     SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
  4. The coordinator atomically commits sstables_repaired_at=N+1 and
     the end_repair stage to Raft, then broadcasts
     repair_update_compaction_ctrl which calls clear_being_repaired().

The bug lives in the window between steps 3 and 4.  After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N.  The classifier therefore sees:

  is_repaired(N, sst{repaired_at=N+1}) == false
  sst->being_repaired == null   (lost on restart, or not yet set)

and puts them in the UNREPAIRED view.  If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables.  The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.

Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan.  Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED.  This
divergence is never healed by future repairs because the repaired set is
considered authoritative.  The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.

The fix has two layers:

Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes.  This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.

Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory.  It checks:

  sst.repaired_at == sstables_repaired_at + 1
  AND tablet transition kind == tablet_transition_kind::repair

Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft.  Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.

The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.

A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
  - Running repair round 1 to establish sstables_repaired_at=1.
  - Injecting delay_end_repair_update to hold the race window open.
  - Running repair round 2 so all replicas complete mark_sstable_as_repaired
    (repaired_at=2) but the coordinator has not yet committed step 4.
  - Writing post-repair keys to all replicas and flushing servers[1] to
    create an SSTable with repaired_at=0 on disk.
  - Restarting servers[1] so being_repaired is lost from memory.
  - Waiting for autocompaction to merge the two SSTables on servers[1].
  - Asserting that the merged SSTable contains post-repair keys (the bug)
    and that servers[0] and servers[2] do not see those keys as repaired.

NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check.  I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29244
2026-04-09 11:42:28 +03:00
2026-04-08 12:19:54 +03:00
2026-03-18 16:25:20 +01:00
2026-04-05 16:58:02 +03:00
2026-03-10 22:06:58 +02:00
2026-03-18 15:37:24 +01:00
2026-04-08 12:19:54 +03:00
2026-04-08 12:19:54 +03:00
2026-03-12 08:56:41 +01:00

Scylla

Slack Twitter

What is Scylla?

Scylla is the real-time big data database that is API-compatible with Apache Cassandra and Amazon DynamoDB. Scylla embraces a shared-nothing approach that increases throughput and storage capacity to realize order-of-magnitude performance improvements and reduce hardware costs.

For more information, please see the ScyllaDB web site.

Build Prerequisites

Scylla is fairly fussy about its build environment, requiring very recent versions of the C++23 compiler and of many libraries to build. The document HACKING.md includes detailed information on building and developing Scylla, but to get Scylla building quickly on (almost) any build machine, Scylla offers a frozen toolchain. This is a pre-configured Docker image which includes recent versions of all the required compilers, libraries and build tools. Using the frozen toolchain allows you to avoid changing anything in your build machine to meet Scylla's requirements - you just need to meet the frozen toolchain's prerequisites (mostly, Docker or Podman being available).

Building Scylla

Building Scylla with the frozen toolchain dbuild is as easy as:

$ git submodule update --init --force --recursive
$ ./tools/toolchain/dbuild ./configure.py
$ ./tools/toolchain/dbuild ninja build/release/scylla

For further information, please see:

Running Scylla

To start Scylla server, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --workdir tmp --smp 1 --developer-mode 1

This will start a Scylla node with one CPU core allocated to it and data files stored in the tmp directory. The --developer-mode is needed to disable the various checks Scylla performs at startup to ensure the machine is configured for maximum performance (not relevant on development workstations). Please note that you need to run Scylla with dbuild if you built it with the frozen toolchain.

For more run options, run:

$ ./tools/toolchain/dbuild ./build/release/scylla --help

Testing

Build with the latest Seastar Check Reproducible Build clang-nightly

See test.py manual.

Scylla APIs and compatibility

By default, Scylla is compatible with Apache Cassandra and its API - CQL. There is also support for the API of Amazon DynamoDB™, which needs to be enabled and configured in order to be used. For more information on how to enable the DynamoDB™ API in Scylla, and the current compatibility of this feature as well as Scylla-specific extensions, see Alternator and Getting started with Alternator.

Documentation

Documentation can be found here. Seastar documentation can be found here. User documentation can be found here.

Training

Training material and online courses can be found at Scylla University. The courses are free, self-paced and include hands-on examples. They cover a variety of topics including Scylla data modeling, administration, architecture, basic NoSQL concepts, using drivers for application development, Scylla setup, failover, compactions, multi-datacenters and how Scylla integrates with third-party applications.

Contributing to Scylla

If you want to report a bug or submit a pull request or a patch, please read the contribution guidelines.

If you are a developer working on Scylla, please read the developer guidelines.

Contact

  • The community forum and Slack channel are for users to discuss configuration, management, and operations of ScyllaDB.
  • The developers mailing list is for developers and people interested in following the development of ScyllaDB to discuss technical topics.
Description
No description provided
Readme 329 MiB
Languages
C++ 72.8%
Python 25.9%
CMake 0.4%
GAP 0.3%
Shell 0.3%