Commit Graph

10 Commits

Author SHA1 Message Date
Raphael S. Carvalho
474e962e01 compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
  only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
  compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
  all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
  repaired-only optimization is active; used by get_max_purgeable_timestamp() in
  compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
  is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
  process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
  D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
  tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
  compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
  landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
  writing T_base so D_base is staged on servers[0] via row-sync; blocks the
  view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
  (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
  in repaired set); asserts no resurrection; releases injection; waits for staging to
  complete; asserts no resurrection after a second flush+compaction. Demonstrates that
  the read-before-write in stream_view_replica_updates() makes the optimization safe even
  when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 16:59:09 -03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Botond Dénes
81e214237f Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.

Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.

However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.

The new mechanism creates a clone sstable with a fresh generation:

- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction

This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.

Backport is not required, it is a new feature

Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453

Closes scylladb/scylladb#28338

* github.com:scylladb/scylladb:
  docs: document components_digests subcomponent and trailing digest in Scylla.db
  sstable_compaction_test: Add tests for perform_component_rewrite
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: replace rewrite_statistics with new rewrite component mechanism
  sstables: add new rewrite component mechanism for safe sstable component rewriting
  compaction: add compaction_group_view method to specify sstable version
  sstables: add null_data_sink and serialized_checksum for checksum-only calculation
  sstables: extract default write open flags into a constant
  sstables: Add write_simple_with_digest for component checksumming
  sstables: Extract file writer closing logic into separate methods
  sstables: Implement CRC32 digest-only writer
2026-03-10 16:02:53 +02:00
Botond Dénes
f3ee6a0bd1 compaction: use tombstone_gc_state with value semantics
Instead of passing around references to it, pass around values.

This object is now designed to be used as a value-type, after recent
refactoring.
2026-03-03 14:09:27 +02:00
Taras Veretilnyk
4aa0a3acf9 compaction: add compaction_group_view method to specify sstable version
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
2026-02-26 22:38:55 +01:00
Raphael S. Carvalho
2dae0a7380 compaction: Preserve state of input sstable in maybe_split_new_sstable()
This is crucial with MVs, since the splitting must preserve the state of
the original sstable. We want the sstable to be in staging dir, so it's
excluded when calculating the diff for performing pushes to view
replicas.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-12-12 16:59:50 -03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00