mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-24 18:40:38 +00:00
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: for base tables with tombstone_gc=repair, the unrepaired and repairing sets are never the source of a GC-blocking shadow.

The key ordering guarantee that makes this safe:

- topology_coordinator sends the send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because:

- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (using USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check therefore cannot cause any tombstone to be wrongly collected. The memtable check is skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views: the optimization IS applied to materialized view tables. Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints, and staging via the view-update-generator.
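The ordering guarantee above can be sketched as a toy model: gc_before derives only from the Raft-committed repair_time, which is committed strictly after every replica has promoted its sstables to the repaired set. All names here (replica_state, coordinator, repair_round) are illustrative, not ScyllaDB's actual types.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of the commit ordering: mark-as-repaired on all replicas happens
// inside the RPC, and repair_time is committed to Raft only after the RPC
// returns, so gc_before can never advance past data not yet in the repaired set.
struct replica_state {
    bool data_in_repaired_set = false; // D promoted repairing -> repaired
};

struct coordinator {
    int64_t committed_repair_time = 0; // Raft-committed repair_time
    int64_t propagation_delay = 10;

    // Steps 1+2: mark_sstable_as_repaired() on all replicas, then Raft commit.
    void repair_round(replica_state* replicas, int n, int64_t repair_time) {
        for (int i = 0; i < n; i++) {
            replicas[i].data_in_repaired_set = true; // inside the RPC
        }
        committed_repair_time = repair_time;         // commit happens last
    }

    // Step 3: gc_before only advances once the commit applies.
    int64_t gc_before() const {
        return committed_repair_time - propagation_delay;
    }
};
```

Under this model, any tombstone with deletion_time < gc_before() was observed by a repair round that already promoted its shadowed data on every replica.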
Both are safe:

1. View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows, it finds D_view in the repaired set, keeping T_mv non-purgeable.

2. View-update-generator staging path: base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates(), which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates() would dispatch D_view). This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable).
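The read-before-write decision in path 2 reduces to a timestamp comparison: a staged base row whose marker is at or below the existing partition tombstone's timestamp is dead, so no view update is built. This is a minimal sketch of that predicate; should_dispatch_view_update() is an illustrative name, not ScyllaDB's API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Sketch of the read-before-write check in the staging path: the generator
// reads the current base state first; if an existing partition tombstone
// covers the staged row's marker, the view update is a no-op and D_view is
// never dispatched to the MV replica.
bool should_dispatch_view_update(std::optional<int64_t> existing_tombstone_ts,
                                 int64_t staged_row_marker_ts) {
    if (existing_tombstone_ts && *existing_tombstone_ts >= staged_row_marker_ts) {
        return false; // T_base expires D_base's row marker (ts_d <= ts_t)
    }
    return true;      // row is live: a view update would be generated
}
```

The equal-timestamp case goes to the tombstone, matching the usual deletion semantics where a tombstone covers data written at or before its own timestamp.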
After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this: a base hint is replayed as a base mutation, and the resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned.

Implementation:

- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the repaired-only optimization is active; used by get_max_purgeable_timestamp() in compaction.cc to bypass the memtable shadow check.
- A private is_tombstone_gc_repaired_only() helper gates both methods: it requires is_repaired_view(this) && tombstone_gc_mode == repair, with no is_view() exclusion.
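The repaired-set classification described above can be sketched as a filter over the main sstable list: an sstable counts as repaired once its on-disk stamp is at or below the committed sstables_repaired_at, so a staging sstable stamped sstables_repaired_at + 1 is excluded until base repair commits the advanced value, and stays included under every later repair. Assumptions: sst_info, is_repaired_sketch, and make_repaired_set_sketch are hypothetical simplifications (including treating repaired_at == 0 as "never repaired"), not ScyllaDB's actual repair::is_repaired() or make_repaired_sstable_set().

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct sst_info {
    int64_t repaired_at; // 0 == never repaired (assumed convention)
};

// Repaired iff stamped, and stamped at or below the Raft-committed value.
bool is_repaired_sketch(int64_t sstables_repaired_at, const sst_info& sst) {
    return sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at;
}

// Filter the main set down to repaired sstables only.
std::vector<sst_info> make_repaired_set_sketch(int64_t sstables_repaired_at,
                                               const std::vector<sst_info>& ssts) {
    std::vector<sst_info> out;
    for (const auto& sst : ssts) {
        if (is_repaired_sketch(sstables_repaired_at, sst)) {
            out.push_back(sst);
        }
    }
    return out;
}
```

With sstables_repaired_at = 5, a staging sstable stamped 6 is excluded; once a base repair commits 6 (or anything higher), it is included, which is exactly why D_base keeps shadowing T_base in repaired compactions.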
- Add error injection "view_update_generator_pause_before_processing" in process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV tablet repair (flush_hints delivers D_view + T_mv before the snapshot), triggers repaired compaction, and asserts the MV row is NOT visible — T_mv is preserved because D_view landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before writing T_base so D_base is staged on servers[0] via row-sync; blocks the view-update-generator with an error injection; writes T_base + T_mv; runs MV repair (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view in the repaired set); asserts no resurrection; releases the injection; waits for staging to complete; asserts no resurrection after a second flush+compaction. Demonstrates that the read-before-write in stream_view_replica_updates() makes the optimization safe even when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
83 lines
3.7 KiB
C++
/*
 * Copyright (C) 2021-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
 */

#pragma once

#include <seastar/core/condition-variable.hh>

#include "schema/schema_fwd.hh"
#include "sstables/open_info.hh"
#include "compaction_descriptor.hh"

class reader_permit;

namespace sstables {
class sstable_set;
class sstables_manager;
struct sstable_writer_config;
}

namespace compaction {

class compaction_strategy;
class compaction_strategy_state;
class compaction_backlog_tracker;

class compaction_group_view {
public:
    virtual ~compaction_group_view() {}
    virtual dht::token_range token_range() const noexcept = 0;
    virtual const schema_ptr& schema() const noexcept = 0;
    // min threshold as defined by table.
    virtual unsigned min_compaction_threshold() const noexcept = 0;
    virtual bool compaction_enforce_min_threshold() const noexcept = 0;
    virtual future<lw_shared_ptr<const sstables::sstable_set>> main_sstable_set() const = 0;
    virtual future<lw_shared_ptr<const sstables::sstable_set>> maintenance_sstable_set() const = 0;
    virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
    // Returns true when tombstone GC considers only the repaired sstable set, meaning the
    // memtable does not need to be consulted (its data is always newer than any GC-eligible tombstone).
    virtual bool skip_memtable_for_tombstone_gc() const noexcept = 0;
    virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
    virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
    virtual compaction_strategy& get_compaction_strategy() const noexcept = 0;
    virtual compaction_strategy_state& get_compaction_strategy_state() noexcept = 0;
    virtual reader_permit make_compaction_reader_permit() const = 0;
    virtual sstables::sstables_manager& get_sstables_manager() noexcept = 0;
    virtual sstables::shared_sstable make_sstable(sstables::sstable_state) const = 0;
    virtual sstables::shared_sstable make_sstable(sstables::sstable_state, sstables::sstable_version_types) const = 0;
    virtual sstables::sstable_writer_config configure_writer(sstring origin) const = 0;
    virtual api::timestamp_type min_memtable_timestamp() const = 0;
    virtual api::timestamp_type min_memtable_live_timestamp() const = 0;
    virtual api::timestamp_type min_memtable_live_row_marker_timestamp() const = 0;
    virtual bool memtable_has_key(const dht::decorated_key& key) const = 0;
    virtual future<> on_compaction_completion(compaction_completion_desc desc, sstables::offstrategy offstrategy) = 0;
    virtual bool is_auto_compaction_disabled_by_user() const noexcept = 0;
    virtual bool tombstone_gc_enabled() const noexcept = 0;
    virtual tombstone_gc_state get_tombstone_gc_state() const noexcept = 0;
    virtual compaction_backlog_tracker& get_backlog_tracker() = 0;
    virtual const std::string get_group_id() const noexcept = 0;
    virtual seastar::condition_variable& get_staging_done_condition() noexcept = 0;
    virtual dht::token_range get_token_range_after_split(const dht::token& t) const noexcept = 0;
    virtual int64_t get_sstables_repaired_at() const noexcept = 0;
};

} // namespace compaction

namespace fmt {

template <>
struct formatter<compaction::compaction_group_view> : formatter<string_view> {
    template <typename FormatContext>
    auto format(const compaction::compaction_group_view& t, FormatContext& ctx) const {
        auto s = t.schema();
        return fmt::format_to(ctx.out(), "{}.{} compaction_group={}", s->ks_name(), s->cf_name(), t.get_group_id());
    }
};

} // namespace fmt