scylladb/repair/task_manager_module.hh
Asias He 0d7e518a26 repair: Add tablet incremental repair support
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset, to speed up the
repair process. All repair participants must use the same selection
method so that they repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but less efficient, because it requires reading the entire
dataset and discarding data outside the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient,
because it allows repaired SSTables to be skipped entirely. This patch
implements the file-based selection method.
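As a rough illustration of why the file-based method is cheaper (a
hypothetical sketch, not the patch's actual code): selection is a filter
over SSTable metadata, so repaired SSTables never need to be read at all.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical per-SSTable metadata; the real patch reads repaired_at
// from the sstables::statistics component.
struct sstable_meta {
    uint64_t repaired_at; // 0 means the SSTable was never repaired
};

// File-based selection: keep only unrepaired SSTables. The decision
// uses metadata alone, so repaired SSTables are skipped without
// reading their data.
std::vector<sstable_meta> select_unrepaired(const std::vector<sstable_meta>& all) {
    std::vector<sstable_meta> out;
    std::copy_if(all.begin(), all.end(), std::back_inserter(out),
                 [](const sstable_meta& s) { return s.repaired_at == 0; });
    return out;
}
```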

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode mode is
less important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable can contain
data for multiple vnode ranges, so when a given vnode range is repaired,
only a portion of the SSTable is repaired. This significantly
complicates the manipulation of SSTables during both repair and
compaction. With tablets, an entire tablet is repaired at once, so an
SSTable is either fully repaired or not repaired at all, which is a
huge simplification.
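The granularity argument above can be illustrated with a minimal sketch
(types and names here are hypothetical, not ScyllaDB code): an SSTable
can be marked repaired as a unit only when the repaired range covers its
whole token span, which with tablets is always the case.

```cpp
#include <cstdint>

// Hypothetical inclusive token span; stands in for an SSTable's
// first/last token and for a repaired vnode or tablet range.
struct token_span {
    int64_t first;
    int64_t last;
};

// An SSTable can be marked repaired as a whole only if the repaired
// range covers its entire span. With tablets an SSTable belongs to
// exactly one tablet, so this always holds; with vnodes an SSTable may
// straddle several ranges and the check often fails.
bool fully_covered(token_span sstable, token_span repaired) {
    return repaired.first <= sstable.first && sstable.last <= repaired.last;
}
```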

This patch uses the repaired_at field from the sstables::statistics
component to mark an SSTable as repaired. It uses a virtual clock as the
repair timestamp, i.e., a monotonically increasing number stored in the
repaired_at field of an SSTable and in the sstables_repaired_at column
of the system.tablets table. Note that when an SSTable has not been
repaired, its repaired_at field keeps the default value of 0. The
in-memory being_repaired field of an SSTable explicitly marks that an
SSTable has been selected for the current repair. The following
variables are used for incremental repair:

The repaired_at on-disk field of an SSTable:
   - A 64-bit number that increases sequentially

The sstables_repaired_at column added to the system.tablets table:
   - repaired_at <= sstables_repaired_at means the SSTable is repaired

The being_repaired in-memory field of an SSTable:
   - A repair UUID that tells which SSTables participate in a given repair
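Putting the on-disk stamp and the per-tablet watermark together, the
repaired check described above can be sketched as follows (a minimal
illustration mirroring the description, not the actual implementation):

```cpp
#include <cstdint>

// repaired_at: the SSTable's on-disk virtual-clock stamp (0 = never
// repaired). sstables_repaired_at: the per-tablet watermark stored in
// the sstables_repaired_at column of system.tablets.
bool sstable_is_repaired(uint64_t repaired_at, uint64_t sstables_repaired_at) {
    // A zero repaired_at means the SSTable was never repaired at all;
    // otherwise it counts as repaired if its stamp is at or below the
    // tablet's watermark.
    return repaired_at != 0 && repaired_at <= sstables_repaired_at;
}
```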

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting the repair job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins; 2nd and 3rd runs took around 48s.
    The speedup is: 183 mins / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repair job.
    Regular repair 1st run took 110s; 2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110s; 2nd and 3rd runs took 1.5s.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results

    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset = sum of data across all nodes

    Dataset     Non-incremental repair (mm:ss)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (mm:ss)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472
2025-08-18 11:01:21 +08:00


/*
 * Copyright (C) 2022-present ScyllaDB
 */
/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
 */

#pragma once

#include "node_ops/node_ops_ctl.hh"
#include "repair/repair.hh"
#include "service/topology_guard.hh"
#include "streaming/stream_reason.hh"
#include "tasks/task_manager.hh"
namespace repair {

class repair_task_impl : public tasks::task_manager::task::impl {
protected:
    streaming::stream_reason _reason;
public:
    repair_task_impl(tasks::task_manager::module_ptr module, tasks::task_id id, unsigned sequence_number, std::string scope, std::string keyspace, std::string table, std::string entity, tasks::task_id parent_id, streaming::stream_reason reason) noexcept
        : tasks::task_manager::task::impl(module, id, sequence_number, std::move(scope), std::move(keyspace), std::move(table), std::move(entity), parent_id)
        , _reason(reason) {
        _status.progress_units = "ranges";
    }

    virtual std::string type() const override {
        return format("{}", _reason);
    }
protected:
    repair_uniq_id get_repair_uniq_id() const noexcept {
        return repair_uniq_id{
            .id = _status.sequence_number,
            .task_info = tasks::task_info(_status.id, _status.shard)
        };
    }

    virtual future<> run() override = 0;
};
class user_requested_repair_task_impl : public repair_task_impl {
private:
    lw_shared_ptr<locator::global_static_effective_replication_map> _germs;
    std::vector<sstring> _cfs;
    dht::token_range_vector _ranges;
    std::vector<sstring> _hosts;
    std::vector<sstring> _data_centers;
    std::unordered_set<locator::host_id> _ignore_nodes;
    bool _small_table_optimization;
    std::optional<int> _ranges_parallelism;
    gms::gossiper& _gossiper;
public:
    user_requested_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, std::string keyspace, std::string entity, lw_shared_ptr<locator::global_static_effective_replication_map> germs, std::vector<sstring> cfs, dht::token_range_vector ranges, std::vector<sstring> hosts, std::vector<sstring> data_centers, std::unordered_set<locator::host_id> ignore_nodes, bool small_table_optimization, std::optional<int> ranges_parallelism, gms::gossiper& gossiper) noexcept
        : repair_task_impl(module, id.uuid(), id.id, "keyspace", std::move(keyspace), "", std::move(entity), tasks::task_id::create_null_id(), streaming::stream_reason::repair)
        , _germs(germs)
        , _cfs(std::move(cfs))
        , _ranges(std::move(ranges))
        , _hosts(std::move(hosts))
        , _data_centers(std::move(data_centers))
        , _ignore_nodes(std::move(ignore_nodes))
        , _small_table_optimization(small_table_optimization)
        , _ranges_parallelism(ranges_parallelism)
        , _gossiper(gossiper)
    {}

    virtual tasks::is_abortable is_abortable() const noexcept override {
        return tasks::is_abortable::yes;
    }

    tasks::is_user_task is_user_task() const noexcept override;
protected:
    future<> run() override;

    virtual future<std::optional<double>> expected_total_workload() const override;
};
class data_sync_repair_task_impl : public repair_task_impl {
private:
    dht::token_range_vector _ranges;
    std::unordered_map<dht::token_range, repair_neighbors> _neighbors;
    optimized_optional<abort_source::subscription> _abort_subscription;
    size_t _cfs_size = 0;
public:
    data_sync_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, std::string keyspace, std::string entity, dht::token_range_vector ranges, std::unordered_map<dht::token_range, repair_neighbors> neighbors, streaming::stream_reason reason, shared_ptr<node_ops_info> ops_info)
        : repair_task_impl(module, id.uuid(), id.id, "keyspace", std::move(keyspace), "", std::move(entity), tasks::task_id::create_null_id(), reason)
        , _ranges(std::move(ranges))
        , _neighbors(std::move(neighbors))
    {
        if (ops_info && ops_info->as) {
            _abort_subscription = ops_info->as->subscribe([this] () noexcept {
                abort();
            });
        }
    }

    virtual tasks::is_abortable is_abortable() const noexcept override {
        return tasks::is_abortable(!_abort_subscription);
    }
protected:
    future<> run() override;

    virtual future<std::optional<double>> expected_total_workload() const override;
};
class tablet_repair_task_impl : public repair_task_impl {
private:
    sstring _keyspace;
    std::vector<sstring> _tables;
    std::vector<tablet_repair_task_meta> _metas;
    optimized_optional<abort_source::subscription> _abort_subscription;
    std::optional<int> _ranges_parallelism;
    size_t _metas_size = 0;
    gc_clock::time_point _flush_time;
    service::frozen_topology_guard _topo_guard;
    bool _skip_flush;
public:
    tablet_repair_sched_info sched_info;
public:
    tablet_repair_task_impl(tasks::task_manager::module_ptr module, repair_uniq_id id, sstring keyspace, tasks::task_id parent_id, std::vector<sstring> tables, streaming::stream_reason reason, std::vector<tablet_repair_task_meta> metas, std::optional<int> ranges_parallelism, service::frozen_topology_guard topo_guard, bool skip_flush = false)
        : repair_task_impl(module, id.uuid(), id.id, "keyspace", keyspace, "", "", parent_id, reason)
        , _keyspace(std::move(keyspace))
        , _tables(std::move(tables))
        , _metas(std::move(metas))
        , _ranges_parallelism(ranges_parallelism)
        , _topo_guard(topo_guard)
        , _skip_flush(skip_flush)
    {
    }

    virtual tasks::is_abortable is_abortable() const noexcept override {
        return tasks::is_abortable(!_abort_subscription);
    }

    gc_clock::time_point get_flush_time() const { return _flush_time; }
    tasks::is_user_task is_user_task() const noexcept override;
    virtual future<> release_resources() noexcept override;
private:
    size_t get_metas_size() const noexcept;
protected:
    future<> run() override;

    virtual future<std::optional<double>> expected_total_workload() const override;
};
class shard_repair_task_impl : public repair_task_impl {
public:
    repair_service& rs;
    seastar::sharded<replica::database>& db;
    seastar::sharded<netw::messaging_service>& messaging;
    service::migration_manager& mm;
    gms::gossiper& gossiper;
private:
    locator::effective_replication_map_ptr erm;
public:
    dht::token_range_vector ranges;
    std::vector<sstring> cfs;
    std::vector<table_id> table_ids;
    repair_uniq_id global_repair_id;
    std::vector<sstring> data_centers;
    std::vector<sstring> hosts;
    std::unordered_set<locator::host_id> ignore_nodes;
    std::unordered_map<dht::token_range, repair_neighbors> neighbors;
    uint64_t nr_ranges_finished = 0;
    size_t nr_failed_ranges = 0;
    int ranges_index = 0;
    repair_stats _stats;
    std::unordered_set<sstring> dropped_tables;
    bool _hints_batchlog_flushed = false;
    std::unordered_set<locator::host_id> nodes_down;
    bool _small_table_optimization = false;
    size_t small_table_optimization_ranges_reduced_factor = 1;
private:
    bool _aborted = false;
    std::optional<sstring> _failed_because;
    std::optional<semaphore> _user_ranges_parallelism;
    uint64_t _ranges_complete = 0;
    gc_clock::time_point _flush_time;
    service::frozen_topology_guard _frozen_topology_guard;
    service::topology_guard _topology_guard = {service::null_topology_guard};
public:
    tablet_repair_sched_info sched_info;
public:
    shard_repair_task_impl(tasks::task_manager::module_ptr module,
            tasks::task_id id,
            const sstring& keyspace,
            repair_service& repair,
            locator::effective_replication_map_ptr erm_,
            const dht::token_range_vector& ranges_,
            std::vector<table_id> table_ids_,
            repair_uniq_id parent_id_,
            const std::vector<sstring>& data_centers_,
            const std::vector<sstring>& hosts_,
            const std::unordered_set<locator::host_id>& ignore_nodes_,
            streaming::stream_reason reason_,
            bool hints_batchlog_flushed,
            bool small_table_optimization,
            std::optional<int> ranges_parallelism,
            gc_clock::time_point flush_time,
            service::frozen_topology_guard topo_guard,
            tablet_repair_sched_info sched_info = tablet_repair_sched_info());

    void check_failed_ranges();
    void check_in_abort_or_shutdown();
    repair_neighbors get_repair_neighbors(const dht::token_range& range);
    gc_clock::time_point get_flush_time() const { return _flush_time; }

    void update_statistics(const repair_stats& stats) {
        _stats.add(stats);
    }

    const std::vector<sstring>& table_names() {
        return cfs;
    }

    const std::string& get_keyspace() const noexcept {
        return _status.keyspace;
    }

    streaming::stream_reason reason() const noexcept {
        return _reason;
    }

    bool hints_batchlog_flushed() const {
        return _hints_batchlog_flushed;
    }

    locator::effective_replication_map_ptr get_erm();

    size_t get_total_rf() {
        return get_erm()->get_replication_factor();
    }

    future<> repair_range(const dht::token_range& range, table_info table);
    size_t ranges_size() const noexcept;
    virtual future<> release_resources() noexcept override;
protected:
    future<> do_repair_ranges();
    virtual future<tasks::task_manager::task::progress> get_progress() const override;
    future<> run() override;
};
// The repair::task_manager_module tracks ongoing repair operations and their progress.
// A repair which has already finished successfully is dropped from this
// table, but a failed repair will remain in the table forever so it can
// be queried about more than once (FIXME: reconsider this. But note that
// failed repairs should be rare anyway).
class task_manager_module : public tasks::task_manager::module {
private:
    repair_service& _rs;
    // Note that there are no "SUCCESSFUL" entries in the "status" map:
    // Successfully-finished repairs are those with id <= repair_module::_sequence_number
    // but aren't listed as running or failed in the status map.
    std::unordered_map<int, repair_status> _status;
    // Map repair id into repair_info.
    std::unordered_map<int, tasks::task_id> _repairs;
    std::unordered_set<tasks::task_id> _pending_repairs;
    // The semaphore used to control the maximum number of
    // ranges that can be repaired in parallel.
    named_semaphore _range_parallelism_semaphore;
    seastar::condition_variable _done_cond;

    void start(repair_uniq_id id);
    void done(repair_uniq_id id, bool succeeded);
public:
    static constexpr size_t max_repair_memory_per_range = 32 * 1024 * 1024;

    task_manager_module(tasks::task_manager& tm, repair_service& rs, size_t max_repair_memory) noexcept;

    repair_service& get_repair_service() noexcept {
        return _rs;
    }

    repair_uniq_id new_repair_uniq_id() noexcept {
        return repair_uniq_id{
            .id = new_sequence_number(),
            .task_info = tasks::task_info(tasks::task_id::create_random_id(), this_shard_id())
        };
    }

    repair_status get(int id) const;
    void check_in_shutdown();
    void add_shard_task_id(int id, tasks::task_id ri);
    void remove_shard_task_id(int id);
    tasks::task_manager::task_ptr get_shard_task_ptr(int id);
    std::vector<int> get_active() const;
    size_t nr_running_repair_jobs();
    void abort_all_repairs();
    named_semaphore& range_parallelism_semaphore();
    future<> run(repair_uniq_id id, std::function<void ()> func);
    future<repair_status> repair_await_completion(int id, std::chrono::steady_clock::time_point timeout);
    float report_progress();
    future<bool> is_aborted(const tasks::task_id& uuid, shard_id shard);
};

}
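The task_manager_module above coordinates completion through its status
map and done condition variable. A standalone sketch of that wait/notify
pattern using std primitives instead of seastar (all types and names
here are illustrative; the real module is sharded and non-blocking):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <unordered_map>

enum class repair_state { RUNNING, FAILED };

// Illustrative tracker: repairs that finish successfully are erased
// from the map (mirroring the "no SUCCESSFUL entries" comment above),
// while failed ones remain so they can be queried more than once.
struct repair_tracker {
    std::mutex m;
    std::condition_variable done_cond;            // plays the role of _done_cond
    std::unordered_map<int, repair_state> status; // plays the role of _status

    // Mirrors the shape of repair_await_completion: wait until the id
    // is no longer running, or until the timeout expires. Returns true
    // if the repair completed before the deadline.
    bool await_completion(int id, std::chrono::steady_clock::time_point timeout) {
        std::unique_lock<std::mutex> lk(m);
        return done_cond.wait_until(lk, timeout, [&] {
            auto it = status.find(id);
            return it == status.end() || it->second != repair_state::RUNNING;
        });
    }

    void done(int id, bool succeeded) {
        std::lock_guard<std::mutex> lk(m);
        if (succeeded) {
            status.erase(id);               // success: drop the entry entirely
        } else {
            status[id] = repair_state::FAILED;
        }
        done_cond.notify_all();
    }
};
```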