mirror of
https://github.com/scylladb/scylladb.git
synced 2026-04-27 20:05:10 +00:00
The central idea of incremental repair is to let repair participants
select and repair only a portion of the dataset, speeding up the
repair process. All repair participants must use the same selection
method so that they repair and synchronize the same selected data.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but less efficient, because it must read the entire dataset
and discard data outside the time frame. The file-based method selects
data from unrepaired SSTables and is more efficient, because repaired
SSTables can be skipped entirely. This patch implements the file-based
selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode path is
less important to support. On the other hand, incremental repair for
vnodes is much harder to implement. With vnodes, an SSTable can contain
data for multiple vnode ranges, so when a given vnode range is repaired,
only a portion of the SSTable is repaired. This significantly
complicates the manipulation of SSTables during both repair and
compaction. With tablets, an entire tablet is repaired at once, so an
SSTable is either fully repaired or not repaired at all, which is a huge
simplification.
This patch uses the repaired_at field from the sstables::statistics
component to mark an SSTable as repaired. The repair timestamp is a
virtual clock: a monotonically increasing number written to the
repaired_at field of an SSTable and to the sstables_repaired_at column
in the system.tablets table. Note that when an SSTable has not been
repaired, its repaired_at field keeps the default value of 0. The
following state is used for incremental repair:
- The on-disk repaired_at field of an SSTable: a 64-bit number that
  increases sequentially.
- The sstables_repaired_at column added to the system.tablets table:
  repaired_at <= sstables_repaired_at means the SSTable is repaired.
- The in-memory being_repaired field added to an SSTable: a repair UUID
  that tells which SSTables have participated in the repair.
Initial test results:
1) Medium dataset results
Node count: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting the repair job.
Repair timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins; 2nd and 3rd runs took around 48s.
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node count: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repair job.
Regular repair 1st run took 110s; 2nd and 3rd runs also took 110s.
Incremental repair 1st run took 110s; 2nd and 3rd runs took 1.5s.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node count: 6
Instance type: i4i.2xlarge, 3 racks
Background load: 50% of base load, 50% read/write
Dataset = sum of data across all nodes
Dataset    Non-incremental repair (mm:ss)
1.3 TiB    31:07
3.5 TiB    25:10
5.0 TiB    19:03
6.3 TiB    31:42
Dataset    Incremental repair (mm:ss)
1.3 TiB    24:32
3.0 TiB    13:06
4.0 TiB    5:23
4.8 TiB    7:14
5.6 TiB    3:58
6.3 TiB    7:33
7.0 TiB    6:55
Fixes #22472
Closes scylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
188 lines
9.7 KiB
C++
/*
 * Copyright (C) 2018-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
 */

#pragma once

#include <seastar/core/sstring.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <unordered_map>
#include <functional>
#include <set>
#include <unordered_set>
#include "seastarx.hh"
#include "db/schema_features.hh"
#include "gms/feature.hh"

namespace db {
class system_keyspace;
}
namespace service { class storage_service; }

namespace gms {

class gossiper;
class feature_service;
class i_endpoint_state_change_subscriber;

struct feature_config {
    std::set<sstring> disabled_features;
};

class unsupported_feature_exception : public std::runtime_error {
public:
    unsupported_feature_exception(std::string what)
        : runtime_error(std::move(what))
    {}
};

bool is_test_only_feature_enabled();

using namespace std::literals;

/**
 * A gossip feature tracks whether all the nodes the current one is
 * aware of support the specified feature.
 *
 * A pointer to `cql3::query_processor` can be optionally supplied
 * if the instance needs to persist enabled features in a system table.
 */
class feature_service final : public peering_sharded_service<feature_service> {
    void register_feature(feature& f);
    void unregister_feature(feature& f);
    friend class feature;
    std::unordered_map<sstring, std::reference_wrapper<feature>> _registered_features;
    std::unordered_set<sstring> _suppressed_features;

    feature_config _config;

    future<> enable_features_on_startup(db::system_keyspace&);
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
    void initialize_suppressed_features_set();
#endif
public:
    explicit feature_service(feature_config cfg);
    ~feature_service() = default;
    future<> stop();
    future<> enable(std::set<std::string_view> list);
    db::schema_features cluster_schema_features() const;
    std::set<std::string_view> supported_feature_set() const;

    // Key in the 'system.scylla_local' table, that is used to
    // persist enabled features
    static constexpr const char* ENABLED_FEATURES_KEY = "enabled_features";

public:
    gms::feature user_defined_functions { *this, "UDF"sv };
    gms::feature alternator_streams { *this, "ALTERNATOR_STREAMS"sv };
    gms::feature alternator_ttl { *this, "ALTERNATOR_TTL"sv };
    gms::feature range_scan_data_variant { *this, "RANGE_SCAN_DATA_VARIANT"sv };
    gms::feature cdc_generations_v2 { *this, "CDC_GENERATIONS_V2"sv };
    gms::feature user_defined_aggregates { *this, "UDA"sv };
    // Historically max_result_size contained only two fields: soft_limit and
    // hard_limit. It was somehow obscure because for normal paged queries both
    // fields were equal and meant page size. For unpaged queries and reversed
    // queries soft_limit was used to warn when the size of the result exceeded
    // the soft_limit and hard_limit was used to throw when the result was
    // bigger than this hard_limit. To clean things up, we introduced the third
    // field into max_result_size. Its name is page_size. Now page_size always
    // means the size of the page while soft and hard limits are just what their
    // names suggest. They are no longer interpreted as page size. This is not
    // a backwards compatible change so this new cluster feature is used to make
    // sure the whole cluster supports the new page_size field and we can safely
    // send it to replicas.
    gms::feature separate_page_size_and_safety_limit { *this, "SEPARATE_PAGE_SIZE_AND_SAFETY_LIMIT"sv };
    // Replica is allowed to send back empty pages to coordinator on queries.
    gms::feature empty_replica_pages { *this, "EMPTY_REPLICA_PAGES"sv };
    gms::feature empty_replica_mutation_pages { *this, "EMPTY_REPLICA_MUTATION_PAGES"sv };
    gms::feature supports_raft_cluster_mgmt { *this, "SUPPORTS_RAFT_CLUSTER_MANAGEMENT"sv };
    gms::feature tombstone_gc_options { *this, "TOMBSTONE_GC_OPTIONS"sv };
    gms::feature parallelized_aggregation { *this, "PARALLELIZED_AGGREGATION"sv };
    gms::feature keyspace_storage_options { *this, "KEYSPACE_STORAGE_OPTIONS"sv };
    gms::feature typed_errors_in_read_rpc { *this, "TYPED_ERRORS_IN_READ_RPC"sv };
    gms::feature uda_native_parallelized_aggregation { *this, "UDA_NATIVE_PARALLELIZED_AGGREGATION"sv };
    gms::feature aggregate_storage_options { *this, "AGGREGATE_STORAGE_OPTIONS"sv };
    gms::feature collection_indexing { *this, "COLLECTION_INDEXING"sv };
    gms::feature large_collection_detection { *this, "LARGE_COLLECTION_DETECTION"sv };
    gms::feature range_tombstone_and_dead_rows_detection { *this, "RANGE_TOMBSTONE_AND_DEAD_ROWS_DETECTION"sv };
    gms::feature truncate_as_topology_operation { *this, "TRUNCATE_AS_TOPOLOGY_OPERATION"sv };
    gms::feature secondary_indexes_on_static_columns { *this, "SECONDARY_INDEXES_ON_STATIC_COLUMNS"sv };
    gms::feature tablets { *this, "TABLETS"sv };
    gms::feature table_digest_insensitive_to_expiry { *this, "TABLE_DIGEST_INSENSITIVE_TO_EXPIRY"sv };
    // If this feature is enabled, schema versions are persisted by the group 0 command
    // that modifies schema instead of being calculated as a digest (hash) by each node separately.
    // The feature controls both the 'global' schema version (the one gossiped as application_state::SCHEMA)
    // and the per-table schema versions (schema::version()).
    // The feature affects non-Raft mode as well (e.g. during RECOVERY), where we send additional
    // tombstones and flags to schema tables when performing schema changes, allowing us to
    // revert to the digest method when necessary (if we must perform a schema change during RECOVERY).
    gms::feature group0_schema_versioning { *this, "GROUP0_SCHEMA_VERSIONING"sv };
    gms::feature supports_consistent_topology_changes { *this, "SUPPORTS_CONSISTENT_TOPOLOGY_CHANGES"sv };
    gms::feature host_id_based_hinted_handoff { *this, "HOST_ID_BASED_HINTED_HANDOFF"sv };
    gms::feature topology_requests_type_column { *this, "TOPOLOGY_REQUESTS_TYPE_COLUMN"sv };
    gms::feature native_reverse_queries { *this, "NATIVE_REVERSE_QUERIES"sv };
    gms::feature zero_token_nodes { *this, "ZERO_TOKEN_NODES"sv };
    gms::feature view_build_status_on_group0 { *this, "VIEW_BUILD_STATUS_ON_GROUP0"sv };
    gms::feature views_with_tablets { *this, "VIEWS_WITH_TABLETS"sv };
    gms::feature group0_limited_voters { *this, "GROUP0_LIMITED_VOTERS"sv };
    gms::feature compaction_history_upgrade { *this, "COMPACTION_HISTORY_UPGRADE"};

    // Whether to allow fragmented commitlog entries. While this is a node-local feature as such, hide
    // behind a feature to ensure an upgrading cluster appears to be at least functional before using,
    // to avoid data loss if rolling back in a dirty state, but also because it changes which/how mutations
    // can be applied to a given node - i.e. with it on, a node can accept larger, say, schema mutations,
    // whereas without it, it will fail the insert - i.e. for things like raft etc _all_ nodes should
    // have it or none, otherwise we can get partial failures on writes.
    gms::feature fragmented_commitlog_entries { *this, "FRAGMENTED_COMMITLOG_ENTRIES"sv };
    gms::feature maintenance_tenant { *this, "MAINTENANCE_TENANT"sv };

    gms::feature tablet_incremental_repair { *this, "TABLET_INCREMENTAL_REPAIR"sv };
    gms::feature tablet_repair_scheduler { *this, "TABLET_REPAIR_SCHEDULER"sv };
    gms::feature tablet_merge { *this, "TABLET_MERGE"sv };
    gms::feature tablet_rack_aware_view_pairing { *this, "TABLET_RACK_AWARE_VIEW_PAIRING"sv };

    gms::feature tablet_migration_virtual_task { *this, "TABLET_MIGRATION_VIRTUAL_TASK"sv };
    gms::feature tablet_resize_virtual_task { *this, "TABLET_RESIZE_VIRTUAL_TASK"sv };

    // A feature just for use in tests. It must not be advertised unless
    // the "features_enable_test_feature" injection is enabled.
    // This feature MUST NOT be advertised in release mode!
    gms::feature test_only_feature { *this, "TEST_ONLY_FEATURE"sv };
    gms::feature address_nodes_by_host_ids { *this, "ADDRESS_NODES_BY_HOST_IDS"sv };

    gms::feature in_memory_tables { *this, "IN_MEMORY_TABLES"sv };
    gms::feature workload_prioritization { *this, "WORKLOAD_PRIORITIZATION"sv };
    gms::feature colocated_tablets { *this, "COLOCATED_TABLETS"sv };
    gms::feature file_stream { *this, "FILE_STREAM"sv };
    gms::feature compression_dicts { *this, "COMPRESSION_DICTS"sv };
    gms::feature tablet_options { *this, "TABLET_OPTIONS"sv };
    gms::feature tablet_load_stats_v2 { *this, "TABLET_LOAD_STATS_V2"sv };
    gms::feature sstable_compression_dicts { *this, "SSTABLE_COMPRESSION_DICTS"sv };
    gms::feature repair_based_tablet_rebuild { *this, "REPAIR_BASED_TABLET_REBUILD"sv };
    gms::feature enforced_raft_rpc_scheduling_group { *this, "ENFORCED_RAFT_RPC_SCHEDULING_GROUP"sv };
    gms::feature load_and_stream_abort_rpc_message { *this, "LOAD_AND_STREAM_ABORT_RPC_MESSAGE"sv };
    gms::feature topology_global_request_queue { *this, "TOPOLOGY_GLOBAL_REQUEST_QUEUE"sv };
    gms::feature lwt_with_tablets { *this, "LWT_WITH_TABLETS"sv };
    gms::feature repair_msg_split { *this, "REPAIR_MSG_SPLIT"sv };
public:

    const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

    static std::set<sstring> to_feature_set(sstring features_string);
    future<> enable_features_on_join(gossiper&, db::system_keyspace&, service::storage_service&);
    future<> on_system_tables_loaded(db::system_keyspace& sys_ks);

    // Performs the feature check.
    // Throws an unsupported_feature_exception if there is a feature either
    // in `enabled_features` or `unsafe_to_disable_features` that is not being
    // currently supported by this node.
    void check_features(const std::set<sstring>& enabled_features, const std::set<sstring>& unsafe_to_disable_features);
};

} // namespace gms