The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.
This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes#22472Closesscylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
Flush failure with seastar::named_gate_closed_exception is expected
if a respective compaction group was already stopped.
Lower the severity of a log in dirty_memory_manager::flush_one
for this exception.
Fixes: https://github.com/scylladb/scylladb/issues/25037.
Closesscylladb/scylladb#25355
This patch addes incremental_repair support in compaction.
- The sstables are split into repaired and unrepaired set.
- Repaired and unrepaired set compact sperately.
- The repaired_at from sstable and sstables_repaired_at from
system.tablets table are used to decide if a sstable is repaired or
not.
- Different compactions tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.
Fixes#24248
* Enhancement, no backport needed
Closesscylladb/scylladb#24512
* github.com:scylladb/scylladb:
streaming: stream_blob: use the table sstable_generation_generator
replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
sstables: sstable_generation_generator: stop tracking highest generation
replica: table: get rid of update_sstables_known_generation
sstables: sstable_directory: stop tracking highest_generation
replica: distributed_loader: stop tracking highest_generation
sstables: sstable_generation: get rid of uuid_identifiers bool class
sstables_manager: drop uuid_sstable_identifiers
feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
test: cql_query_test: add test_sstable_load_mixed_generation_type
test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
test: database_test: move table_dir helper to test/lib/test_utils
To combine the max purgable values, instead of just combining the
timestamp values. The former way is still correct, but loses the
timestamp explosion optimization, which allows the cache reader to drop
timestamps from the overlap checks.
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
No need to start a local sharded generator.
Can just use the table's sstable generation generator
to make new sstables now that it's stateless and doesn't
depend on the highest generation found (including the uploaded
sstables).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not needed anymore.
With that database::_sstable_generation_generator can
be a regular member rather than optional and initialized
later.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not needed anymore as we always generate
uuid generations.
Move highest_generation_seen(sharded<sstables::sstable_directory>& directory)
to sstables/sstable_directory module.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is returning constant sstables::uuid_identifiers::yes now,
so let the callers just use the constant (to be dropped
in a following patch).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Wired the unrepaired, repairing and repaired views into compaction_group.
Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.
Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.
Additionally, we made changes changes to compaction_group_view
to return only sstables that belong to the underlying view.
From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.
The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistest with global metadata, creating
complexity
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.
Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to support incremental repair, we'll allow each
replica::compaction_group to have two logical compaction groups
(or logical sstable sets), one for repaired, another for unrepaired.
That means we have to adapt a few places to work with
compaction_group_view instead, such that no logical compaction
group is missed when doing table or tablet wide operations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since there will be only one physical sstable set, it makes sense to move
backlog tracker to replica::compaction_group. With incremental repair,
it still makes sense to compute backlog accounting both logical sets,
since the compound backlog influences the overall read amplification,
and the total backlog across repaired and unrepaired sets can help
driving decisions like giving up on incremental repair when unrepaired
set is almost as large as the repaired set, causing an amplification
of 2.
Also it's needed for correctness because a sstable can move quickly
across the logical sets, and having one tracker for each logical
set could cause the sstable to not be erased in the old set it
belonged to;
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closesscylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could, since the plan is to get rid of it
once all system keyspaces areconverted to local or tablets replication
(and propagated everywhere if needed using raft group0)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to *_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to create_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prefer for specializing the local replication strategy,
local effective replication map, et. al byt defining
an is_local() predicate, similar to uses_tablets().
Note that is_vnode_based() still applies to local replication
strategy.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
Fixesscylladb/scylladb#19061
Backport is not required, it is new functionality
Closesscylladb/scylladb#25063
* github.com:scylladb/scylladb:
docs: Add documentation for the nodetool dropquarantinedsstables command
nodetool: add command for dropping quarantine sstables
rest_api: add endpoint which drops all quarantined sstables
TRUNCATE TABLE performs a memtable flush and then discards the sstables
of the table being truncated. It collects the highest replay position
for both of these. When the highest replay position of the discarded
sstables is higher than the highest replay position of the flushed
memtable, that means that we have had writes during truncate which have
been flushed to disk independently of the truncate process. We check for
this and trigger an on_internal_error() which throws an exception,
informing the user that writing data concurrently with TRUNCATE TABLE is
not advised.
The problem with this is that truncate is also called from DROP KEYSPACE
and DROP TABLE. These are raft operations and exceptions thrown by them
are caught by the (...) exception handler in the raft applier fiber,
which then exits leaving the node without the ability to execute
subsequent raft commands.
This commit changes the on_internal_error() into a warning log entry. It
also outputs to keyspace/table names, the truncated_at timepoint, the
offending replay positions which caused the check to fail.
Fixes: #25173Fixes: #25013
The plural overload of this method logs an error when the sstable add
fails. This is unnecessary, the caller is expected to catch and handle
exceptions. Furthermore, this unconditional error log results in
sporadic test failures, due to the unexpected error in the logs on
shutdown.
Fixes: #24850Closesscylladb/scylladb#25235
When running compactions are aborted by the aforementioned helper, in logs there appear a line like
"Compaction for ks/cf was stopped due to: user-triggered operation". This message could've been better, since it may indicate several distinct reasons described with the same "user-triggered operation".
With this PR the message will help telling "truncate", "cleanup", "rewrite" and "split" from each other.
Closesscylladb/scylladb#25136
* https://github.com/scylladb/scylladb:
compaction: Pass "reason" to perform_task_on_all_files()
compaction: Pass "reason" to run_with_compaction_disabled()
compaction: Pass "reason" to stop_and_disable_compaction()
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
Fixesscylladb/scylladb#19061
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.
Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
* When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
* This breaks LWT correctness in certain scenarios.
* Failing test cases demonstrating this:
* test_lwt_state_is_preserved_on_tablet_migration
* test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
* Tablets may be placed on arbitrary shards by the tablet balancer.
* Accessing Paxos state in system tables could require a shard jump, degrading performance.
We move Paxos state into a dedicated Paxos table, colocated with the base table:
* Each base table gets its own Paxos state table.
* This table is lazily created on the first LWT operation.
* Its tablets are colocated with those of the base table, ensuring:
* Co-migration during tablet movement
* Co-rebuilding with the base table
* Shard alignment for local access to Paxos state
Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].
This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]:
* Tablet migration vs LWT correctness
* Paxos table sharding
Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.
References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)
Backport: not needed because this is a new feature.
Closesscylladb/scylladb#24819
* github.com:scylladb/scylladb:
create_keyspace: fix warning for tablets
docs: fix lwt.rst
docs: fix tablets.rst
alternator: enable LWT
random_failures: enable execute_lwt_transaction
test_tablets_lwt: add test_paxos_state_table_permissions
test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
test_tablets_lwt: test timeout creating paxos state table
test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
test_tablets_lwt: migrate test_lwt_support_with_tablets
test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
test_tablets_lwt: add simple test for LWT
check_internal_table_permissions: handle Paxos state tables
client_state: extract check_internal_table_permissions
paxos_store: handle base table removal
database: get_base_table_for_tablet_colocation: handle paxos state table
paxos_state: use node_local_only mode to access paxos state
query_options: add node_local_only mode
storage_proxy: handle node_local_only in query
storage_proxy: handle node_local_only in mutate
storage_proxy: introduce node_local_only flag
abstract_replication_strategy: remove unused using
storage_proxy: add coordinator_mutate_options
storage_proxy: rename create_write_response_handler -> make_write_response_handler
storage_proxy: simplify mutate_prepare
paxos_state: lazily create paxos state table
migration_manager: add timeout to start_group0_operation and announce
paxos_store: use non-internal queries
qp: make make_internal_options public
paxos_store: conditional cf_id filter
paxos_store: coroutinize
feature_service: add LWT_WITH_TABLETS feature
paxos_state: inline system_keyspace functions into paxos_store
paxos_state: extract state access functions into paxos_store
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft.
Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit().
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/19649
Fixes https://github.com/scylladb/scylladb/issues/24531Closesscylladb/scylladb#24886
[avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>]
* github.com:scylladb/scylladb:
test: add type creation to test_snapshot
storage_service: always wake up load balancer on update tablet metadata
db: schema_applier: call destroy also when exception occurs
db: replica: simplify seeding ERM during shema change
db: remove cleanup from add_column_family
db: abort on exception during schema commit phase
db: make user defined types changes atomic
replica: db: make keyspace schema changes atomic
db: atomically apply changes to tables and views
replica: make truncate_table_on_all_shards get whole schema from table_shards
service: split update_tablet_metadata into two phases
service: pull out update_tablet_metadata from migration_listener
db: service: add store_service dependency to schema_applier
service: simplify load_tablet_metadata and update_tablet_metadata
db: don't perform move on tablet_hint reference
replica: split add_column_family_and_make_directory into steps
replica: db: split drop_table into steps
db: don't move map references in merge_tables_and_views()
db: introduce commit_on_shard function
db: access types during schema merge via special storage
replica: make non-preemptive keyspace create/update/delete functions public
replica: split update keyspace into two phases
replica: split creating keyspace into two functions
db: rename create_keyspace_from_schema_partition
db: decouple functions and aggregates schema change notification from merging code
db: store functions and aggregates change batch in schema_applier
db: decouple tables and views schema change notifications from merging code
db: store tables and views schema diff in schema_applier
db: decouple user type schema change notifications from types merging code
service: unify keyspace notification functions arguments
db: replica: decouple keyspace schema change notifications to a separate function
db: add class encapsulating schema merging
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.
Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.
Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.
Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.
Fixes: #23911.
Closesscylladb/scylladb#24582
Since we abort now on failure during schema commit
there is no need for cleanup as it only manages in-memory
state.
Explicit cf.stop was added to code paths outside of schema
merging to avoid unnecessary regressions.
Now all keyspace related schema changes are observable
on given shard as they would be applied atomically.
This is achieved by commit_on_shard() function being
non-preemptive (no futures, no co_awaits).
In the future we'll extend this to the whole schema
and also other subsystems.
In this commit we make use of splitted functions introduced before.
Pattern is as follows:
- in merge_tables_and_views we call some preparatory functions
- in schema_applier::update we call non-yielding step
- in schema_applier::post_commit we call cleanups and other finalizing async
functions
Additionally we introduce frozen_schema_diff because converting
schema_ptr to global_schema_ptr triggers schema registration and
with atomic changes we need to place registration only in commit
phase. Schema freezing is the same method global_schema_ptr uses
to transport schema across shards (via schema_registry cache).