When finalizing a tablet split, all data must have been moved into
split-ready compaction groups before the storage groups can be remapped
to the new tablet count. If split-unready groups still hold data at that
point, handle_tablet_split_completion() calls on_internal_error(), which
previously only reported the tablet and table IDs — giving no insight
into why the split-unready groups were not empty.
Add fmt::formatter specializations for compaction_group and storage_group
so the full state of the offending storage_group is included in the error
message. The storage_group formatter emits:
main=<cg>, merging=[<cg>...], split_ready=[<cg>...]
Each compaction_group formatter emits:
[sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>]
where sstable_desc includes filename, origin, identifier and originating
host, memtable_empty reflects whether all memtables have been flushed,
and sstable_add_gate count reveals whether an in-flight sstable add is
holding data in the group.
Supporting changes:
- compaction_group: add memtable_empty() const noexcept (delegates to
memtable_list::empty()) and a const overload of sstable_add_gate()
so both are accessible from a const compaction_group reference inside
the formatter.
- Promote sstable_desc from a local lambda in compaction_group_for_sstable
to a static free function so it is reusable by the formatter.
Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29178
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
repaired-only optimization is active; used by get_max_purgeable_timestamp() in
compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
writing T_base so D_base is staged on servers[0] via row-sync; blocks the
view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
(fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
in repaired set); asserts no resurrection; releases injection; waits for staging to
complete; asserts no resurrection after a second flush+compaction. Demonstrates that
the read-before-write in stream_view_replica_updates() makes the optimization safe even
when staging fires after T_mv has been GC'd.
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.
* tablet merge simply merges the histograms of segments of one compaction group with another.
* for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups.
* for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range.
Refs SCYLLADB-770
no backport - still a new and experimental feature
Closesscylladb/scylladb#29207
* github.com:scylladb/scylladb:
test: logstor: additional logstor tests
docs/dev: add logstor on-disk format section
logstor: add version and crc to buffer header
test: logstor: tablet split/merge and migration
logstor: enable tablet balancing
logstor: streaming of logstor segments using stream_blob
logstor: add take_logstor_snapshot
logstor: segment input/output stream
logstor: implement compaction_group::cleanup
logstor: tablet split
logstor: tablet merge
logstor: add compaction reenabler
logstor: add segment header
logstor: serialize writes to active segment
replica: extend compaction_group functions for logstor
replica: add compaction_group_for_logstor_segment
logstor: code cleanup
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
add the function table::take_logstor_snapshot that is similar to
take_storage_snapshot for sstables.
given a token range, for each storage group in the range, it flushes the
separator buffers and then makes a snapshot of all segments in the sg's
compaction groups while disabling compaction.
the segment snapshot holds a reference to the segment so that it won't
be freed by compaction, and it provides an input stream for reading the
segment.
this will be used for tablet migration to stream the segments.
implement tablet merge with logstor.
disable compaction for the new compaction group, then merge the merging
compaction groups by merging their logstor segments set into the new cg
- simply merging the segment histogram.
add the function table::compaction_group_for_logstor_segment that we use
when recovering a segment to find the compaction group for a segment
based on its token range, similarly to compaction_group_for_sstable for
sstables.
extract the common logic from compaction_group_for_sstable to a common
function compaction_group_for_token_range that finds a compaction group
for a token range.
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.
Main flows and components:
* The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.
to use, add to config:
```
enable_logstor: true
experimental_features:
- logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected
UPDATE not supported yet
no backport - new feature
Closesscylladb/scylladb#28706
* github.com:scylladb/scylladb:
logstor: trigger separator flush for buffers that hold old segments
docs/dev: add logstor documentation
logstor: recover segments into compaction groups
logstor: range read
logstor: change index to btree by token per table
logstor: move segments to replica::compaction_group
db: update dirty mem limits dynamically
logstor: track memory usage
logstor: logstor stats api
logstor: compaction buffer pool
logstor: separator: flush buffer when full
logstor: hold segment until index updates
logstor: truncate table
logstor: enable/disable compaction per table
logstor: separator buffer pool
test: logstor: add separator and compaction tests
logstor: segment and separator barrier
logstor: separator debt controller
logstor: compaction controller
logstor: recovery: recover mixed segments using separator
logstor: wait for pending reads in compaction
logstor: separator
logstor: compaction groups
logstor: cache files for read
logstor: recovery: initial
logstor: add segment generation
logstor: reserve segments for compaction
logstor: index: buckets
logstor: add buffer header
logstor: add group_id
logstor: record generation
logstor: generation utility
logstor: use RIPEMD-160 for index key
test: add test_logstor.py
api: add logstor compaction trigger endpoint
replica: add logstor to db
schema: add logstor cf property
logstor: initial commit
db: disable tablet balancing with logstor
db: add logstor experimental feature flag
`storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id.
The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup.
This series fixes that by:
1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed.
2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active.
3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand.
Improvements. No backport is required.
Closesscylladb/scylladb#28963
* github.com:scylladb/scylladb:
replica/table: avoid computing token range side in storage_group_of() on hot path
replica/compaction_group: add lazy select_compaction_group() overloads
locator/tablets: add tablet_map::get_tablet_range_side()
A compaction group has a separator buffer that holds the mixed segments
alive until the separator buffer is flushed. A mixed segment can be
freed only after all separator buffers that hold writes from the segment
are flushed.
Typically a separator buffer is flushed when it becomes full. However
it's possible for example that one compaction groups is filled slower
than others and holds many segments.
To fix this we trigger a separator flush periodically for separator
buffers that hold old segments. We track the active segment sequence
number and for each separator buffer the oldest sequence number it
holds.
Change the primary index to be a btree that is ordered by token,
similarly to a memtable, and create a index per-table instead of a
single global index.
Add a segment_set member to replica::compaction_group that manages the
logstor segments that belong to the compaction group, similarly to how
it manages sstables. Add also a separator buffer in each compaction
group.
When writing a mutation to a compaction group, the mutation is written
to the active segment and to the separator buffer of the compaction
group, and when the separator buffer is flushed the segment is added to
the compaction_group's segment set.
When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage.
Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages.
For tablet size collection, we should include all the tablets which are currently taking up disk space. This means:
- on leaving replica, include all tablets until the `cleanup` stage
- on pending replica, include all tablets starting with the `write_both_read_new` and later stages
While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1
Fixes: SCYLLADB-829
Closesscylladb/scylladb#28587
* github.com:scylladb/scylladb:
load_stats: add filtering for tablet sizes
load_stats: move tablet filtering for table size computation
load_stats: bring the comment and code in sync
Change `storage_group::select_compaction_group()` to accept a token
(and tablet_map) and compute the tablet range side only when
splitting_mode() is active.
Add an overload for selecting the compaction group for an sstable
spanning a token range.
Consider this:
- repair takes the lock holder
- tablet merge filber destories the compaction group and the compaction state
- repair fails
- repair destroy the lock holder
This is observed in the test:
```
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28]
...
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair
repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551]
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair
repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found)
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge
....
scylla[10793] Segmentation fault on shard 28, in scheduling group streaming
```
The rwlock in compaction_state could be destroyed before the lock holder
of the rwlock is destroyed. This causes user after free when the lock
the holder is destroyed.
To fix it, users of repair lock will now be waited when a compaction
group is being stopped.
That way, compaction group - which controls the lifetime of rwlock -
cannot be destroyed while the lock is held.
Additionally, the merge completion fiber - that might remove groups -
is properly serialized with incremental repair.
The issue can be reproduced using sanitize build consistently and can not
be reproduced after the fix.
Fixes#27365
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This patch moves the table size tablet filtering code from a lambda in
storage_service::load_stats_for_tablet_based_tables() to the code
section where it will be used:
tablet_storage_group_manager::table_load_stats()
This is needed to better accomodate the next commit which will add
code for filtering tablets for tablet sizes.
Strongly consistent writes require knowing the maximum timestamp of
locally applied mutations to guarantee monotonically increasing
timestamps for subsequent writes.
This commit adds a function that returns the maximum timestamp for a
given tablet.
Why it is safe to use this function with deleted cells:
* Tombstones are included in memtable.get_max_timestamp() calculations.
* The maximum timestamp of a memtable is used to initialize the maximum
timestamp of the resulting sstable.
* During compaction, a new sstable’s maximum timestamp is initialized as
the maximum of the contributing sstables.
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
- table, storage_group: add compaction_group_count
- And use to reserve vector capacity before adding an item per compaction_group
- table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
- compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead.
* Improvement, no backport needed
Closesscylladb/scylladb#27479
* github.com:scylladb/scylladb:
table: reduce allocations by using for_each_compaction_group rather than compaction_groups()
replica: storage_group: rename compaction_groups to compaction_groups_immediate
We want the invariant that after ACK, all sealed sstables will be split.
If check-and-attach is not atomic, this sequence is possible:
1) no split decision set.
2) Unsplit sstable is checked, no need to split, sealed.
3) split decision is set and ACKed
4) unsplit sstable is attached
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since the function must only be used on new sstables, it should
be renamed to something describing its usage should be restricted.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Can potentially lead to unnecessary abort.
compaction_groups() and for_each_compaction_group() can throw.
Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
Every table and sstable set keeps track of the total file size
of contained sstables.
Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.
We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.
This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.
This is the first change in the size based load balancing series.
Closesscylladb/scylladb#26035
Some files in compaction/ have using namespace {compaction,sstables}
clauses, some even in headers. This is considered bad practice and
muddies the namespace use. Remove them.
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.
Fixes#25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#25563
This patch addes incremental_repair support in compaction.
- The sstables are split into repaired and unrepaired set.
- Repaired and unrepaired set compact sperately.
- The repaired_at from sstable and sstables_repaired_at from
system.tablets table are used to decide if a sstable is repaired or
not.
- Different compactions tasks, e.g., minor, major, scrub, split, are
serialized with tablet repair.
Wired the unrepaired, repairing and repaired views into compaction_group.
Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.
Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.
Additionally, we made changes changes to compaction_group_view
to return only sstables that belong to the underlying view.
From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.
The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistest with global metadata, creating
complexity
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to support incremental repair, we'll allow each
replica::compaction_group to have two logical compaction groups
(or logical sstable sets), one for repaired, another for unrepaired.
That means we have to adapt a few places to work with
compaction_group_view instead, such that no logical compaction
group is missed when doing table or tablet wide operations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since there will be only one physical sstable set, it makes sense to move
backlog tracker to replica::compaction_group. With incremental repair,
it still makes sense to compute backlog accounting both logical sets,
since the compound backlog influences the overall read amplification,
and the total backlog across repaired and unrepaired sets can help
driving decisions like giving up on incremental repair when unrepaired
set is almost as large as the repaired set, causing an amplification
of 2.
Also it's needed for correctness because a sstable can move quickly
across the logical sets, and having one tracker for each logical
set could cause the sstable to not be erased in the old set it
belonged to;
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Destructor of database_sstable_write_monitor, which is created
in table::try_flush_memtable_to_sstable, tries to get the compaction
state of the processed compaction group. If at this point
the compaction group is already stopped (and the compaction state
is removed), e.g. due to concurrent tablet merge, an exception is
thrown and a node coredumps.
Add flush gate to compaction group to wait for flushes in
compaction_group::stop. Hold the gate in seal function in
table::make_memtable_list. seal function is turned into
a coroutine to ensure it won't throw.
Wait until async_gate is closed before flushing, to ensure that
all data is written into sstables. Stop ongoing compactions
beforehand.
Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber.
Stop method already flushes the compaction group.
Fixes: #23911.
Closesscylladb/scylladb#24582
Some background:
When merge happens, a background fiber wakes up to merge compaction
groups of sibling tablets into main one. It cannot happen when
rebuilding the storage group list, since token metadata update is
not preemptable. So a storage group, post merge, has the main
compaction group and two other groups to be merged into the main.
When the merge happens, those two groups are empty and will be
freed.
Consider this scenario:
1) merge happens, from 2 to 1 tablet
2) produces a single storage group, containing main and two
other compaction groups to be merged into main.
3) take_storage_snapshot(), triggered by migration post merge,
gets a list of pointer to all compaction groups.
4) t__s__s() iterates first on main group, yields.
5) background fiber wakes up, moves the data into main
and frees the two groups
6) t__s__s() advances to other groups that are now freed,
since step 5.
7) segmentation fault
In addition to memory corruption, there's also a potential for
data to escape the iteration in take_storage_snapshot(), since
data can be moved across compaction groups in background, all
belonging to the same storage group. That could result in
data loss.
Readers should all operate on storage group level since it can
provide a view on all the data owned by a tablet replica.
The movement of sstable from group A to B is atomic, but
iteration first on A, then later on B, might miss data that
was moved from B to A, before the iteration reached B.
By switching to storage group in the interface that retrieves
groups by token range, we guarantee that all data of a given
replica can be found regardless of which compaction group they
sit on.
Fixes#23162.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24058
There are two semaphores in table for synchronizing changes to sstable list:
sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.
sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.
they're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.
problem:
A = tablet cleanup
B = take_snapshot()
1) A acquires sstable_set_mutation_sem for updating list
2) A acquires sstable_deletion_sem, then delete sstable before updating list
3) A releases sstable_deletion_sem, then yield
4) B acquires sstable_deletion_sem
5) B iterates through list and bumps sstable deleted in step 2
6) B fails since it cannot find the file on disk
Initial reaction is to say that no procedure must delete sstable before
updating the list, that's true.
But we want a iteration, running concurrently to cleanup, to not find sstables
being removed from the system. Otherwise, e.g. snapshot works with sstables
of a tablet that was just cleaned up. That's achieved by serializing iteration
with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.
That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.
The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find discard_sstables() was doing deletion without any permit,
meaning another race could happen between truncate and snapshot.
So we're fixing race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible another unknown races are fixed as well.
Fixes#23049.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23117
these unused includes were identified by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`,
use `using seastar::future` to include `future` in the global namespace,
this makes `readers/mutation_reader.hh` a header exposing `future<>`,
but this is not a good practice, because, unlike `seastarx.hh` or
`seastar/core/future.hh`, `reader/mutation_reader.hh` is not
responsible for exposing seastar declarations. so, we trade the
using statement for `#include "seastarx.hh"` in that file to decouple
the source files including it from this header because of this statement.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22439
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).
Users can see running resize tasks - finished tasks are not
presented with the task manager API.
A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.
Fixes: #21366.
Fixes: #21367.
No backport, new feature
Closesscylladb/scylladb#21891
* github.com:scylladb/scylladb:
test: boost: check resize_task_info in tablet_test.cc
test: add tests to check revoked resize virtual tasks
test: add tests to check the list of resize virtual tasks
test: add tests to check spilt and merge virtual tasks status
test: test_tablet_tasks: generalize functions
replica: service: add split virtual task's children
replica: service: pass parent info down to storage_group::split
tasks: children of virtual tasks aren't internal by default
tasks: initialize shard in task_info ctor
service: extend tablet_virtual_task::abort
service: retrun status_helper struct from tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::wait
tasks: add suspended task state
service: extend tablet_virtual_task::get_status
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: add service::task_manager_module::get_nodes
tasks: add task_manager::get_nodes
tasks: drop noexcept from module::get_nodes
replica: service: add resize_task_info static column to system.tablets
locator: extend tablet_task_info to cover resize tasks
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.
Fixes: #21480Closesscylladb/scylladb#22251