scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	20fe1e6f68	replica: Improve diagnostics when tablet split fails due to non-empty split-unready groups When finalizing a tablet split, all data must have been moved into split-ready compaction groups before the storage groups can be remapped to the new tablet count. If split-unready groups still hold data at that point, handle_tablet_split_completion() calls on_internal_error(), which previously only reported the tablet and table IDs — giving no insight into why the split-unready groups were not empty. Add fmt::formatter specializations for compaction_group and storage_group so the full state of the offending storage_group is included in the error message. The storage_group formatter emits: main=<cg>, merging=[<cg>...], split_ready=[<cg>...] Each compaction_group formatter emits: [sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>] where sstable_desc includes filename, origin, identifier and originating host, memtable_empty reflects whether all memtables have been flushed, and sstable_add_gate count reveals whether an in-flight sstable add is holding data in the group. Supporting changes: - compaction_group: add memtable_empty() const noexcept (delegates to memtable_list::empty()) and a const overload of sstable_add_gate() so both are accessible from a const compaction_group reference inside the formatter. - Promote sstable_desc from a local lambda in compaction_group_for_sstable to a static free function so it is reusable by the formatter. Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29178	2026-05-11 16:59:05 +03:00
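The output layout described in the commit above can be sketched with plain string building. This is only an illustration: `toy_compaction_group`, `toy_storage_group`, `join` and `describe` are hypothetical stand-ins, not the `fmt::formatter` specializations the commit adds, and each sstable description is reduced to a single string.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for replica::compaction_group / replica::storage_group.
struct toy_compaction_group {
    std::vector<std::string> sstables; // one description string per sstable
    bool memtable_empty;               // true when all memtables are flushed
    unsigned sstable_add_gate;         // number of in-flight sstable adds
};

struct toy_storage_group {
    toy_compaction_group main;
    std::vector<toy_compaction_group> merging;
    std::vector<toy_compaction_group> split_ready;
};

// Join the formatted items with ", ".
template <typename T, typename Fn>
std::string join(const std::vector<T>& items, Fn fmt_one) {
    std::string out;
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i) out += ", ";
        out += fmt_one(items[i]);
    }
    return out;
}

// [sstables=[...], memtable_empty=<bool>, sstable_add_gate=<count>]
std::string describe(const toy_compaction_group& cg) {
    std::ostringstream os;
    os << "[sstables=[" << join(cg.sstables, [](const std::string& s) { return s; })
       << "], memtable_empty=" << (cg.memtable_empty ? "true" : "false")
       << ", sstable_add_gate=" << cg.sstable_add_gate << "]";
    return os.str();
}

// main=<cg>, merging=[<cg>...], split_ready=[<cg>...]
std::string describe(const toy_storage_group& sg) {
    auto one = [](const toy_compaction_group& g) { return describe(g); };
    return "main=" + describe(sg.main) + ", merging=[" + join(sg.merging, one)
         + "], split_ready=[" + join(sg.split_ready, one) + "]";
}
```

A non-zero `sstable_add_gate` or `memtable_empty=false` in such a dump would point at the reason a split-unready group still holds data.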
Raphael S. Carvalho	474e962e01	compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. 
Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). 
This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. Implementation: - Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view. - Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)). - Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all compaction groups in the storage group. - Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from all compaction groups across all storage groups (needed for multi-tablet tables). 
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the repaired-only optimization is active; used by get_max_purgeable_timestamp() in compaction.cc to bypass the memtable shadow check. - is_tombstone_gc_repaired_only() private helper gates both methods: requires is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion. - Add error injection "view_update_generator_pause_before_processing" in process_staging_sstables() to support testing the staging-delay scenario. - New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view landed in the repaired set via the hints-before-snapshot path. - New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before writing T_base so D_base is staged on servers[0] via row-sync; blocks the view-update-generator with an error injection; writes T_base + T_mv; runs MV repair (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view in repaired set); asserts no resurrection; releases injection; waits for staging to complete; asserts no resurrection after a second flush+compaction. Demonstrates that the read-before-write in stream_view_replica_updates() makes the optimization safe even when staging fires after T_mv has been GC'd. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-20 16:59:09 -03:00
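The gating condition and the `gc_before` arithmetic from the commit above can be written down as a small sketch. `toy_view`, `gc_before` and `gc_eligible` are hypothetical names, and times are plain integer seconds here rather than the real clock types.

```cpp
#include <cassert>
#include <cstdint>

enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

// Hypothetical, simplified model of a compaction group view.
struct toy_view {
    bool is_repaired_view;
    tombstone_gc_mode mode;

    // The gate described above: repaired view AND tombstone_gc=repair.
    bool repaired_only() const {
        return is_repaired_view && mode == tombstone_gc_mode::repair;
    }
    // The memtable shadow check is bypassed exactly when the gate is open.
    bool skip_memtable_for_tombstone_gc() const { return repaired_only(); }
};

// gc_before = repair_time - propagation_delay
int64_t gc_before(int64_t repair_time, int64_t propagation_delay) {
    assert(propagation_delay >= 0);
    return repair_time - propagation_delay;
}

// A tombstone becomes GC-eligible once its deletion_time < gc_before.
bool gc_eligible(int64_t deletion_time, int64_t repair_time, int64_t propagation_delay) {
    return deletion_time < gc_before(repair_time, propagation_delay);
}
```

For any other view or `tombstone_gc` mode the gate stays closed, which corresponds to returning the full storage-group set.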
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
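The odd-count merge described above can be modelled on tablet sizes alone. This is a rough sketch under assumptions that are not in the commit message: `merge_sizes` is a hypothetical helper, and the isolated tablet is picked only among even indices so that every remaining tablet still has an adjacent sibling. The real selection logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Merge adjacent tablet pairs. With an odd count, one tablet is left alone
// ("isolated"); the largest candidate is chosen to limit size imbalance.
std::vector<uint64_t> merge_sizes(const std::vector<uint64_t>& sizes) {
    assert(!sizes.empty());
    std::size_t isolated = sizes.size(); // sentinel: nothing isolated for even counts
    if (sizes.size() % 2 == 1) {
        isolated = 0;
        // Assumption: only even indices keep the remaining tablets pairable.
        for (std::size_t i = 2; i < sizes.size(); i += 2) {
            if (sizes[i] > sizes[isolated]) {
                isolated = i;
            }
        }
    }
    std::vector<uint64_t> merged;
    for (std::size_t i = 0; i < sizes.size();) {
        if (i == isolated) {
            merged.push_back(sizes[i]); // survives the merge unchanged
            i += 1;
        } else {
            merged.push_back(sizes[i] + sizes[i + 1]); // sibling pair
            i += 2;
        }
    }
    return merged;
}
```

With sizes `{10, 10, 30}` the sketch isolates the 30-unit tablet and yields `{20, 30}`, whereas always isolating the last tablet of `{30, 10, 10}` would produce `{40, 10}`.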
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	22949bae52	Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine. * tablet merge simply merges the histograms of segments of one compaction group with another. * for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups. * for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it. * we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range. Refs SCYLLADB-770 no backport - still a new and experimental feature Closes scylladb/scylladb#29207 * github.com:scylladb/scylladb: test: logstor: additional logstor tests docs/dev: add logstor on-disk format section logstor: add version and crc to buffer header test: logstor: tablet split/merge and migration logstor: enable tablet balancing logstor: streaming of logstor segments using stream_blob logstor: add take_logstor_snapshot logstor: segment input/output stream logstor: implement compaction_group::cleanup logstor: tablet split logstor: tablet merge logstor: add compaction reenabler logstor: add segment header logstor: serialize writes to active segment replica: extend compaction_group functions for logstor replica: add compaction_group_for_logstor_segment logstor: code cleanup	2026-04-12 16:11:12 +03:00
Michael Litvak	b71762d5da	counters: reuse counter IDs by rack For counter updates, use a counter ID that is constructed from the node's rack instead of the node's host ID. A rack can have at most two active tablet replicas at a time: normally a single tablet replica, and during tablet migration two, the normal and the pending replica. Therefore we can have two unique counter IDs per rack that are reused by all replicas in the rack. We construct the counter ID from the rack UUID, which is constructed from the name "dc:rack". The pending replica uses a deterministic variation of the rack's counter ID by negating it. This improves the performance and size of counter cells by having fewer unique counter IDs and fewer counter shards in a counter cell. Previously the number of counter shards was the number of different host_ids that updated the counter, which is typically the number of nodes in the cluster and can keep growing indefinitely as nodes are replaced. With the rack-based counter ID, the number of counter shards will be at most twice the number of different racks (including removed racks, which should not be significant). Fixes SCYLLADB-356	2026-04-09 13:08:02 +02:00
Michael Litvak	78426ae31b	logstor: add take_logstor_snapshot add the function table::take_logstor_snapshot that is similar to take_storage_snapshot for sstables. given a token range, for each storage group in the range, it flushes the separator buffers and then makes a snapshot of all segments in the sg's compaction groups while disabling compaction. the segment snapshot holds a reference to the segment so that it won't be freed by compaction, and it provides an input stream for reading the segment. this will be used for tablet migration to stream the segments.	2026-03-31 18:45:08 +02:00
Michael Litvak	5de39afc24	logstor: tablet merge implement tablet merge with logstor. disable compaction for the new compaction group, then merge the merging compaction groups by merging their logstor segments set into the new cg - simply merging the segment histogram.	2026-03-31 18:40:57 +02:00
Michael Litvak	d3db967802	replica: add compaction_group_for_logstor_segment add the function table::compaction_group_for_logstor_segment that we use when recovering a segment to find the compaction group for a segment based on its token range, similarly to compaction_group_for_sstable for sstables. extract the common logic from compaction_group_for_sstable to a common function compaction_group_for_token_range that finds a compaction group for a token range.	2026-03-31 18:40:56 +02:00
Botond Dénes	5573c3b18e	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and the compaction group merge backlog will run away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. The lock order is as follows. Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] The merge completion fiber will try to stop cg0'.main, which will be blocked on the compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding the old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation: it only stops compaction groups, so it may wait for active requests, but it shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest vulnerable version due to `f9021777d8`. 
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier()	2026-03-20 09:05:52 +02:00
Avi Kivity	6b259babeb	Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables. Main flows and components: * The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks. * The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable. * On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO. * On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record. * We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage. * The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments. * Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group. Currently this mode is experimental and requires an experimental flag to be enabled. Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl. 
to use, add to config: ``` enable_logstor: true experimental_features: - logstor ``` create a table: ``` CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor'; ``` INSERT, SELECT, DELETE work as expected UPDATE not supported yet no backport - new feature Closes scylladb/scylladb#28706 * github.com:scylladb/scylladb: logstor: trigger separator flush for buffers that hold old segments docs/dev: add logstor documentation logstor: recover segments into compaction groups logstor: range read logstor: change index to btree by token per table logstor: move segments to replica::compaction_group db: update dirty mem limits dynamically logstor: track memory usage logstor: logstor stats api logstor: compaction buffer pool logstor: separator: flush buffer when full logstor: hold segment until index updates logstor: truncate table logstor: enable/disable compaction per table logstor: separator buffer pool test: logstor: add separator and compaction tests logstor: segment and separator barrier logstor: separator debt controller logstor: compaction controller logstor: recovery: recover mixed segments using separator logstor: wait for pending reads in compaction logstor: separator logstor: compaction groups logstor: cache files for read logstor: recovery: initial logstor: add segment generation logstor: reserve segments for compaction logstor: index: buckets logstor: add buffer header logstor: add group_id logstor: record generation logstor: generation utility logstor: use RIPEMD-160 for index key test: add test_logstor.py api: add logstor compaction trigger endpoint replica: add logstor to db schema: add logstor cf property logstor: initial commit db: disable tablet balancing with logstor db: add logstor experimental feature flag	2026-03-20 00:18:09 +02:00
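The compaction step described above (take segments with low utilization, rewrite their live records, free the old segments) starts with choosing victims. A minimal sketch of that choice follows. `toy_segment`, `pick_for_compaction` and the utilization threshold are hypothetical, and a sort stands in for the usage histogram the commit mentions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t segment_size = 128 * 1024; // 128k segments, as described above

struct toy_segment {
    uint32_t id;
    uint32_t live_bytes; // shrinks when a record in this segment is overwritten
};

// Return the ids of segments whose utilization is below max_utilization,
// least-utilized first, capped at max_count.
std::vector<uint32_t> pick_for_compaction(std::vector<toy_segment> segs,
                                          double max_utilization,
                                          std::size_t max_count) {
    assert(max_utilization >= 0.0 && max_utilization <= 1.0);
    std::sort(segs.begin(), segs.end(),
              [](const toy_segment& a, const toy_segment& b) {
                  return a.live_bytes < b.live_bytes;
              });
    std::vector<uint32_t> picked;
    for (const auto& s : segs) {
        if (picked.size() == max_count) {
            break;
        }
        if (static_cast<double>(s.live_bytes) / segment_size < max_utilization) {
            picked.push_back(s.id);
        }
    }
    return picked;
}
```

Picking the emptiest segments first means each rewritten segment frees the most space for the least copying.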
Botond Dénes	4981e72607	Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski `storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id. The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup. This series fixes that by: 1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed. 2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active. 3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand. Improvements. No backport is required. Closes scylladb/scylladb#28963 * github.com:scylladb/scylladb: replica/table: avoid computing token range side in storage_group_of() on hot path replica/compaction_group: add lazy select_compaction_group() overloads locator/tablets: add tablet_map::get_tablet_range_side()	2026-03-19 14:27:12 +02:00
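The idea of deferring the range-side computation can be illustrated with a toy model. Everything here is an assumption made for the sketch: tokens are unsigned 64-bit values, the tablet count is a power of two, and `toy_tablet_map` / `toy_storage_group` only borrow method names from the series, not their real signatures.

```cpp
#include <cassert>
#include <cstdint>

enum class range_side { left, right };

// Hypothetical tablet map with 2^log2_count tablets (log2_count < 63).
struct toy_tablet_map {
    unsigned log2_count;

    // Cheap path: the tablet id is the top log2_count bits of the token.
    uint64_t get_tablet_id(uint64_t token) const {
        return log2_count == 0 ? 0 : token >> (64 - log2_count);
    }

    // Extra work: the next bit says which half of the tablet owns the token.
    range_side get_tablet_range_side(uint64_t token) const {
        assert(log2_count < 63);
        return ((token >> (63 - log2_count)) & 1) ? range_side::right
                                                  : range_side::left;
    }
};

// Hypothetical storage group; compaction groups are plain ints here.
struct toy_storage_group {
    bool splitting_mode = false;
    int main_group = 0;
    int split_ready[2] = {1, 2};

    // Lazy selection: the range side is computed only in splitting mode.
    int select_compaction_group(uint64_t token, const toy_tablet_map& tmap) const {
        if (!splitting_mode) {
            return main_group;
        }
        bool right = tmap.get_tablet_range_side(token) == range_side::right;
        return split_ready[right ? 1 : 0];
    }
};
```

Outside splitting mode the lookup never touches `get_tablet_range_side()`, which is the saving the series aims for on the hot path.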
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However, it's possible, for example, that one compaction group fills more slowly than others and holds many segments. To fix this we periodically trigger a separator flush for separator buffers that hold old segments. We track the active segment sequence number and, for each separator buffer, the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
Michael Litvak	a9d0211a64	logstor: change index to btree by token per table Change the primary index to be a btree that is ordered by token, similarly to a memtable, and create an index per table instead of a single global index.	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Tomasz Grabiec	518470e89e	Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage. Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages. For tablet size collection, we should include all the tablets which are currently taking up disk space. This means: - on leaving replica, include all tablets until the `cleanup` stage - on pending replica, include all tablets starting with the `write_both_read_new` and later stages While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1 Fixes: SCYLLADB-829 Closes scylladb/scylladb#28587 * github.com:scylladb/scylladb: load_stats: add filtering for tablet sizes load_stats: move tablet filtering for table size computation load_stats: bring the comment and code in sync	2026-03-13 13:08:11 +01:00
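The two filters described above can be contrasted in a short sketch. The stage list is a simplified linear subset assumed for illustration, and "until the `cleanup` stage" is read as inclusive; neither choice is confirmed by the commit message. All names are hypothetical.

```cpp
#include <cassert>

// Simplified, linear subset of migration stages (an assumption for this sketch).
enum class stage { write_both_read_old, streaming, write_both_read_new, use_new, cleanup, end_migration };
enum class role { leaving, pending };

// Table size: count each migrating tablet exactly once.
// Leaving replica up to and including streaming, pending replica afterwards.
bool counts_for_table_size(role r, stage s) {
    return r == role::leaving ? s <= stage::streaming : s > stage::streaming;
}

// Tablet sizes for the load balancer: count whatever occupies disk space.
// Leaving replica through cleanup, pending replica from write_both_read_new on.
bool counts_for_tablet_sizes(role r, stage s) {
    return r == role::leaving ? s <= stage::cleanup
                              : s >= stage::write_both_read_new;
}

// Sanity check: the table-size filter never counts a tablet on both replicas.
void check_no_double_counting() {
    for (int i = 0; i <= static_cast<int>(stage::end_migration); ++i) {
        stage s = static_cast<stage>(i);
        assert(counts_for_table_size(role::leaving, s)
               != counts_for_table_size(role::pending, s));
    }
}
```

In the `write_both_read_new` and `use_new` stages the tablet-size filter reports the tablet on both replicas, which is intended: both copies take up disk space at that point.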
Tomasz Grabiec	7706c9e8c4	replica: table: Propagate old erm to storage group merge	2026-03-12 22:45:01 +01:00
Łukasz Paszkowski	419e9aa323	replica/compaction_group: add lazy select_compaction_group() overloads Change `storage_group::select_compaction_group()` to accept a token (and tablet_map) and compute the tablet range side only when splitting_mode() is active. Add an overload for selecting the compaction group for an sstable spanning a token range.	2026-03-09 17:59:36 +01:00
Asias He	225b10b683	repair: Fix rwlock in compaction_state and lock holder lifecycle Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge compaction_manager - Stopping 1 tasks for 1 
ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, users of the repair lock are now waited for when a compaction group is being stopped. That way, the compaction group - which controls the lifetime of the rwlock - cannot be destroyed while the lock is held. Additionally, the merge completion fiber - that might remove groups - is properly serialized with incremental repair. The issue can be reproduced consistently using a sanitize build and cannot be reproduced after the fix. Fixes #27365 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-03 21:05:15 -03:00
Ferenc Szili	d0a5a1d5d0	load_stats: move tablet filtering for table size computation This patch moves the table size tablet filtering code from a lambda in storage_service::load_stats_for_tablet_based_tables() to the code section where it will be used: tablet_storage_group_manager::table_load_stats() This is needed to better accommodate the next commit, which will add code for filtering tablets for tablet sizes.	2026-02-26 11:07:53 +01:00
Petr Gusev	a8350b274e	table: add get_max_timestamp_for_tablet Strongly consistent writes require knowing the maximum timestamp of locally applied mutations to guarantee monotonically increasing timestamps for subsequent writes. This commit adds a function that returns the maximum timestamp for a given tablet. Why it is safe to use this function with deleted cells: * Tombstones are included in memtable.get_max_timestamp() calculations. * The maximum timestamp of a memtable is used to initialize the maximum timestamp of the resulting sstable. * During compaction, a new sstable’s maximum timestamp is initialized as the maximum of the contributing sstables.	2026-01-21 14:56:00 +01:00
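The three bullet points above amount to taking a maximum over memtables and sstables. The sketch below shows that, plus how a strongly consistent write could pick a timestamp above it. `toy_tablet`, `missing_timestamp` and `next_write_timestamp` are hypothetical names for this illustration, and the function body is not the one added by the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using timestamp = int64_t;
// Hypothetical sentinel meaning "no data seen yet".
constexpr timestamp missing_timestamp = std::numeric_limits<timestamp>::min();

// Hypothetical tablet: per-memtable and per-sstable maximum timestamps.
// Tombstones are assumed to be included in these maxima already.
struct toy_tablet {
    std::vector<timestamp> memtable_max;
    std::vector<timestamp> sstable_max;
};

timestamp get_max_timestamp_for_tablet(const toy_tablet& t) {
    timestamp result = missing_timestamp;
    for (timestamp ts : t.memtable_max) result = std::max(result, ts);
    for (timestamp ts : t.sstable_max) result = std::max(result, ts);
    return result;
}

// Choose a timestamp strictly above everything applied locally.
timestamp next_write_timestamp(const toy_tablet& t, timestamp now) {
    timestamp max_seen = get_max_timestamp_for_tablet(t);
    if (max_seen == missing_timestamp) {
        return now;
    }
    assert(max_seen < std::numeric_limits<timestamp>::max());
    return std::max(now, max_seen + 1);
}
```

Because flush and compaction carry the maximum forward, the result stays a valid upper bound even after the cells that produced it have been deleted.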
Benny Halevy	5be6b80936	replica: table, storage_group, compaction_group: add needs_flush Table needs flush if not all its memtable lists are empty. To be used in the next patch for a unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:41:22 +02:00
Pavel Emelyanov	d892140655	Merge 'Reduce allocations when traversing compaction_groups' from Benny Halevy - table, storage_group: add compaction_group_count - And use to reserve vector capacity before adding an item per compaction_group - table: reduce allocations by using for_each_compaction_group rather than compaction_groups() - compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead. * Improvement, no backport needed Closes scylladb/scylladb#27479 * github.com:scylladb/scylladb: table: reduce allocations by using for_each_compaction_group rather than compaction_groups() replica: storage_group: rename compaction_groups to compaction_groups_immediate	2025-12-29 16:26:33 +03:00
Benny Halevy	0e27ee67d2	replica: storage_group: rename compaction_groups to compaction_groups_immediate To better reflect that it returns a materialized vector of compaction_group ptrs. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-24 21:19:26 +02:00
Raphael S. Carvalho	27d460758f	replica: Account for sstables being added before ACKing split We want the invariant that after ACK, all sealed sstables will be split. If check-and-attach is not atomic, this sequence is possible: 1) no split decision set. 2) Unsplit sstable is checked, no need to split, sealed. 3) split decision is set and ACKed 4) unsplit sstable is attached Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1fdc410e24	Rename maybe_split_sstable() to maybe_split_new_sstable() Since the function must only be used on new sstables, it should be renamed to something describing its usage should be restricted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Tomasz Grabiec	0e51a1f812	replica: Remove unnecessary noexcept Can potentially lead to unnecessary abort. compaction_groups() and for_each_compaction_group() can throw. Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:51:35 +01:00
Tomasz Grabiec	8b807b299e	replica: Remove noexcept from compaction_groups() functions They can throw during merge, when the number of compaction groups is higher than 3. Callers can deal with that, so we shouldn't abort.	2025-12-10 14:48:23 +01:00
Tomasz Grabiec	07ff659849	replica: Remove noexcept from storage_group::for_each_compaction_group They don't really have to be noexcept. And "action" may actually throw, leading to abort. It was observed to throw when creating memtable readers: terminate called after throwing an instance of 'utils::memory_limit_reached' what(): kill limit triggered on semaphore sl:users by permit xxx Aborting on shard 4, in scheduling group sl:users. std::terminate() at ??:0 __clang_call_terminate at main.cc:0 replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, 
seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, 
interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143 query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91 (inlined by) 
query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>) at ./replica/table.cc:3583 replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158 
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267 seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 (inlined by) std::_Function_handler<void (), 
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 Fixes #27475 Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:48:11 +01:00
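The three noexcept-removal commits above share one idea: a callback that may throw should not be invoked from a noexcept function, because the exception then ends in std::terminate() instead of failing only the operation at hand. A minimal sketch of that idea in standard C++, assuming hypothetical names (`group_list`, `try_visit_all`) that are not part of ScyllaDB:

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a storage group holding a few compaction groups.
struct group_list {
    std::vector<int> groups;

    // Deliberately NOT noexcept: an exception thrown by `action` propagates
    // to the caller. Were this declared noexcept, the same throw would call
    // std::terminate() and abort the process.
    void for_each(const std::function<void(int)>& action) const {
        for (int g : groups) {
            action(g);
        }
    }
};

// Returns true if every group was visited, false if `action` threw.
inline bool try_visit_all(const group_list& gl,
                          const std::function<void(int)>& action) {
    try {
        gl.for_each(action);
        return true;
    } catch (const std::exception&) {
        // Only this operation fails; the process keeps running.
        return false;
    }
}
```

With this shape, a caller whose callback hits a memory limit sees a failed operation rather than an aborted process.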
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint_64 bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
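The counter replacement described above can be sketched as a small aggregate that carries both sizes together. This is an illustration only: the struct name, field names, and the direction of the ratio (on-disk size divided by pre-compression size) are assumptions, not taken from the patch.

```cpp
#include <cstdint>

// Hypothetical replacement for a single bytes-on-disk counter.
struct file_size_stats {
    std::uint64_t on_disk = 0;            // actual (post-compression) size
    std::uint64_t before_compression = 0; // size if Data files were uncompressed

    // Called when an sstable is added to a set.
    file_size_stats& operator+=(const file_size_stats& o) {
        on_disk += o.on_disk;
        before_compression += o.before_compression;
        return *this;
    }

    // Called when an sstable is removed from a set.
    file_size_stats& operator-=(const file_size_stats& o) {
        on_disk -= o.on_disk;
        before_compression -= o.before_compression;
        return *this;
    }
};

// Assumed definition of the ratio: 1.0 means no compression benefit.
inline double compression_ratio(const file_size_stats& s) {
    return s.before_compression == 0
        ? 1.0
        : static_cast<double>(s.on_disk) / static_cast<double>(s.before_compression);
}
```

Keeping both counters in one struct means every place that adds or subtracts an sstable's size updates both values in a single step.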
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extends the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. Effective disk capacity of a node is computed as a sum of the sizes of all tablet replicas on a node and available disk space. This is the first change in the size based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
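The effective disk capacity formula stated above (sum of all tablet replica sizes on the node plus available disk space) can be written as a short function. The function and parameter names are hypothetical and do not reflect the locator::load_stats interface.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Effective capacity = sizes of all tablet replicas on the node
//                      + disk space still available on the node.
inline std::uint64_t effective_disk_capacity(
        const std::vector<std::uint64_t>& tablet_replica_sizes,
        std::uint64_t available_disk_space) {
    // Start the sum from the available space and add each replica size.
    return std::accumulate(tablet_replica_sizes.begin(),
                           tablet_replica_sizes.end(),
                           available_disk_space);
}
```

For example, replicas of 10, 20 and 30 bytes with 40 bytes free give an effective capacity of 100 bytes.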
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Raphael S. Carvalho	149f9d8448	replica: Fix race between drop table and merge completion handling Consider this: 1) merge finishes, wakes up fiber to merge compaction groups 2) drop table happens, which in turn invokes truncate underneath 3) merge fiber stops old groups 4) truncate disables compaction on all groups, but the ones stopped 5) truncate performs a check that compaction has been disabled on all groups, including the ones stopped 6) the check fails because groups being stopped didn't have compaction explicitly disabled on them To fix it, the check on step 6 will ignore groups that have been stopped, since those are not eligible for having compaction explicitly disabled on them. The compaction check is there, so ongoing compaction will not propagate data being truncated, but here it happens in the context of drop table which doesn't leave anything behind. Also, a group stopped is somewhat equivalent to compaction disabled on it, since the procedure to stop a group stops all ongoing compaction and eventually removes its state from compaction manager. Fixes #25551. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#25563	2025-08-22 10:19:43 +03:00
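The fix to the check in step 6 can be sketched as a predicate that skips stopped groups. This assumes a hypothetical `group_state` with two flags; the real check operates on compaction manager state rather than plain booleans.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical per-group flags, for illustration only.
struct group_state {
    bool stopped;             // group was stopped, e.g. by the merge fiber
    bool compaction_disabled; // truncate explicitly disabled compaction on it
};

// The check passes when every group that is still running has compaction
// disabled. Stopped groups are ignored: they cannot have compaction
// explicitly disabled, and stopping already halts their compactions.
inline bool compaction_disabled_on_all_live_groups(
        const std::vector<group_state>& groups) {
    return std::all_of(groups.begin(), groups.end(),
                       [](const group_state& g) {
                           return g.stopped || g.compaction_disabled;
                       });
}
```

Before the fix, the equivalent predicate would have required `compaction_disabled` on every group, so a group stopped in step 3 made the check fail.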
Asias He	be15972006	compaction: Move compaction_reenabler to compaction_reenabler.hh So it can be used without bringing the whole compaction/compaction_manager.hh.	2025-08-18 11:01:22 +08:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental_repair support in compaction. - The sstables are split into repaired and unrepaired sets. - Repaired and unrepaired sets compact separately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if an sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Raphael S. Carvalho	beaaf00fac	test: Add test that compaction doesn't cross logical group boundary Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:01 +03:00
Raphael S. Carvalho	d351b0726b	replica: Introduce views in compaction_group for incremental repair Wired the unrepaired, repairing and repaired views into compaction_group. Also the repaired filter was wired, so tablet_storage_group_manager can implement the procedure to classify the sstable. Based on this classifier, we can decide which view a sstable belongs to, at any given point in time. Additionally, we made changes to compaction_group_view to return only sstables that belong to the underlying view. From this point on, repaired, repairing and unrepaired sets are connected to compaction manager through their views. And that guarantees sstables on different groups cannot be compacted together. Repairing view specifically has compaction disabled on it altogether, we can revert this later if we want, to allow repairing sstables to be compacted with one another. The benefit of this logical approach is having the classifier as the single source of truth. Otherwise, we'd need to keep the sstable location consistent with global metadata, creating complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	20c3301a1a	treewide: Futurize estimation of pending compaction tasks This is to allow futurization of compaction_group_view method that retrieves sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	af3592c658	replica: Allow compaction_group to have more than one view In order to support incremental repair, we'll allow each replica::compaction_group to have two logical compaction groups (or logical sstable sets), one for repaired, another for unrepaired. That means we have to adapt a few places to work with compaction_group_view instead, such that no logical compaction group is missed when doing table or tablet wide operations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	e78295bff1	Move backlog tracker to replica::compaction_group Since there will be only one physical sstable set, it makes sense to move backlog tracker to replica::compaction_group. With incremental repair, it still makes sense to compute backlog accounting for both logical sets, since the compound backlog influences the overall read amplification, and the total backlog across repaired and unrepaired sets can help drive decisions like giving up on incremental repair when unrepaired set is almost as large as the repaired set, causing an amplification of 2. Also it's needed for correctness because a sstable can move quickly across the logical sets, and having one tracker for each logical set could cause the sstable to not be erased in the old set it belonged to. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Aleksandra Martyniuk	2ec54d4f1a	replica: hold compaction group gate during flush Destructor of database_sstable_write_monitor, which is created in table::try_flush_memtable_to_sstable, tries to get the compaction state of the processed compaction group. If at this point the compaction group is already stopped (and the compaction state is removed), e.g. due to concurrent tablet merge, an exception is thrown and a node coredumps. Add flush gate to compaction group to wait for flushes in compaction_group::stop. Hold the gate in seal function in table::make_memtable_list. seal function is turned into a coroutine to ensure it won't throw. Wait until async_gate is closed before flushing, to ensure that all data is written into sstables. Stop ongoing compactions beforehand. Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber. Stop method already flushes the compaction group. Fixes: #23911. Closes scylladb/scylladb#24582	2025-07-13 12:35:19 +03:00
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointer to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Raphael S. Carvalho	fedd838b9d	replica: Fix race of some operations like cleanup with snapshot There are two semaphores in table for synchronizing changes to sstable list: sstable_set_mutation_sem: used to serialize two concurrent operations updating the list, to prevent them from racing with each other. sstable_deletion_sem: A deletion guard, used to serialize deletion and iteration over the list, to prevent iteration from finding deleted files on disk. they're always taken in this order to avoid deadlocks: sstable_set_mutation_sem -> sstable_deletion_sem. problem: A = tablet cleanup B = take_snapshot() 1) A acquires sstable_set_mutation_sem for updating list 2) A acquires sstable_deletion_sem, then delete sstable before updating list 3) A releases sstable_deletion_sem, then yield 4) B acquires sstable_deletion_sem 5) B iterates through list and bumps sstable deleted in step 2 6) B fails since it cannot find the file on disk Initial reaction is to say that no procedure must delete sstable before updating the list, that's true. But we want a iteration, running concurrently to cleanup, to not find sstables being removed from the system. Otherwise, e.g. snapshot works with sstables of a tablet that was just cleaned up. That's achieved by serializing iteration with list update. Since sstable_deletion_sem is used within the scope of deletion only, it's useless for achieving this. Cleanup could acquire the deletion sem when preparing list updates, and then pass the "permit" to deletion function, but then sstable_deletion_sem would essentially become sstable_set_mutation_sem, which was created exactly to protect the list update. That being said, it makes sense to merge both semaphores. Also things become easier to reason about, and we don't have to worry about deadlocks anymore. The deletion goes through sstable_list_builder, which holds a permit throughout its lifetime, which guarantees that list updates and deletion are atomic to other concurrent operations. 
The interface becomes less error prone with that. It allowed us to find that discard_sstables() was doing deletion without any permit, meaning another race could happen between truncate and snapshot. So we're fixing the race of (truncate|cleanup) with take_snapshot, as far as we know. It's possible other unknown races are fixed as well. Fixes #23049. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23117	2025-03-06 11:00:48 +02:00
Kefu Chai	57b14220ce	tree: remove unused "#include"s these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`, use `using seastar::future` to include `future` in the global namespace, this makes `readers/mutation_reader.hh` a header exposing `future<>`, but this is not a good practice, because, unlike `seastarx.hh` or `seastar/core/future.hh`, `reader/mutation_reader.hh` is not responsible for exposing seastar declarations. so, we trade the using statement for `#include "seastarx.hh"` in that file to decouple the source files including it from this header because of this statement. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22439	2025-01-28 14:12:06 +03:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check split and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: return status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Botond Dénes	55963f8f79	replica: remove noexcept from token -> tablet resolution path The methods to resolve a key/token/range to a table are all noexcept. Yet the method below all of these, `storage_group_for_id()` can throw. This means that if due to any mistake a tablet without local replica is attempted to be looked up, it will result in a crash, as the exception bubbles up into the noexcept methods. There is no value in pretending that looking up the tablet replica is noexcept, remove the noexcept specifiers so that any bad lookup only fails the operation at hand and doesn't crash the node. This is especially relevant to replace, which still has a window where writes can arrive for tablets that don't (yet) have a local replica. Currently, this results in a crash. After this patch, this will only fail the writes and the replace can move on. Fixes: #21480 Closes scylladb/scylladb#22251	2025-01-17 11:24:09 +03:00


131 Commits