scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	ee87b66033	replica: Demote log level on split failure during shutdown Dtest failed with: table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db of origin memtable due to std::runtime_error (Cannot split .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction disabled, reason might be out of space prevention), it will be unlinked... The reason is that the error above is being triggered when the cause is shutdown, not out of space prevention. Let's distinguish between the two cases and log the error with warning level on shutdown. Fixes https://github.com/scylladb/scylladb/issues/24850. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-16 12:03:17 -03:00
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases were introduced to verify expected behaviour. Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest - impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, the sstable will be automatically deleted on node startup due to temporary TOC persistence.
Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
Botond Dénes	6364e35403	replica/table: add get_tombstone_gc_state() Shorthand for get_compaction_manager().get_shared_tombstone_gc_state().get_tombstone_gc_state().	2026-03-03 14:09:28 +02:00
Botond Dénes	f3ee6a0bd1	compaction: use tombstone_gc_state with value semantics Instead of passing around references to it, pass around values. This object is now designed to be used as a value-type, after recent refactoring.	2026-03-03 14:09:27 +02:00
Taras Veretilnyk	51c345aaf6	sstables: add new rewrite component mechanism for safe sstable component rewriting Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup. However, this approach won't work once component digests are stored in scylla_metadata, as replacing a component like Statistics will require atomically updating both the component and scylla_metadata with the new digest—impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it. - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.	2026-02-26 22:38:55 +01:00
Raphael S. Carvalho	992bfb9f63	compaction: Fail split of new sstable if manager is disabled If manager has been disabled due to out of space prevention, it's important to throw an exception rather than silently not splitting the new sstable. Not splitting a sstable when needed can cause correctness issue when finalizing split later. It's better to fail the writer (e.g. repair one) which will be retried than making caller think everything succeeded. The new replica::table::add_new_sstable_and_update_cache() will now unlink the new sstable on failure, so the table dir will not be left with sstables not loaded. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	1fdc410e24	Rename maybe_split_sstable() to maybe_split_new_sstable() Since the function must only be used on new sstables, it should be renamed to something describing its usage should be restricted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar	468b800e89	compaction_manager:config: introduce max_shares Introduce an updateable value `max_shares` to compaction manager's config. Also add a method `update_max_shares()` that applies the latest `max_shares` value to the compaction controller’s `max_shares`. This new variable will be connected to a config parameter in the next patch. Refs #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:38 -03:00
Lakshmi Narayanan Sreethar	4d442f48db	compaction/compaction_descriptor: introduce compaction_type::Major Introduce a new compaction_type enum : `Major`. This type will be used by the next patches to differentiate between major compaction and regular compaction (compaction_type::Compaction). Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-29 19:21:53 +05:30
Botond Dénes	9c85046f93	sstables,compaction: move compaction exceptions to compaction/ sstables/exceptions.hh still hosts some compaction specific exception types. Move them over to the new compaction/exceptions.hh, to make the compaction module more self-contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables There are cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Pavel Emelyanov	d69a51f42a	compaction: Use function when filtering compaction tasks for stopping The compaction_manager::stop_compaction() method internally walks the list of tasks and compares each task's compacting_table (which is compaction group view pointer) with the given one. In case this stop_compaction() method is called via API for a specific table, the method walks the list of tasks for every compaction group from the table, thus resulting in nr_groups * nr_tasks complexity. Not terrible, but not nice either. The proposal is to pass filtering function into the inner do_stop_ongoing_compactions() method. Some users will pass a simple "return true" lambda, but those that need to stop compactions for a specific table (e.g. -- the API handler) will effectively walk the list of tasks once comparing the given compaction group's schema with the target table one (spoiler: eventually this place will also be simplified not to mess with replica::table at all). One ugliness with the change is the way "scope" for logging message is collected. If all tasks belong to the same table, then "for table ..." is printed in logs. With the change the scope is no longer known instantly and is evaluated dynamically while walking the list of tasks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25846	2025-09-16 23:40:47 +03:00
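The single-pass filtering described in d69a51f42a can be sketched roughly as below. This is a minimal illustration, not the actual ScyllaDB code: `compaction_task`, `table_name`, and `stop_matching_tasks` are hypothetical names, and a plain `std::vector` stands in for the manager's task list.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a compaction task; not ScyllaDB's real type.
struct compaction_task {
    std::string table_name; // table the task's compaction group belongs to
    bool stopped = false;
};

// Walks the task list exactly once and stops every task accepted by the
// caller-supplied filter. Returns the number of tasks stopped.
std::size_t stop_matching_tasks(std::vector<compaction_task>& tasks,
                                const std::function<bool(const compaction_task&)>& filter) {
    std::size_t stopped = 0;
    for (auto& task : tasks) {
        if (!task.stopped && filter(task)) {
            task.stopped = true;
            ++stopped;
        }
    }
    return stopped;
}
```

A caller that wants to stop everything would pass `[](const compaction_task&) { return true; }`, while a per-table caller compares `table_name` against its target, so the cost is one walk over the tasks rather than one walk per compaction group.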
Raphael S. Carvalho	b607b1c284	compaction: Fix stop of sstable cleanup The interface suggests the whole sstable cleanup is aborted with 'nodetool stop CLEANUP', but it is currently stopping only the ongoing cleanup task, and the compaction manager will retry the task since the error is not propagated all the way back to the caller. With raft topology, the coordinator should retry it though since cleanup became mandatory with automatic cleanup. So it's only fixing the usage where cleanup is issued manually. The stop exception is only propagated to the caller of cleanup. When stopping tasks during shutdown, the exception is swallowed and the error only returned to the caller. Fixes #20823. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24996	2025-09-11 08:55:10 +03:00
Botond Dénes	6116f9e11b	Merge 'Compaction tasks progress' from Aleksandra Martyniuk Determine the progress of compaction tasks that have children. The progress of a compaction task is calculated using the default get_progress method. If the expected_total_workload method is implemented, the default progress is computed as: (sum of child task progresses) / (expected total workload) If expected_total_workload is not defined, progress is estimated based on children progresses. However, in this case, the total progress may increase over time as the task executes. All compaction tasks, except for reshape tasks, implement the expected_children_number method. To compute expected_total_workload, iterate over all SSTables covered by the task and sum their sizes. Note that expected_total_workload is just an approximation and the real workload may differ if the SSTable set for the keyspace/table/compaction group changes. Reshape tasks are an exception, as their scope is determined during execution. Hence, for these tasks expected_total_workload isn't defined and their progress (both total and completed) is determined based on currently created children. Fixes: https://github.com/scylladb/scylladb/issues/8392. Fixes: https://github.com/scylladb/scylladb/issues/6406. Fixes: https://github.com/scylladb/scylladb/issues/7845.
New feature, no backport needed Closes scylladb/scylladb#15158 * github.com:scylladb/scylladb: test: add compaction task progress test compaction: set progress unit for compaction tasks compaction: find expected workload for reshard tasks compaction: find expected workload for global cleanup compaction tasks compaction: find expected workload for global major compaction tasks compaction: find expected workload for keyspace compaction tasks compaction: find expected workload for shard compaction tasks compaction: find expected workload for table compaction tasks compaction: return empty progress when compaction_size isn't set compaction: update compaction_data::compaction_size at once tasks: do not check expected workload for done task	2025-09-03 13:23:42 +03:00
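The progress rule from the merge message above, (sum of child task progresses) / (expected total workload), with a fallback to the children's own totals, could look roughly like this. The struct and function names are assumptions made for illustration and do not come from the task manager code.

```cpp
#include <optional>
#include <vector>

// Hypothetical progress pair, e.g. measured in bytes; not the real type.
struct task_progress {
    double completed = 0;
    double total = 0;
};

// Aggregates children into a parent's progress. When an expected total
// workload is known it is used as the denominator; otherwise the total is
// the sum over the children created so far, so it may grow over time
// (the reshape case described above).
task_progress parent_progress(const std::vector<task_progress>& children,
                              std::optional<double> expected_total_workload) {
    task_progress result;
    for (const auto& child : children) {
        result.completed += child.completed;
        result.total += child.total;
    }
    if (expected_total_workload) {
        result.total = *expected_total_workload;
    }
    return result;
}
```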
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Łukasz Paszkowski	40c40be8a6	compaction_manager: Replace enabled/disabled states with running state Using a single state variable to keep track whether compaction manager is enabled/disabled is insufficient, as multiple services may independently request compactions to be disabled. To address this, a counter is introduced to record how many times the compaction manager has been drained. The manager is considered enabled only when this counter reaches zero. Introducing a counter, enabled and disabled states become obsolete. So they are replaced with a single running state.	2025-08-29 13:47:01 +02:00
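The counter-based scheme from 40c40be8a6 can be illustrated with a small sketch. The class below is hypothetical (the real compaction manager is asynchronous and considerably more involved); it only shows why a counter lets several services disable compaction independently.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration of counting independent "disable" requests.
class disable_counter {
    std::size_t _disabled = 0; // outstanding disable requests
public:
    // Each service that needs compaction off takes one reference.
    void disable() noexcept { ++_disabled; }

    // Releases one reference; must pair with an earlier disable().
    void enable() noexcept {
        assert(_disabled > 0);
        --_disabled;
    }

    // Enabled only once every requester has released its reference.
    bool enabled() const noexcept { return _disabled == 0; }
};
```

With a single boolean, the first service to re-enable would also undo the other service's request; the counter avoids that.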
Aleksandra Martyniuk	926753e8bf	compaction: find expected workload for table compaction tasks Add compaction_task_impl::get_table_task_workload that sums the bytes in all sstables in the table. This function is used to find the expected workload of the following compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 10:41:22 +02:00
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must utilize an identical selection method to repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but it is less efficient because it requires reading all of the dataset and omitting data beyond the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows the entire SSTable to be omitted. This document patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, the legacy vnode is less important to support. On the other hand, the incremental repair for vnode is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This complicates the manipulation of SSTables significantly during both repair and compaction. With tablets, an entire tablet is repaired so that a sstable is either fully repaired or not repaired which is a huge simplification. This patch uses the repaired_at from sstables::statistics component to mark a sstable as repaired. It uses a virtual clock as the repair timestamp, i.e., using a monotonically increasing number for the repaired_at field of a SSTable and sstables_repaired_at column in system.tablets table. Notice that when a sstable is not repaired, the repaired_at field will be set to the default value 0. The being_repaired in memory field of a SSTable is used to explicitly mark that a SSTable is being selected.
The following variables are used for incremental repair: The repaired_at on disk field of a SSTable is used. - A 64-bit number increases sequentially The sstables_repaired_at is added to the system.tablets table. - repaired_at <= sstables_repaired_at means the sstable is repaired The being_repaired in memory field of a SSTable is added. - A repair UUID tells which sstable has participated in the repair Initial test results: 1) Medium dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~500GB Cluster pre-populated with ~500GB of data before starting repairs job. Results for Repair Timings: The regular repair run took 210 mins. Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s The speedup is: 183 mins / 48s = 228X 2) Small dataset results Node amount: 3 Instance type: i4i.2xlarge Disk usage per node: ~167GB Cluster pre-populated with ~167GB of data before starting the repairs job. Regular repair 1st run took 110s, 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X 3) Large dataset results Node amount: 6 Instance type: i4i.2xlarge, 3 racks 50% of base load, 50% read/write Dataset == Sum of data on each node Dataset Non-incremental repair (minutes) 1.3 TiB 31:07 3.5 TiB 25:10 5.0 TiB 19:03 6.3 TiB 31:42 Dataset Incremental repair (minutes) 1.3 TiB 24:32 3.0 TiB 13:06 4.0 TiB 5:23 4.8 TiB 7:14 5.6 TiB 3:58 6.3 TiB 7:33 7.0 TiB 6:55 Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
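The classification rule stated in the merge message (repaired_at <= sstables_repaired_at means the sstable is repaired, with 0 as the not-repaired default) can be written as a one-line predicate. This is a sketch: the function name is hypothetical, and treating 0 as an explicit "never repaired" check is an assumption derived from the stated default rather than something the message spells out.

```cpp
#include <cstdint>

// Hypothetical helper: decides whether an sstable counts as repaired.
// repaired_at comes from the sstable's statistics component;
// sstables_repaired_at comes from the tablet's row in system.tablets.
bool is_repaired(std::int64_t repaired_at, std::int64_t sstables_repaired_at) {
    // Assumption: 0 is the "not repaired" default, so it never qualifies.
    return repaired_at != 0 && repaired_at <= sstables_repaired_at;
}
```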
Asias He	be15972006	compaction: Move compaction_reenabler to compaction_reenabler.hh So it can be used without bringing the whole compaction/compaction_manager.hh.	2025-08-18 11:01:22 +08:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental_repair support in compaction. - The sstables are split into repaired and unrepaired sets. - Repaired and unrepaired sets compact separately. - The repaired_at from sstable and sstables_repaired_at from system.tablets table are used to decide if a sstable is repaired or not. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Raphael S. Carvalho	61cb02f580	compaction: Allow view to be added with compaction disabled Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	9d3755f276	replica: Futurize retrieval of sstable sets in compaction_group_view This will allow upcoming work to gently produce a sstable set for each compaction group view. Example: repaired and unrepaired. Locking strategy for compaction's sstable selection: Since sstable retrieval path became futurized, tasks in compaction manager will now hold the write lock (compaction_state::lock) when retrieving the sstable list, feeding them into compaction strategy, and finally registering selected sstables as compacting. The last step prevents another concurrent task from picking the same sstable. Previously, all those steps were atomic, but we have seen stall in that area in large installations, so futurization of that area would come sooner or later. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	e78295bff1	Move backlog tracker to replica::compaction_group Since there will be only one physical sstable set, it makes sense to move backlog tracker to replica::compaction_group. With incremental repair, it still makes sense to compute backlog accounting both logical sets, since the compound backlog influences the overall read amplification, and the total backlog across repaired and unrepaired sets can help driving decisions like giving up on incremental repair when unrepaired set is almost as large as the repaired set, causing an amplification of 2. Also it's needed for correctness because a sstable can move quickly across the logical sets, and having one tracker for each logical set could cause the sstable to not be erased in the old set it belonged to; Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it as so. With upcoming incremental repair, each replica::compaction_group will be actually two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Pavel Emelyanov	2df1945f2a	compaction: Pass "reason" to perform_task_on_all_files() This tells "cleanup", "rewrite" and "split" reasons from each other Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:53:10 +03:00
Pavel Emelyanov	08c8c03a20	compaction: Pass "reason" to run_with_compaction_disabled() This tells "cleanup" (done via try_perform_cleanup) and prepares the ground for more callers (see next patch) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:52:09 +03:00
Pavel Emelyanov	db46da45d2	compaction: Pass "reason" to stop_and_disable_compaction() This tells "truncate" operation from other reasons Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:51:16 +03:00
Łukasz Paszkowski	dc6f8881b8	system_keyspace: Extract compaction_history struct Move the compaction_history_entry struct to a separate file. The intent of this change is to later re-use it in scylla-nodetool as it currently defines its own structure that is very similar.	2025-05-14 08:31:40 +02:00
Łukasz Paszkowski	342e9a3f5c	compaction/compaction_manager: update_history accepts compaction_result as rvalue The compaction_result struct holding compaction's results and statistics is obtained immediately before the update_history is called. Move it instead of passing a const reference.	2025-05-14 08:31:40 +02:00
Benny Halevy	fba88bdd62	database, compaction_manager, large_data_handler: use pluggable<system_keysapce> To allow safe plug and unplug of the system_keyspace. This patch follows-up on `917fdb9e53` (more specifically - `f9b57df471`) Since just keeping a shared_ptr<system_keyspace> doesn't prevent stopping the system_keyspace shards, while using the `pluggable` interface allows safe draining of outstanding async calls on shutdown, before stopping the system_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-05 08:27:23 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Kefu Chai	50fbab29ca	compaction: remove unused "#include" we don't use `std::list` in compaction/compaction_manager.hh, neither is this header responsible for exposing the declarations in `<list>`. so let's stop `#include` this header. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21436	2024-11-07 10:25:27 +03:00
Benny Halevy	c08ba8af68	compaction_manager: stop_tasks, stop_ongoing_compactions: ignore errors stop() methods, like destructors must always succeed, and returning errors from them is futile as there is nothing else we can do with them but continue with shutdown. Leaked errors on the stop path may cause termination on shutdown, when called in a deferred action destructor. Fixes scylladb/scylladb#21298 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-11-03 10:52:58 +02:00
Botond Dénes	e942c074f2	compaction/compaction_manager: make _tasks an intrusive list _tasks is currently std::list<shared_ptr<compaction_task_executor>>, but it has no role in keeping the instances alive, this is done by the fibers which create the task (and pin a shared ptr instance). This lends itself to an intrusive list, avoiding that extra allocation upon push_back(). Using an intrusive list also makes it simpler and much cheaper (O(1) vs. O(N)) to remove tasks from the _tasks list. This will be made use of in the next patch. Code using _task has to be updated because the value_type changes from shared_ptr<compaction_task_executor> to compaction_task_executor&.	2024-11-03 10:17:11 +02:00
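The O(1) removal that motivates e942c074f2 comes from storing the list links inside the element itself. The sketch below hand-rolls a tiny intrusive list to show the idea; it is not the container the commit uses, and `list_hook`, `task`, and `task_list` are hypothetical names.

```cpp
#include <cstddef>

// Links embedded in each element; an unlinked hook points at itself.
struct list_hook {
    list_hook* prev = this;
    list_hook* next = this;

    list_hook() = default;
    list_hook(const list_hook&) = delete;            // copying would corrupt links
    list_hook& operator=(const list_hook&) = delete;

    bool is_linked() const noexcept { return next != this; }

    // O(1): only the two neighbours are touched, no search is needed.
    void unlink() noexcept {
        prev->next = next;
        next->prev = prev;
        prev = next = this;
    }
};

// Hypothetical task type; the list never owns or allocates it.
struct task : list_hook {
    int id;
    explicit task(int i) : id(i) {}
};

class task_list {
    list_hook _head; // sentinel node
public:
    // No allocation: the element's own hook is spliced in before the sentinel.
    void push_back(task& t) noexcept {
        t.prev = _head.prev;
        t.next = &_head;
        _head.prev->next = &t;
        _head.prev = &t;
    }

    std::size_t size() const noexcept {
        std::size_t n = 0;
        for (const list_hook* h = _head.next; h != &_head; h = h->next) {
            ++n;
        }
        return n;
    }
};
```

Because the list does not own its elements, lifetime stays with whoever created the task, which matches the commit's observation that the fibers, not `_tasks`, keep the instances alive.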
Kamil Braun	4d99cd2055	Merge 'raft: fast tombstone GC for group0-managed tables' from Emil Maskovsky Add the gossip state for broadcasting the nodes state_id. Implemented the Group0 state broadcaster (based on the gossip) that will broadcast the state id of each node and check the minimal state id for the tombstone GC. When there is a change in the tombstone GC minimal state id, the state broadcaster will update the tombstone GC time for the group0-managed tables. The main component of the change is the newly added `group0_state_id_handler` that keeps track, broadcasts and receives the last group0 state_ids across all nodes and sets the tombstone GC deletion time accordingly: * on each group0 change applied, the state_id handler broadcasts the state_id as a gossip state (only if the value has changed) * the handler checks for the node state ids every refresh period (configurable, 1h by default) * on every check, the handler figures out the lowest state_id (timeuuid), which is state_id that all of the nodes already have * the timestamp of this minimum state_id is then used to set the tombstone GC deletion time * the tombstone GC calculation then uses that deletion time to provide the GC time back to the callers, e.g. when doing the compaction * (as the time for tombstone GC calculation has the 1s granularity we actually deduct 1s from the determined timestamp, because it can happen that there were some newer mutations received in the same second that were not distributed across the nodes yet) This change introduces a new flag to the static schema descriptor (`is_group0_table`) that is being checked for this newly added mode in the tombstone GC. We also add a check (in non-release builds only) on every group0 modification that the table has this flag set.
The group0 tombstone GC handling is similar to the "repair" tombstone GC mode in a sense (that the tombstone GC time is determined according to a reconciliation action), however it is not explicitly visible to (nor editable by) the user. And also the tombstone GC calculation is much simpler than the "repair" mode calculation - for example, we always use the whole range (as opposed to the "repair" mode that can have specific repair times set for specific ranges). We use the group0 configuration to determine the set of nodes (both current and previous in case of joint configuration) - we need to make sure that we account for all the group0 nodes (if any node didn't provide the state_id yet, the current check round will be skipped, i.e. no GC will be done until all known nodes provide their state_id timestamp value). Also note that the group0 state_id handling works on all nodes independently, i.e. each node might have its own (possibly different) state depending on the gossip application state propagation. This is however not a problem, as some nodes might be behind, but they will catch up eventually, and this solution has the benefit of being distributed (as opposed to having a central point to handle the state, like for example the topology coordinator that has been considered in the early stages of the design). Fixes: scylladb/scylla#15607 New feature, should not be backported. Closes scylladb/scylladb#20394 * github.com:scylladb/scylladb: raft: add the check for the group0 tables raft: fast tombstone GC for group0-managed tables tombstone_gc: refactor the repair map raft: flag the group0-managed tables gossip: broadcast the group0 state id raft/test: add test for the group0 tombstone GC treewide: code cleanup and refactoring	2024-10-11 11:52:27 +02:00
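The deletion-time rule described in this merge message (take the lowest state_id timestamp across all group0 nodes, subtract 1s, and skip the round if any node has not reported yet) can be sketched as follows. The function name and the use of `std::optional` to model a node that has not reported are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <chrono>
#include <optional>
#include <vector>

using seconds = std::chrono::seconds;

// Hypothetical helper. Each entry is the timestamp of one node's latest
// group0 state_id, or nullopt if that node has not broadcast one yet.
std::optional<seconds> group0_gc_time(const std::vector<std::optional<seconds>>& node_state_times) {
    if (node_state_times.empty()) {
        return std::nullopt;
    }
    seconds lowest = seconds::max();
    for (const auto& t : node_state_times) {
        if (!t) {
            return std::nullopt; // a node is missing: skip this check round
        }
        lowest = std::min(lowest, *t);
    }
    // 1s granularity: step back one second so mutations from the same
    // second that have not propagated yet are not collected.
    return lowest - seconds(1);
}
```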
Avi Kivity	bb1867c7c7	Merge 'sstables: Add digest checking in the validation path of the sstable layer' from Nikos Dragazis This PR builds upon the PR for checksum validation (#20207) to further enhance scrub's corruption detection capabilities by validating digests as well. The digest (full checksum) is the checksum over the entire data, as opposed to per-chunk checksums which apply to individual chunks. Until now, digests were not examined on any code paths. This PR integrates digest checking into the compressed/checksummed data sources as an optional feature and enables it only through the validation path of the sstable layer (`sstable::validate()`). The validation path is used by the following tools: * scrub in validate mode * `sstable validate` All other reads, including normal user reads, are unaffected by this change. The PR consists of: * Extensions to the compressed and checksummed data sources to support digest checking. The data sources receive the expected digest as a parameter and calculate the actual digest incrementally across multiple get() calls. The check happens on the get() call that reaches EOF and results in an exception if the digest is invalid. A digest check requires reading the whole file range. Therefore, a partial read or skip() is treated as an internal error. * A new shareable digest component loaded on demand by the validation code. No lifecycle management. * Grouping of old scrub/validate tests for compressed and uncompressed SSTables to reduce code duplication. * scrub/validate tests for SSTables with valid checksums but invalid digests, and SSTables with no digests at all. * scrub/validate tests with 3.x Cassandra SSTables to ensure compatibility. Refs #19058. New feature, no backport is needed.
Closes scylladb/scylladb#20720 * github.com:scylladb/scylladb: test: Test scrub/validate with SSTables from Cassandra compaction: Make quarantine optional for perform_sstable_scrub() test: Make random schema optional in scrub_test_framework test: Add tests for invalid digests test: Merge scrub/validate tests for compressed and uncompressed cases sstables: Verify digests on validation path sstables: Check if digest component exists sstables: Add digest in the SSTable components sstables: Add digest check in compressed data source sstables: Add digest check in checksummed data source	2024-10-09 21:33:08 +03:00
Emil Maskovsky	74bd79bbb3	tombstone_gc: refactor the repair map Move the repair_map definition to the tombstone_gc file where it is mostly being used. Refactor and add the accessors and setters for the group0 tombstone GC time.	2024-10-08 20:53:54 +02:00
Nikos Dragazis	7090e2597f	compaction: Make quarantine optional for perform_sstable_scrub() Allow `perform_sstable_scrub()` to disable quarantine for invalid SSTables detected by scrub in validate mode. This is already supported by the lower-level function `scrub_sstables_validate_mode()` via the flag `quarantine_sstables` and is being used by sstable-scrub. Propagate the flag up to `perform_sstable_scrub()`. This will allow testing scrub/validate against read-only SSTables from the source tree. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2024-10-07 15:21:38 +03:00
Raphael S. Carvalho	93815e0649	replica: Fix tombstone GC during tablet split preparation During split prepare phase, there will be more than 1 compaction group with overlapping token range for a given replica. Assume tablet 1 has sstable A containing deleted data, and sstable B containing a tombstone that shadows data in A. Then split starts: 1) sstable B is split first, and moved from main (unsplit) group to a split-ready group 2) now compaction runs in split-ready group before sstable A is split. Tombstone GC logic today only looks at the underlying group, so compaction in step 2 will discard the deleted data in A, since it belongs to another group (the unsplit one), and so the tombstone can be purged incorrectly. To fix it, compaction will now work with all uncompacting sstables that belong to the same replica, since tombstone GC requires all sstables that possibly contain shadowed data to be available for a correct decision to be made. Fixes #20044. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-10-02 11:26:13 -03:00
Lakshmi Narayanan Sreethar	84d06a13c7	api: compaction: add `consider_only_existing_data` option Added a new parameter `consider_only_existing_data` to major compaction API endpoints. When enabled, major compaction will: - Force-flush all tables. - Force a new active segment in the commit log. - Compact all existing SSTables and garbage-collect tombstones by only checking the SSTables being compacted. Memtables, commit logs, and other SSTables not part of the compaction will not be checked, as they will only contain newer data that arrived after the compaction started. The `consider_only_existing_data` is passed down to the compaction descriptor's `gc_check_only_compacting_sstables` option to ensure that only the existing data is considered for garbage collection. The option is also passed to the `maybe_flush_commitlog` method to make sure all the tables are flushed and a new active segment is created in the commit log. Fixes #19728 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-09-05 17:25:45 +05:30
Botond Dénes	b2c07c9b6f	Merge 'compaction: change compaction stop reason' from Aleksandra Martyniuk Currently "table removal" is logged as the reason for a compaction stop on table drop, tablet cleanup and tablet split. Modify the log to reflect the actual reason. Closes scylladb/scylladb#20042 * github.com:scylladb/scylladb: test: add test to check compaction stop log compaction: fix compaction group stop reason	2024-08-26 13:40:07 +03:00
Pavel Emelyanov	38edbebb10	compaction_manager: Keep flush-all-before-major option on own config Currently the major compaction task impl grabs this (non-updateable) value from db::config. That's not good, all services including compaction manager have their own configs from which they take options. Said that, this patch puts the said option onto compaction_manager::config, makes use of it and configures one from db::config on start (and tests). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#20174	2024-08-23 10:31:55 +03:00
Aleksandra Martyniuk	5005e19de7	compaction: fix compaction group stop reason compaction_manager::remove passes "table removal" as a reason of stopping ongoing compactions, but currently remove method is also called when a tablet is migrated or split. Pass the actual reason of compaction stop, so that logs aren't misleading.	2024-08-21 12:42:09 +02:00
Raphael S. Carvalho	239344ab55	compaction: Allow "offline" sstable to be split In order to fix the race between split and repair, we must introduce the ability to split an "offline" sstable, one that wasn't added to any of the table's sstable set yet. It's not safe to split a sstable after adding it to the set, because a failure to split can result in unsplit data left in the set, causing split to fail down the road, since the coordinator thinks this replica has only split data in the set. Refs #19378. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-12 17:27:16 -03:00
Aleksandra Martyniuk	c456a43173	compaction: replace optional<task_info> with task_info param compaction_manager::perform_compaction does not create task manager task for compaction if parent_info is set to std::nullopt. Currently, we always want to create task manager task for compaction. Remove optional from task info parameters which start compaction. Track all compactions with task manager.	2024-08-02 14:38:46 +02:00
Kefu Chai	e87b64b7bb	compaction: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-07-02 14:06:42 +08:00
Aleksandra Martyniuk	3463f495b1	tasks: fix tasks abort Currently if task_manager::task::impl::abort preempts before children are recursively aborted and then the task gets unregistered, we hit use after free since abort uses children vector which is no longer alive. Modify abort method so that it goes over all tasks in task manager and aborts those with the given parent. Fixes: #19304.	2024-06-18 13:39:29 +02:00
Kefu Chai	eb9216ef11	compaction: do not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16707	2024-01-10 11:07:36 +02:00
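
The incremental digest check described in the 'sstables: Add digest checking in the validation path of the sstable layer' merge above can be illustrated with a minimal Python sketch. It is not ScyllaDB's implementation: the class `DigestCheckingSource`, the exception `DigestMismatch` and the helper `read_all` are hypothetical names, chunks are assumed to be non-empty, and the digest is assumed to be a CRC32 over the whole stream. It only shows the idea of updating a running digest on every get() call and comparing it with the expected value on the call that reaches EOF.

```python
import zlib


class DigestMismatch(Exception):
    """Raised when the digest over the whole stream differs from the expected one."""


class DigestCheckingSource:
    """Hypothetical stand-in for a data source with optional digest checking."""

    def __init__(self, chunks, expected_digest):
        self._chunks = iter(chunks)       # assumed to yield non-empty byte strings
        self._expected = expected_digest  # assumed to be a CRC32 of the full data
        self._running = 0                 # CRC32 of the empty input
        self._done = False

    def get(self):
        """Return the next chunk, or b"" at EOF after verifying the digest."""
        if self._done:
            return b""
        chunk = next(self._chunks, None)
        if chunk is not None:
            # Fold this chunk into the digest accumulated so far.
            self._running = zlib.crc32(chunk, self._running)
            return chunk
        # This is the get() call that reaches EOF: compare the digests now.
        self._done = True
        if self._running != self._expected:
            raise DigestMismatch(
                f"expected {self._expected:#010x}, got {self._running:#010x}"
            )
        return b""

    def skip(self, n):
        # A digest check needs every byte, so skipping is treated as an error.
        raise RuntimeError("skip() is not allowed while digest checking is enabled")


def read_all(source):
    """Drain the source; the digest is verified on the final get() call."""
    out = bytearray()
    while True:
        chunk = source.get()
        if not chunk:
            return bytes(out)
        out += chunk
```

In this sketch a corrupted stream is only detected after the last chunk has been consumed, which mirrors the statement above that a digest check requires reading the whole file range.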

267 Commits