scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 06:05:53 +00:00

Author	SHA1	Message	Date
Vlad Zolotarov	79b0654d60	time_window_compaction_strategy: put expired sstables in a separate compaction task It's much more efficient to have a separate compaction task that consists entirely of expired sstables and make sure it gets a unique "weight" than to mix expired sstables with non-expired sstables, which adds unpredictable latency to the eviction of an expired sstable. This change also improves the visibility of eviction events because now they are always going to appear in the log as compactions that compact into an empty set. Fixes #9533 Signed-off-by: Vlad Zolotarov <vladz@scylladb.com> Closes #9534	2021-10-31 17:54:40 +02:00
Benny Halevy	531e32957d	compaction: time_window_compaction_strategy: get_reshaping_job: consider disjointness only when trimming With `062436829c`, we return all input sstables in strict mode if they are disjoint even if they don't need reshaping at all. This leads to an infinite reshape loop when uploading sstables with TWCS. The optimization for disjoint sstables is worth it also in relaxed mode, so this change first makes sorting of the input sstables by first_key order independent of reshape_mode, and then it adds a check for sstable_set_overlapping_count before trimming either the multi_window vector or any single_window bucket such that we don't trim the list if the candidates are disjoint. Adjust twcs_reshape_with_disjoint_set_test accordingly. And also add some debug logging in time_window_compaction_strategy::get_reshaping_job so one can figure out what's going on there. Test: unit(dev) DTest: cdc_snapshot_operation.py:CDCSnapshotOperationTest.test_create_snapshot_with_collection_list_with_base_rows_delete_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211025071828.509082-1-bhalevy@scylladb.com>	2021-10-28 14:35:51 +03:00
Raphael S. Carvalho	ec1a55ffae	compaction/TWCS: reduce write amp for reshape of sstables spanning multiple windows TWCS can reshape at most 32 sstables spanning multiple windows in a single compaction round. Which sstables are compacted together, when there are more than 32 sstables, is random. If sstables with overlapping windows are compacted together, then write amplification can be reduced because we may be able to push all the data to a window W in a single compaction round, so we'll not have to perform another compaction round later in W to reduce its number of files. This is also very good for reducing the number of transient file descriptors opened, because TWCS reshape first reshapes all sstables spanning multiple windows, so if all windows temporarily grow large in number of files, then there's a risk that file descriptors can be exhausted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Reviewed-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211013203046.233540-3-raphaelsc@scylladb.com>	2021-10-18 16:40:57 +03:00
Raphael S. Carvalho	062436829c	compaction/TWCS: optimize reshape for disjoint sstables spanning multiple windows After `a4053dbb72`, data segregation is postponed to offstrategy, so reshape procedure is called with disjoint sstables which belong to different windows, so let's extend the optimization for disjoint sstables which span more than one window. In this way, write amplification is reduced for offstrategy compaction, as all disjoint sstables will be compacted at once. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211013203046.233540-2-raphaelsc@scylladb.com>	2021-10-18 16:40:57 +03:00
Benny Halevy	5483269dfb	compaction_manager: pass owned_ranges via cleanup/upgrade options So they can be easily computed using an async task before constructing the compaction object in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:17:46 +03:00
Botond Dénes	cc65c9d0da	compaction: scrub/segregate: adjust partition-estimate as buckets accumulate Scrub compaction in segregate mode can split the input sstable into as many as hundreds or even thousands of output sstables in the extreme case. But even at a few dozen output sstables, most of these will only have a few partitions with a few rows. These sstables however will still have their bloom filter allocated according to the original partition-count estimate, causing memory bloat or even OOM in the extreme case. This patch solves this by aggressively adjusting the partition count downwards after the second bucket has been created. Each subsequent bucket will halve the partition estimate, which will quickly reach 1. Fixes: #9463 Closes #9464	2021-10-12 12:44:42 +03:00
Michael Livshin	e88891a8af	avoid race between compaction and table stop Also add a debug-only compaction-manager-side assertion that tests that no new compaction tasks were submitted for a table that is being removed (debug-only because not constant-time). Fixes #9448. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com> Message-Id: <20211007110416.159110-1-michael.livshin@scylladb.com>	2021-10-07 14:36:39 +03:00
Raphael S. Carvalho	59693e6da3	compaction_manager: make rewrite_sstables() bail out when asked to stop rewrite_sstables() can be asked to stop either on shutdown or on a user-triggered compaction which forces all ongoing compactions to stop, like scrub. It turns out we weren't actually bailing out from do_until() when the task cannot proceed. So rewrite_sstables() potentially runs into an infinite loop, which in turn causes shutdown or something else waiting on it to hang forever. Found this while auditing code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>	2021-10-07 10:46:22 +03:00
Avi Kivity	1bac93e075	Merge "simplifications and layer violation fix for compaction manager" from Raphael "This series removes layer violation in compaction, and also simplifies compaction manager and how it interacts with compaction procedure." * 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla: compaction: split compaction info and data for control compaction_manager: use task when stopping a given compaction type compaction: remove start_size and end_size from compaction_info compaction_manager: introduce helpers for task compaction_manager: introduce explicit ctor for task compaction: kill sstables field in compaction_info compaction: kill table pointer in compaction_info compaction: simplify procedure to stop ongoing compactions compaction: move management of compaction_info to compaction_manager compaction: move output run id from compaction_info into task	2021-10-04 13:09:31 +03:00
Botond Dénes	61e7d3de90	Merge 'Cleanup compaction_stop_exception' from Benny Halevy The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception` and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown or api event vs. compaction aborting due to scrub validation error. While at it, cleanup the existing retry logic related to `compaction_stop_exception`. Test: unit(dev) Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267) Closes #9321 * github.com:scylladb/scylla: compaction: split compaction_aborted_exception from compaction_stopped_exception compaction_manager: maybe_stop_on_error: rely on retry=false default compaction_manager: maybe_stop_on_error: sync return value with error message. compaction: drop retry parameter from compaction_stop_exception compaction_manager: move errors stats accounting to maybe_stop_on_error	2021-10-04 07:27:11 +03:00
Raphael S. Carvalho	9067a13eac	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:57 -03:00
Raphael S. Carvalho	87ce0c5d43	compaction_manager: use task when stopping a given compaction type compaction_info will eventually only be used for exporting data about ongoing compactions, so task must be used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:52 -03:00
Raphael S. Carvalho	cbd78be2dd	compaction: remove start_size and end_size from compaction_info those stats aren't used in compaction stats API and therefore they can be removed. end_size is added to compaction_result (needed for updating history) and start_size can be calculated in advance. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:45 -03:00
Raphael S. Carvalho	18f703e94b	compaction_manager: introduce helpers for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:41 -03:00
Raphael S. Carvalho	d4572a1bb5	compaction_manager: introduce explicit ctor for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:37 -03:00
Raphael S. Carvalho	38df9c68f8	compaction: kill sstables field in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:33 -03:00
Raphael S. Carvalho	90cfe895d4	compaction: kill table pointer in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:29 -03:00
Raphael S. Carvalho	4ce745e0b6	compaction: simplify procedure to stop ongoing compactions Today, compactions are tracked by both _compactions and _tasks, where _compactions refers to actual ongoing compaction tasks, whereas _tasks refers to manager tasks which are responsible for spawning new compactions, retrying them on failure, etc. As each task can only have one ongoing compaction at a time, let's move compaction into task, such that manager won't have to look at both when deciding to do something like stopping a task. So stopping a task becomes simpler, and duplication is naturally gone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:21 -03:00
Raphael S. Carvalho	efed06e2e4	compaction: move management of compaction_info to compaction_manager Today, compaction is calling compaction manager to register / deregister the compaction_info created by it. This is a layer violation because manager sits one layer above compaction, so manager should be responsible for managing compaction info. From now on, compaction_info will be created and managed by compaction_manager. compaction will only have a reference to info, which it can use to update the world about compaction progress. This will allow compaction_manager to be simplified as info can be coupled with its respective task, allowing duplication to be removed and layer violation to be fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:15:00 -03:00
Raphael S. Carvalho	1f5b17fdc5	compaction: move output run id from compaction_info into task this run id is used to track partial runs that are being written to. let's move it from info into task, as this is not an external info, but rather one that belongs to compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:13:20 -03:00
Raphael S. Carvalho	52302c3238	compaction_manager: prevent unbounded growth of pending tasks There will be unbounded growth of pending tasks if they are submitted faster than retiring them. That can potentially happen if memtables are frequently flushed too early. It was observed that this unbounded growth caused task queue violations as the queue will be filled with tons of tasks being reevaluated. By avoiding duplication in pending task list for a given table T, growth is no longer unbounded and consequently reevaluation is no longer aggressive. Refs #9331. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>	2021-09-30 16:49:52 +03:00
Botond Dénes	970fe9a339	mutation_writer: partition_based_splitting_writer: limit number of max buckets Recently we observed an OOM caused by the partition based splitting writer going crazy, creating 1.7K buckets while scrubbing an especially broken sstable. To avoid situations like that in the future, this patch provides a max limit for the number of live buckets. When the number of buckets reaches this limit, the largest bucket is closed and replaced by a new bucket. This will end up creating more output sstables during scrub overall, but now they won't all be written at the same time causing insane memory pressure and possibly OOM. Scrub compaction sets this limit to 100, the same limit the TWCS's timestamp based splitting writer uses (implemented through the classifier - time_window_compaction_strategy::max_data_segregation_window_count). Fixes: #9400 Tests: unit(dev) Closes #9401	2021-09-29 16:31:29 +03:00
Raphael S. Carvalho	9718173598	compaction: Update backlog tracker correctly when schema is updated Currently the following can happen: 1) there's ongoing compaction with input sstable A, so sstable set and backlog tracker both contain A. 2) ongoing compaction replaces input sstable A by B, so sstable set contains only B now. 3) schema is updated, so a new backlog tracker is built without A because sstable set now contains only B. 4) ongoing compaction tries to remove A from tracker, but it was excluded in step 3. 5) tracker can now have a negative value if table is decreasing in size, which leads to log(<negative number>) == -NaN This problem happens because backlog tracker updates are decoupled from sstable set updates. Given that the essential content of the backlog tracker should be the same as that of the sstable set, let's move tracker management to table. Whenever sstable set is updated, backlog tracker will be updated with the same changes, making their management less error prone. Fixes #9157 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:15:29 -03:00
Raphael S. Carvalho	afd45b9f49	compaction: Don't leak backlog of input sstable when compaction strategy is changed The generic backlog formula is: ALL + PARTIAL - COMPACTING With transfer_ongoing_charges() we already ignore the effect of ongoing compactions on COMPACTING as we judge them to be pointless. But ongoing compactions will run to completion, meaning that output sstables will be added to ALL anyway, in the formula above. With stop_tracking_ongoing_compactions(), input sstables are never removed from the tracker, but output sstables are added, which means we end up with duplicate backlog in the tracker. By removing this tracking mechanism, pointless ongoing compaction will be ignored as expected and the leaks will be fixed. Later, the intention is to force a stop on ongoing compactions if strategy has changed as they're pointless anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:03:28 -03:00
Raphael S. Carvalho	05126cfe29	compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() This new function makes it easier to remove monitor of exhausted sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 14:01:40 -03:00
Raphael S. Carvalho	35050a8217	compaction: simplify removal of monitors by switching to unordered_map, removal of generated monitors is made easier. this is a preparatory change for patch which will remove monitor for all exhausted sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-27 13:59:30 -03:00
Avi Kivity	d7ac699a55	Revert "Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael" This reverts commit `b5cf0b4489`, reversing changes made to `e8493e20cb`. It causes segmentation faults when sstable readers are closed. Fixes #9388.	2021-09-26 18:31:49 +03:00
Avi Kivity	bf94c06fc7	Revert "Merge "simplifications and layer violation fix for compaction manager" from Raphael" This reverts commit `7127c92acc`, reversing changes made to `88480ac504`. We need to revert `b5cf0b4489` to fix #9388, and this stands in the way. Ref #9388.	2021-09-26 18:30:36 +03:00
Raphael S. Carvalho	5bf51ced14	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:56:18 -03:00
Raphael S. Carvalho	6e7729fa21	compaction_manager: use task when stopping a given compaction type compaction_info will eventually only be used for exporting data about ongoing compactions, so task must be used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:53:53 -03:00
Raphael S. Carvalho	6d1170ac94	compaction: remove start_size and end_size from compaction_info those stats aren't used in compaction stats API and therefore they can be removed. end_size is added to compaction_result (needed for updating history) and start_size can be calculated in advance. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:41:13 -03:00
Raphael S. Carvalho	2353f40f63	compaction_manager: introduce helpers for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:39 -03:00
Raphael S. Carvalho	6820fbf460	compaction_manager: introduce explicit ctor for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:36 -03:00
Raphael S. Carvalho	d73a241a4e	compaction: kill sstables field in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:32 -03:00
Raphael S. Carvalho	b6b4042faf	compaction: kill table pointer in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:11 -03:00
Raphael S. Carvalho	98f8673d4e	compaction: simplify procedure to stop ongoing compactions Today, compactions are tracked by both _compactions and _tasks, where _compactions refers to actual ongoing compaction tasks, whereas _tasks refers to manager tasks which are responsible for spawning new compactions, retrying them on failure, etc. As each task can only have one ongoing compaction at a time, let's move compaction into task, such that manager won't have to look at both when deciding to do something like stopping a task. So stopping a task becomes simpler, and duplication is naturally gone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:25:51 -03:00
Raphael S. Carvalho	0885376a85	compaction: move management of compaction_info to compaction_manager Today, compaction is calling compaction manager to register / deregister the compaction_info created by it. This is a layer violation because manager sits one layer above compaction, so manager should be responsible for managing compaction info. From now on, compaction_info will be created and managed by compaction_manager. compaction will only have a reference to info, which it can use to update the world about compaction progress. This will allow compaction_manager to be simplified as info can be coupled with its respective task, allowing duplication to be removed and layer violation to be fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:00:49 -03:00
Raphael S. Carvalho	7688d0432c	compaction: move output run id from compaction_info into task this run id is used to track partial runs that are being written to. let's move it from info into task, as this is not an external info, but rather one that belongs to compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 09:56:01 -03:00
Raphael S. Carvalho	ff38f59f67	compaction: Update backlog tracker correctly when schema is updated Currently the following can happen: 1) there's ongoing compaction with input sstable A, so sstable set and backlog tracker both contain A. 2) ongoing compaction replaces input sstable A by B, so sstable set contains only B now. 3) schema is updated, so a new backlog tracker is built without A because sstable set now contains only B. 4) ongoing compaction tries to remove A from tracker, but it was excluded in step 3. 5) tracker can now have a negative value if table is decreasing in size, which leads to log(<negative number>) == -NaN This problem happens because backlog tracker updates are decoupled from sstable set updates. Given that the essential content of the backlog tracker should be the same as that of the sstable set, let's move tracker management to table. Whenever sstable set is updated, backlog tracker will be updated with the same changes, making their management less error prone. Fixes #9157 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:54:41 -03:00
Raphael S. Carvalho	0a3049908c	compaction: Don't leak backlog of input sstable when compaction strategy is changed The generic backlog formula is: ALL + PARTIAL - COMPACTING With transfer_ongoing_charges() we already ignore the effect of ongoing compactions on COMPACTING as we judge them to be pointless. But ongoing compactions will run to completion, meaning that output sstables will be added to ALL anyway, in the formula above. With stop_tracking_ongoing_compactions(), input sstables are never removed from the tracker, but output sstables are added, which means we end up with duplicate backlog in the tracker. By removing this tracking mechanism, pointless ongoing compaction will be ignored as expected and the leaks will be fixed. Later, the intention is to force a stop on ongoing compactions if strategy has changed as they're pointless anyway. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:36:05 -03:00
Raphael S. Carvalho	3dc1821287	compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() This new function makes it easier to remove monitor of exhausted sstables. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:16:41 -03:00
Raphael S. Carvalho	28ba8bde80	compaction: simplify removal of monitors by switching to unordered_map, removal of generated monitors is made easier. this is a preparatory change for patch which will remove monitor for all exhausted sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-20 15:06:37 -03:00
Benny Halevy	fa46bf3499	compaction: split compaction_aborted_exception from compaction_stopped_exception Indicate whether the compaction job should be aborted due to an error using a new, compaction_aborted_exception type, vs. compaction_stopped_exception that indicates the task should be stopped due to some external event that doesn't indicate an error (like shutdown or api call). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	eebe14e7bc	compaction_manager: maybe_stop_on_error: rely on retry=false default No need to set retry to false again in various catch clauses. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	ca2bb89180	compaction_manager: maybe_stop_on_error: sync return value with error message. It is misleading to set retry to true in the following statement and return it later on when the `will_stop` parameter is true. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	a1fe40278b	compaction: drop retry parameter from compaction_stop_exception Drop the retry parameter from compaction_stop_exception as it is always false. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:30 +03:00
Benny Halevy	9800dbe871	compaction_manager: move errors stats accounting to maybe_stop_on_error Currently, _stats.errors is not accounted for non-retryable errors like storage_io_error. Fixes #9354 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-09-19 12:20:22 +03:00
Avi Kivity	daf028210b	build: enable -Winconsistent-missing-override warning This warning can catch a virtual function that thinks it overrides another, but doesn't, because the two functions have different signatures. This isn't very likely since most of our virtual functions override pure virtuals, but it's still worth having. Enable the warning and fix numerous violations. Closes #9347	2021-09-15 12:55:54 +03:00
Raphael S. Carvalho	acba3bd3c4	sstables: give a more descriptive name to compaction_options the name compaction_options is confusing as it overlaps in meaning with compaction_descriptor. It is hard to reason about the exact difference between them without digging into the implementation. compaction_options is intended to only carry options specific to a given compaction type, like a mode for scrub, so let's rename it to compaction_type_options to make it clearer for the readers. [avi: adjust for scrub changes] Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>	2021-09-12 11:21:33 +03:00
Benny Halevy	389ef9316f	compaction: scrub/validate: prevent printing non-utf8 partition keys Corrupt keys might be printed as non-utf8 strings to the log, and that, in turn, may break applications reading the logs, such as Python (3.7) For example: ``` Traceback (most recent call last): File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown self.cleanUpCluster() File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster matches = node.grep_log(expr) File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log for line in f: File "/usr/lib64/python3.7/codecs.py", line 322, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte ``` Test: unit(dev) DTest: scrub_with_one_node_expect_data_loss_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>	2021-09-12 10:52:18 +03:00
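
The failure mode described in commit `389ef9316f` can be reproduced from the log reader's side with a few lines of Python. This is only a minimal sketch: the log line and the helper `decode_log_line` are hypothetical and are not part of Scylla, scylla-dtest, or ccm. It illustrates why a non-UTF-8 byte in a log breaks strict decoding, and one tolerant way to read such a line; it does not show the fix made in the commit itself.

```python
def decode_log_line(raw: bytes) -> str:
    """Decode one raw log line without raising on invalid UTF-8.

    Bytes that are not valid UTF-8 are rendered as \\xNN escapes, so a
    corrupt partition key in the log cannot abort the reader.
    (Hypothetical helper, for illustration only.)
    """
    return raw.decode("utf-8", errors="backslashreplace")


# Hypothetical log line containing the byte 0xb3 named in the traceback.
raw_line = b"scrub: invalid partition key \xb3 found\n"

try:
    raw_line.decode("utf-8")  # strict decoding, the default error mode
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

safe_line = decode_log_line(raw_line)
print(strict_failed)
print(safe_line, end="")
```

Strict decoding raises `UnicodeDecodeError` on the stray byte, while the `backslashreplace` error handler keeps the line readable and preserves the offending byte as an escape sequence.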


94 Commits