scylladb

Author	SHA1	Message	Date
Botond Dénes	7e0b51ff23	Merge 'Overhaul compaction_manager::task' from Benny Halevy The series overhauls the compaction_manager::task design and implementation by properly layering the functionality between the compaction_manager that deals with generic task execution, and the per-task business logic that is defined in a set of classes derived from the generic task class. While at it, the series introduces `task::state` and a set of helper functions to manage it to prevent leaks in the statistics, fixing #9974. Two more stats counters were exposed: `completed_tasks` and a new `postponed_tasks`. Test: sstable_compaction_test Dtest: compaction_test.py compaction_additional_test.py Fixes #9974 Closes #10122 * github.com:scylladb/scylla: compaction_manager: use coroutine::switch_to compaction_manager::task: drop _compaction_running compaction_manager: move per-type logic to derived task compaction_manager: task: add state enum compaction_manager: task: add maybe_retry compaction_manager: reevaluate_postponed_compactions: mark as noexcept compaction_manager: define derived task types compaction_manager: register_metrics: expose postponed_compactions compaction_manager: register_metrics: expose failed_compactions compaction_manager: register_metrics: expose _stats.completed_tasks compaction: add documentation for compaction_type to string conversions compaction: expose to_string(compaction_type) compaction_manager: task: standardize task description in log messages compaction_manager: refactor can_proceed compaction_manager: pass compaction_manager& to task ctor compaction_manager: use shared_ptr<task> rather than lw_shared_ptr compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once compaction_manager: use compaction_state::lock only to synchronize major and regular compaction	2022-03-10 13:33:56 +02:00
Benny Halevy	5e1fda7e1d	compaction_manager: use coroutine::switch_to Saving an allocation for running the functor as a task in the switched-to scheduling group. Also, switch to the desired scheduling group at the beginning of the task so that the higher level logic, like getting the list of sstables to compact will be performed under the desired scheduling group, not only the compaction code itself. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	8c66916652	compaction_manager::task: drop _compaction_running Replace the _compaction_running boolean member by calculating _state == state::active now that setup_new_compaction switches state to `active` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	a2a5e530f0	compaction_manager: move per-type logic to derived task Move the business logic into the task specific classes. Separating initialization during task construction, from the compaction_done task, moved into a do_run() method, and in some cases moving a lambda function that was called per table (as in rewrite_sstables) into a private method of the derived class. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:20:01 +02:00
Benny Halevy	2e6ce43a97	compaction_manager: task: add state enum Add an enum class representing the task state machine and a switch_state function to transition between the states and update the corresponding compaction_manager stats counters. Refs #9974 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 12:19:59 +02:00
Benny Halevy	9c59d66b7e	compaction_manager: task: add maybe_retry Replacing and combining compaction_manager methods: maybe_stop_on_error and put_task_to_sleep. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:37 +02:00
Benny Halevy	ee32be3aa5	compaction_manager: reevaluate_postponed_compactions: mark as noexcept To simplify error handling in following patches that will coroutinize task logic. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:37 +02:00
Benny Halevy	72162ed653	compaction_manager: define derived task types Turn task into a class, defining a clear hierarchy of private, protected, and public methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 11:35:35 +02:00
Benny Halevy	37694422dc	compaction_manager: register_metrics: expose postponed_compactions Provide a metric counting the number of tables with postponed compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	089d4442d8	compaction_manager: register_metrics: expose failed_compactions Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	8081f951d0	compaction_manager: register_metrics: expose _stats.completed_tasks Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	ffc314d506	compaction: add documentation for compaction_type to string conversions Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	28a74a2e90	compaction: expose to_string(compaction_type) To be used in the next patch to generate a string description from the compaction_type. In theory, we could use compaction_name() but the latter returns the compaction type in all-upper case and that is very different from what we print to the log today. The all-upper strings are used for the api layer, e.g. to stop tasks of a particular compaction type. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	20a8609392	compaction_manager: task: standardize task description in log messages Define task::describe and use it via operator<< to print the task metadata to the log in a standard way. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:18 +02:00
Benny Halevy	59863b317f	compaction_manager: refactor can_proceed Move the task-internal parts of can_proceed to a respective compaction_manager::task method, preparing for turning it into a class with a proper hierarchy of access to private members. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	33b2731a4a	compaction_manager: pass compaction_manager& to task ctor And use it to get the compaction state of the table to compact. It will be used in a later patch to manage the task state from task methods. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	20067b1050	compaction_manager: use shared_ptr<task> rather than lw_shared_ptr Prepare for defining per compaction type tasks derived from compaction_manager::task. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	cb2403e917	compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once Like all other maintenance operations, acquire the _maintenance_ops_sem once for the whole task, rather than for each sstable. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Benny Halevy	d0f693a517	compaction_manager: use compaction_state::lock only to synchronize major and regular compaction Maintenance operations like cleanup, upgrade, reshape, and reshard are serialized with major compaction using the _maintenance_ops_sem and they need no further synchronization with regular compaction by acquiring the per-table read lock. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-10 08:39:17 +02:00
Botond Dénes	105bf8888a	sstables: convert mx writer to v2 The sstables::sstable class has two methods for writing sstables: 1) sstable_writer get_writer(...); 2) future<> write_components(flat_mutation_reader, ...); (1) directly exposes the writer type, so we have to update all users of it (there are not that many) in this same patch. We defer updating users of (2) to follow-up commits.	2022-03-10 07:03:49 +02:00
Botond Dénes	7a37e30310	mutation_reader: convert compacting reader v2 Its input was already a v2 reader, now itself is also a v2 reader. With this commit, compaction.cc is finally v2 all-the-way.	2022-03-10 07:03:46 +02:00
Benny Halevy	11ea2ffc3c	compaction_manager: rewrite_sstables: do not acquire table write lock Since regular compaction may run in parallel no lock is required per-table. We still acquire a read lock in this patch, for backporting purposes, in case the branch doesn't contain `6737c88045`. But it can be removed entirely in master in a follow-up patch. This should solve some of the slowness in cleanup compaction (and likely in upgrade sstables seen in #10060, and possibly #10166). Fixes #10175 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10177	2022-03-09 09:13:46 +02:00
Benny Halevy	c7de2e0682	compaction: log info message when interrupting compaction Info messages are logged when compaction jobs start and finish but there is no message logged when the job is interrupted, e.g. when stopped by the compaction_manager. Refs scylladb/scylla-dtest#2468 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-03-07 11:43:58 +02:00
Benny Halevy	3b5ba5c1a9	compaction_manager: stop_tasks: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-3-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Benny Halevy	95cf4c1c6f	compaction_manager: coroutinize stop_tasks Simplify the function by implementing it as a coroutine, ensuring the input vector, holding the shared task ptrs, is kept alive throughout the lifetime of the function (instead of using do_with to achieve that) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-2-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Benny Halevy	d1d3c620b2	compaction_manager: embed task_stop into stop_tasks task_stop is called exclusively from stop_tasks. Now that stop_tasks calls task::stop() directly, there is no need for this separation, so open-code task_stop in stop_tasks, using coroutines. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302081547.2205813-1-bhalevy@scylladb.com>	2022-03-02 15:44:10 +02:00
Benny Halevy	0764e511bb	compaction_manager: perform_offstrategy: run_offstrategy_compaction in maintenance scheduling group It was assumed that offstrategy compaction is always triggered by streaming/repair where it would inherit the caller's scheduling group. However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see how the expiration of this timer will inherit anything from streaming/repair. Also, since `d309a86`, offstrategy compaction may be triggered by the api where it will run in the default scheduling group. The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`. Fixes #10151 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>	2022-03-02 15:36:28 +02:00
Benny Halevy	c6e0245f87	compaction_manager: get rid of the disable method It is unused. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220302080632.2183782-1-bhalevy@scylladb.com>	2022-03-02 11:13:39 +03:00
Raphael S. Carvalho	2dba0670ad	compaction: Fix time_window_backlog_tracker::replace_sstables() Introduced in commit: `ddd693c6d7` We're not emplacing newer windows in the tracker, causing std::out_of_range when replacing sstables for windows. Let's fix the logic and add a unit test to cover this. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>	2022-03-02 10:08:40 +02:00
Benny Halevy	1e15caa158	compaction_manager: setup_new_compaction: allow setting output_run_identifier Currently the output_run_identifier is assigned right after calling setup_new_compaction. Move setting the uuid to setup_new_compaction to simplify the flow. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083643.1845096-1-bhalevy@scylladb.com>	2022-03-02 09:50:59 +02:00
Benny Halevy	c9e06f1246	compaction_manager: task: get rid of the stopping member Instead, rely solely on the compaction_data.abort source that task::stop now uses to stop the task. This makes task stopping permanent, so it can't be undone (as used to be the case where task_stop set stopping to false after waiting for compaction_done, to allow rewrite_sstables's task to be created before calling run_with_compaction_disabled, and start running after it - which is no longer the case) Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083535.1844829-1-bhalevy@scylladb.com>	2022-03-01 16:46:09 +02:00
Benny Halevy	222389e0f5	compaction_manager: rewrite_sstables: retrieve sstable with compaction disabled before making task Currently, rewrite_sstables retrieved the sstables under run_with_compaction_disabled, after it's created a task for itself. This makes little sense as this task has not started running yet and therefore does not need to be stopped by run_with_compaction_disabled. This is currently worked around by setting task->stopping = false in task_stop(). This change just moves task creation in rewrite_sstables till after the sstables are retrieved and the deferred cleanup of _stats.pending_tasks till after it's first adjusted. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20220301083409.1844500-1-bhalevy@scylladb.com>	2022-03-01 16:45:33 +02:00
Benny Halevy	1768aae603	compaction_manager: rewrite_sstables: construct compacting_sstable_registration with compaction_manager& Rather than using a std::optional<compacting_sstable_registration> for lazy construction, construct the object early and call register_compacting when the sstables to register are available. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Benny Halevy	1584c50710	compaction_manager: compacting_sstable_registration: keep a compaction_manager& Rather than a compaction_manager* so that in the next patch it could be constructed with just that and the caller can call register_compacting when it has the sstables to register ready. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Benny Halevy	c008fb137b	compaction_manager: use unordered_set for compacting sstables registration It is more efficient than using a vector as the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-28 13:52:03 +02:00
Botond Dénes	d8833de3bb	Merge "Redefine Compaction Backlog to tame compaction aggressiveness" From Raphael S. Carvalho " Problem statement ================= Today, compaction can act much more aggressive than it really has to, because the strategy and its definition of backlog are completely decoupled. The backlog definition for size-tiered, which is inherited by all strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the world must reach the state of zero amplification. But that's unrealistic and goes against the intent amplification defined by the compaction strategy. For example, size tiered is a write oriented strategy which allows for extra space amplification for compaction to keep up with the high write rate. It can be seen today, in many deployments, that compaction shares is either close to 1000, or even stuck at 1000, even though there's nothing to be done, i.e. the compaction strategy is completely satisfied. When there's a single sstable per tier, for example. This means that whenever a new compaction job kicks in, it will act much more aggressive because of the high shares, caused by false backlog of the existing tables. This translates into higher P99 latencies and reduced throughput. Solution ======== This problem can be fixed, as proposed in the document "Fixing compaction aggressiveness due to suboptimal definition of zero backlog by controller" [1], by removing backlog of tiers that don't have to be compacted now, like a tier that has a single file. That's about coupling the strategy goal with the backlog definition. So once strategy becomes satisfied, so will the controller. Low-efficiency compaction, like compacting 2 files only or cross-tier, only happens when system is under little load and can proceed at a slower pace. Once efficient jobs show up, ongoing compactions, even if inefficient, will get more shares (as efficient jobs add to the backlog) so compaction won't fall behind. 
With this approach, throughput and latency are improved as cpu time is no longer stolen (unnecessarily) from the foreground requests. [1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ Results ======= Test sequentially populates 3 tables and then runs a mixed workload on them, where disk:memory ratio (usage) reaches ~30:1 at the peak. Please find graphs here: https://user-images.githubusercontent.com/1409139/153687219-32368a35-ac63-461b-a362-64dbe8449a00.png 1) Patched version started at ~01:30 2) On population phase, throughput increase and lower P99 write latency can be clearly observed. 3) On mixed phase, throughput increase and lower P99 write and read latency can also be clearly observed. 4) Compaction CPU time sometimes reaches ~100% because of the delay between each loader. 5) On unpatched version, it can be seen that backlog keeps growing even though strategies become satisfied, so compaction is using much more CPU time in comparison. Patched version correctly clears the backlog. Can also be found at: github.com/raphaelsc/scylla.git compaction-controller-v5 tests: UNIT(dev, debug). " * 'compaction-controller-v5' of https://github.com/raphaelsc/scylla: tests: Add compaction controller test test/lib/sstable_utils: Set bytes_on_disk for fake SSTables compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component compaction: Redefine compaction backlog to tame compaction aggressiveness compaction_backlog_tracker: Batch changes through a new replacement interface table: Disable backlog tracker when stopping table compaction_backlog_tracker: make disable() public compaction_backlog_tracker: Clear tracker state when disabled compaction: Add normalized backlog metric compaction: make size_tiered_compaction_strategy static	2022-02-25 09:21:08 +02:00
Benny Halevy	e2894bc762	compaction_manager: task: use plain UUID Now that a null uuid is defined to be logically false there's no need to use an optional UUID. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-25 07:26:11 +02:00
Raphael S. Carvalho	a8caa67937	compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component For describing data size, we use unsigned types. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:45 -03:00
Raphael S. Carvalho	1d9f53c881	compaction: Redefine compaction backlog to tame compaction aggressiveness Today, compaction can act much more aggressive than it really has to, because the strategy and its definition of backlog are completely decoupled. The backlog definition for size-tiered, which is inherited by all strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the world must reach the state of zero amplification. But that's unrealistic and goes against the intent amplification defined by the compaction strategy. For example, size tiered is a write oriented strategy which allows for extra space amplification for compaction to keep up with the high write rate. It can be seen today, in many deployments, that compaction shares is either close to 1000, or even stuck at 1000, even though there's nothing to be done, i.e. the compaction strategy is completely satisfied. When there's a single sstable per tier, for example. This means that whenever a new compaction job kicks in, it will act much more aggressive because of the high shares, caused by false backlog of the existing tables. This translates into higher P99 latencies and reduced throughput. Solution ======== This problem can be fixed, as proposed in the document "Fixing compaction aggressiveness due to suboptimal definition of zero backlog by controller" [1], by removing backlog of tiers that don't have to be compacted now, like a tier that has a single file. That's about coupling the strategy goal with the backlog definition. So once strategy becomes satisfied, so will the controller. Low-efficiency compaction, like compacting 2 files only or cross-tier, only happens when system is under little load and can proceed at a slower pace. Once efficient jobs show up, ongoing compactions, even if inefficient, will get more shares (as efficient jobs add to the backlog) so compaction won't fall behind. 
With this approach, throughput and latency are improved as cpu time is no longer stolen (unnecessarily) from the foreground requests. [1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ Fixes #4588. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 18:57:38 -03:00
Raphael S. Carvalho	ddd693c6d7	compaction_backlog_tracker: Batch changes through a new replacement interface This new interface allows table to communicate multiple changes in the SSTable set with a single call, which is useful on compaction completion for example. With this new interface, the size tiered backlog tracker will be able to know when compaction completed, which will allow it to recompute tiers and their backlog contribution, if any. Without it, tiered tracker would have to recompute tiers for every change, which would be terribly expensive. The old remove/add interfaces are being removed in favor of the new one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 15:34:16 -03:00
Raphael S. Carvalho	26350c8591	compaction_backlog_tracker: make disable() public Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:50 -03:00
Raphael S. Carvalho	c15e055612	compaction_backlog_tracker: Clear tracker state when disabled If the tracker is disabled, we never get to access the underlying implementation anymore. It makes sense to clear _impl on disable(). So table::stop() can call its backlog tracker's disable method, clearing all its state. This is important for clean shutdown, as any sstable in tracker state may cause sstable manager to hang when being stopped. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:39 -03:00
Raphael S. Carvalho	a70ce7ecb3	compaction: Add normalized backlog metric Normalized backlog metric is important for understanding the controller behavior as the controller acts on normalized backlog for yielding an output, not the raw backlog value in bytes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:33 -03:00
Raphael S. Carvalho	89eb563c94	compaction: make size_tiered_compaction_strategy static Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-02-24 13:40:29 -03:00
Pavel Emelyanov	dfb980e5f5	Merge 'compaction_manager: allow stopping sleeping tasks' from Benny Halevy Use exponential_backoff_retry::retry(abort_source&) when sleeping between retries and request abort when the task is stopped. Fixes #10112 Test: unit(dev) Closes #10113 * github.com:scylladb/scylla: compaction_manager: allow stopping sleeping tasks compaction_manager: task: add make_compaction_stopped_exception compaction_manager: task: refactor stop	2022-02-22 10:39:47 +03:00
Benny Halevy	57f97046a7	compaction_manager: allow stopping sleeping tasks Use exponential_backoff_retry::retry(abort_source&) when sleeping between retries and request abort when the task is stopped. Fixes #10112 Test: unit(dev) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 21:01:56 +02:00
Benny Halevy	f21b985872	compaction_manager: task: add make_compaction_stopped_exception Provide a function to make a sstables::compaction_stopped_exception based on the information in the stopped task. To be reused by the next patch that will also throw this exception from the retry sleep path, when the task is stopped. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 18:09:49 +02:00
Benny Halevy	91514c20ec	compaction_manager: task: refactor stop Refactor compaction_manager::task::stop out of compaction_manager::task_stop. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-02-21 18:04:06 +02:00
Botond Dénes	fb0e0ec7c1	mutation_reader: compacting_reader: require a v2 input reader Before we add a v2 output option to the compactor, we want to get rid of all the v1 inputs to make it simpler. This means that for a while the compacting reader will be in a strange place of having a v2 input and a v1 output. Hopefully, not for long.	2022-02-21 12:27:55 +02:00
Raphael S. Carvalho	a9427f150a	Revert "sstables/compaction_manager: rewrite_sstables(): resolve maintenance group FIXME" This reverts commit `4c05e5f966`. Moving cleanup to maintenance group made its operation time up to 10x slower than in the previous release. It's a blocker for the 4.6 release, so let's revert it until we figure this all out. Probably this happens because maintenance group is fixed at a relatively small constant, and cleanup may be incrementally generating backlog for regular compaction, where the former is fighting for resources against the latter. Fixes #10060. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>	2022-02-13 21:48:20 +02:00
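
Commit 2e6ce43a97 in the log above describes an enum class for the task state machine plus a switch_state function that keeps the compaction_manager stats counters in step with each transition, and 8c66916652 derives "compaction running" from `_state == state::active`. The following is a minimal C++ sketch of that idea, not the actual ScyllaDB code: every state name other than `active`, the layout of the `stats` struct, and the `active_tasks` and `errors` counters are assumptions made for illustration.

```cpp
#include <cstdint>

// Hypothetical counters. pending_tasks, completed_tasks and postponed_tasks
// echo names from the log; active_tasks and errors are assumed.
struct stats {
    int64_t pending_tasks = 0;
    int64_t active_tasks = 0;
    int64_t completed_tasks = 0;
    int64_t postponed_tasks = 0;
    int64_t errors = 0;
};

// Hypothetical state set; only `active` is named in the commit messages.
enum class state { none, pending, active, done, postponed, failed };

class task {
    stats& _stats;
    state _state = state::none;
public:
    explicit task(stats& s) : _stats(s) {}

    // Leaving through switch_state releases whatever gauge the task still holds,
    // so a task destroyed mid-flight cannot leak a pending/active/postponed count.
    ~task() { switch_state(state::none); }

    // Single choke point for transitions: undo the old state's gauge,
    // then account for the new state. Returns the previous state.
    state switch_state(state new_state) {
        const state old_state = _state;
        switch (old_state) {
        case state::pending:   --_stats.pending_tasks;   break;
        case state::active:    --_stats.active_tasks;    break;
        case state::postponed: --_stats.postponed_tasks; break;
        default: break; // none/done/failed hold no gauge
        }
        switch (new_state) {
        case state::pending:   ++_stats.pending_tasks;   break;
        case state::active:    ++_stats.active_tasks;    break;
        case state::postponed: ++_stats.postponed_tasks; break;
        case state::done:      ++_stats.completed_tasks; break; // monotonic counter
        case state::failed:    ++_stats.errors;          break; // monotonic counter
        default: break;
        }
        _state = new_state;
        return old_state;
    }

    // Replaces a separate boolean flag, as in "drop _compaction_running".
    bool compaction_running() const { return _state == state::active; }
};
```

Routing every transition, including destruction, through one function is what makes the counters hard to leak in this sketch: no code path can change the state without also adjusting the matching counter.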


276 Commits