Simplify the function by implementing it as a coroutine,
ensuring the input vector, holding the shared task ptrs, is
kept alive throughout the lifetime of the function
(instead of using do_with to achieve that)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-2-bhalevy@scylladb.com>
task_stop is called exclusively from stop_tasks,
Now that stop_tasks calls task::stop() directly,
there is no need for this separation, so open-code
task_stop in stop_tasks, using coroutines.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302081547.2205813-1-bhalevy@scylladb.com>
It was assumed that offstrategy compaction is always triggered by streaming/repair
where it would inherit the caller's scheduling group.
However, offstrategy is triggered by a timer via table::_off_strategy_trigger so I don't see
how the expiration of this timer will inherit anything from streaming/repair.
Also, since d309a86, offstrategy compaction
may be triggered by the api where it will run in the default scheduling group.
The bottom line is that the compaction manager needs to explicitly perform offstrategy compaction
in the maintenance scheduling group similar to `perform_sstable_scrub_validate_mode`.
Fixes#10151
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302084821.2239706-1-bhalevy@scylladb.com>
Introduced in commit: ddd693c6d7
We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.
Let's fix the logic and add an unit test to cover this.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
Currently the output_run_identifier is assigned right
after the calling setup_new_compaction.
Move setting the uuid to setup_new_compaction to simplify
the flow.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083643.1845096-1-bhalevy@scylladb.com>
Instead, rely solely on compaction_data.abort source
that is task::stop now uses to stop the task.
This makes task stopping permanent, so it can't be undone
(as used to be the case where task_stop
set stopping to false after waiting for compaction_done,
to allow rerite_sstables's task to be created before
calling run_with_compaction_disabled, and start
running after it - which is no longer the case)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083535.1844829-1-bhalevy@scylladb.com>
Currently, rewrite_sstables retrieved the sstables under
run_with_compaction_disabled, *after* it's created a task for itself.
This makes little sense as this task have not started running yet
and therefore does not need to be stopped by
run_with_compaction_disabled.
This is currently worked around by setting task->stopping = false
in task_stop().
This change just moves task create in rewrite_sstables till
after the sstables are retrieved and the deferred cleanup
of _stats.pending_tasks till after it's first adjusted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220301083409.1844500-1-bhalevy@scylladb.com>
Rather than using a std::optional<compacting_sstable_registration>
for lazy construction, construct the object early
and call register_compacting when the sstables to register
are available.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Rather than a compaction_manager* so that in the next
patch it could be constructed with just that and
the caller can call register_compacting when
it has the sstables to register ready.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
Problem statement
=================
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.
The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.
It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.
Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.
Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.
With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.
[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQ
Results
=======
Test sequentially populates 3 tables and then run a mixed workload on them,
where disk:memory ratio (usage) reaches ~30:1 at the peak.
Please find graphs here:
https://user-images.githubusercontent.com/1409139/153687219-32368a35-ac63-461b-a362-64dbe8449a00.png
1) Patched version started at ~01:30
2) On population phase, throughput increase and lower P99 write latency can be
clearly observed.
3) On mixed phase, throughput increase and lower P99 write and read latency can
also be clearly observed.
4) Compaction CPU time sometimes reach ~100% because of the delay between each
loader.
5) On unpatched version, it can be seen that backlog keeps growing even when
though strategies become satisfied, so compaction is using much more CPU time
in comparison. Patched version correctly clears the backlog.
Can also be found at:
github.com/raphaelsc/scylla.git compaction-controller-v5
tests: UNIT(dev, debug).
"
* 'compaction-controller-v5' of https://github.com/raphaelsc/scylla:
tests: Add compaction controller test
test/lib/sstable_utils: Set bytes_on_disk for fake SSTables
compaction/size_tiered_backlog_tracker.hh: Use unsigned type for inflight component
compaction: Redefine compaction backlog to tame compaction aggressiveness
compaction_backlog_tracker: Batch changes through a new replacement interface
table: Disable backlog tracker when stopping table
compaction_backlog_tracker: make disable() public
compaction_backlog_tracker: Clear tracker state when disabled
compaction: Add normalized backlog metric
compaction: make size_tiered_compaction_strategy static
Today, compaction can act much more aggressive than it really has to, because
the strategy and its definition of backlog are completely decoupled.
The backlog definition for size-tiered, which is inherited by all
strategies (e.g.: LCS L0, TWCS' windows), is built on the assumption that the
world must reach the state of zero amplification. But that's unrealistic and
goes against the intent amplification defined by the compaction strategy.
For example, size tiered is a write oriented strategy which allows for extra
space amplification for compaction to keep up with the high write rate.
It can be seen today, in many deployments, that compaction shares is either
close to 1000, or even stuck at 1000, even though there's nothing to be done,
i.e. the compaction strategy is completely satisfied.
When there's a single sstable per tier, for example.
This means that whenever a new compaction job kicks in, it will act much more
aggressive because of the high shares, caused by false backlog of the existing
tables. This translates into higher P99 latencies and reduced throughput.
Solution
========
This problem can be fixed, as proposed in the document "Fixing compaction
aggressiveness due to suboptimal definition of zero backlog by controller" [1],
by removing backlog of tiers that don't have to be compacted now, like a tier
that has a single file. That's about coupling the strategy goal with the
backlog definition. So once strategy becomes satisfied, so will the controller.
Low-efficiency compaction, like compacting 2 files only or cross-tier, only
happens when system is under little load and can proceed at a slower pace.
Once efficient jobs show up, ongoing compactions, even if inefficient, will get
more shares (as efficient jobs add to the backlog) so compaction won't fall
behind.
With this approach, throughput and latency is improved as cpu time is no longer
stolen (unnecessarily) from the foreground requests.
[1]: https://docs.google.com/document/d/1EQnXXGWg6z7VAwI4u8AaUX1vFduClaf6WOMt2wem5oQFixes#4588.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
The old remove/add interface are being removed in favor of the new one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the tracker is disabled, we never get to access the underlying
implementation anymore. It makes sense to clear _impl on
disable(). So table::stop() can call its backlog tracker's disable
method, clearing all its state. This is important for clean
shutdown, as any sstable in tracker state may cause sstable
manager to hang when being stopped.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Normalized backlog metric is important for understanding the controller
behavior as the controller acts on normalized backlog for yielding an
output, not the raw backlog value in bytes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.
Fixes#10112
Test: unit(dev)
Closes#10113
* github.com:scylladb/scylla:
compaction_manager: allow stopping sleeping tasks
compaction_manager: task: add make_compaction_stopped_exception
compaction_manager: task: refactor stop
Use exponential_backoff_retry::retry(abort_source&)
when sleeping between retries and request abort
when the task is stopped.
Fixes#10112
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Provide a function to make a sstables::compaction_stopped_exception
based on the information in the stopped task.
To be reused by the next patch that will
also throw this exception from the retry sleep path,
when the task is stopped.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we add a v2 output option to the compactor, we want to get rid of
all the v1 inputs to make it simpler. This means that for a while the
compacting reader will be in a strange place of having a v2 input and a
v1 output. Hopefully, not for long.
This reverts commit 4c05e5f966.
Moving cleanup to maintenance group made its operation time up to
10x slower than previous release. It's a blocker to 4.6 release,
so let's revert it until we figure this all out.
Probably this happens because maintenance group is fixed at a
relatively small constant, and cleanup may be incrementally
generating backlog for regular compaction, where the former is
fighting for resources against the latter.
Fixes#10060.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220213184306.91585-1-raphaelsc@scylladb.com>
Now that the table layer is using perform_offstrategy,
submit_offstrategy is no longer in use and can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Return a future from perform_offstrategy, resolved
when the offstrategy compaction completes so that callers
can wait on it.
submit_offstrategy still submits the offstrategy compaction
in the background.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
Fixes#9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
run_with_compaction_disabled() is used to temporarily disable compaction
for a table T. Not only regular compaction, but all types.
Turns out it's stopping all types but it's only preventing new regular
compactions from starting. So major for example can start even with
compaction temporarily disabled. This is fixed by not allowing
compaction of any type if disabled. This wasn't possible before as
scrub incorrectly ran entirely with compaction disabled, so it wouldn't
be able to start, but now it only disables compaction while retrieving
its candidate list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220107154942.59800-1-raphaelsc@scylladb.com>
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
Move the ::database, ::keyspace, and ::table classes to a new replica
namespace and replica/ directory. This designates objects that only
have meaning on a replica and should not be used on a coordinator
(but note that not all replica-only classes should be in this module,
for example compaction and sstables are lower-level objects that
deserve their own modules).
The module is imperfect - some additional classes like distributed_loader
should also be moved, but there is only one way to untie Gordian knots.
Closes#9872
* github.com:scylladb/scylla:
replica: move ::database, ::keyspace, and ::table to replica namespace
database: Move database, keyspace, table classes to replica/ directory
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.
Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev
Fixes#9836.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
"
Completes coroutinization of rewrite_sstables().
tests: UNIT(debug)
"
* 'rewrite_sstable_coroutinization' of https://github.com/raphaelsc/scylla:
compaction_manager: coroutinize main loop in sstable rewrite procedure
compaction_manager: coroutinize exception handling in sstable rewrite procedure
compaction_manager: mark task::finish_compaction() as noexcept
compaction_manager: make maybe_stop_on_error() more flexible