Now that submit() switched to table_state, compaction_reenabler
and friends can switch to table_state too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
auto_compaction_disabled_by_user is a configuration that can be enabled
or disabled on a particular table. We're adding this interface to
avoid having to push the configuration for every compaction_state,
which would result in redundant information as the configuration
value is the same for all table states.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The idea is that we'll have a single on-completion interface for both
"in-strategy" and off-strategy compactions, so not to pollute table_state
with one interface for each.
replica::table::on_compaction_completion is being moved into private namespace.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_manager needs this interface when setting the sstable
creation lambda in compaction_descriptor, which is then forwarded
into the actual compaction procedure.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
they're used to stop all ongoing compaction on behalf of a given table
T. Today, each table has a single table_state representing it, but after
we implement compaction groups, we'll need to call the procedure for
each group in a table. But the discussion doesn't belong here, as
compaction group work will only come later. By the time being, we're
only making compaction manager fully switch to table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The only external user of get_compactions() doesn't use any filtering,
so after table_state switch, one will be allowed to get all jobs
running associated with a table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
propagate_replacement is used by incremental compaction to notify
ongoing compaction about sstable list updates, such that the
ongoing job won't hold reference to exhausted sstables.
So it needs to switch to table_state, too.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
in_strategy_sstables() doesn't have to be implemented in table, as it's
simply about main set with maintenance and staging files filtered out.
Also, let's make it switch to table_state as part of ongoing work.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
manager stores a state for each table. As we're transitioning towards
table_state, the mapping of a table to compaction state will now use
table_state ptr as key. table_state ptr is stable and its lifetime
is the same as table.
we're temporarily adding a ptr to compaction_state, as there's lots
of dependency on replica::table, but we'll get rid of it once
we complete the transition.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's to be able to get table_state from table in subsequent patch,
as table only has a forward declaration to it in compaction_manager.hh
to avoid including database.hh.
Once everything is moved to table_state, then ctor can be moved
back into header.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Now update_sstable_lists_on_off_strategy_completion() and
on_compaction_completion() can be called from the same unified
interface.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
To make it possible to add a single interface in table_state for
updating sstable list on behalf of both off-strategy and in-strategy
compactions, update_sstable_lists_on_off_strategy_completion() will
work with compaction_completion_desc too for describing sstable set
changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
If the compaction_descriptor returned by time_window_compaction_strategy::get_sstables_for_compaction
is marked with has_only_fully_expired::yes it should always be compacted
since time_window_compaction_strategy::get_sstables_for_compaction is not idempotent.
It sets _last_expired_check and if compaction is postponed and retried before
expired_sstable_check_frequency has passed, it will not look for those fully-expired
sstables again. Plus, compacting them is the cheapest possible as it does not require
reading anything, just deleting the input sstables, so there's no reason not postpone it.
Fixes#10989
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.
Also, run major compaction in maintenance scheduling group.
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Fixes#10961Closes#10984
* github.com:scylladb/scylla:
compaction_manager: major_compaction_task: run in maintenance scheduling groupt
compaction_manager: allow regular compaction to run in parallel to major
"
The option controlls the IO bandwidth of the compaction sched class.
It's not set to be 16MB/s, but is unused. This set makes it 0 by
default (which means unlimited), live-updateable and plugs it to the
seastar sched group IO throttling.
branch: https://github.com/xemul/scylla/tree/br-compaction-throttling-3
tests: unit(dev),
v2: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1010/ ,
v2: manual config update
"
* 'br-compaction-throttling-3-a' of https://github.com/xemul/scylla:
compaction_manager: Add compaction throughput limit
updateable_value: Support dummy observing
serialized_action: Allow being observer for updateable_value
config: Tune the config option
We should separate the scheduling groups used for major compaction
from the the regular compaction scheduling group so that
the latter can be affected by the backlog tracker in case
backlog accumulates during a long running major compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
After acquiring the _compaction_state write lock,
select all sstables using get_candidates and register them
as compacting, then unlock the _compaction_state lock
to let regular compaction run in parallel.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Re-use eisting compaction_throughput_mb_per_sec option, push it down to
compaction manager via config and update the nderlying compaction sched
class when the option is (live)updated.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Change 8f39547d89 added
`handle_exception_type([] (const semaphore_aborted& e) {})`,
but it turned out that `named_semaphore_aborted` isn't
derived from `semaphore_aborted`, but rather from
`abort_requested_exception` so handle the base exception
instead.
Fixes#10666
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10881
compaction_manager needs to decide about running off-strategy
compaction or not based on the maintenance_set, not partly
in table::trigger_offstrategy_compaction and part in
the compaction_manager layer as it is done today.
So move the logic down to performa_offstrategy
that now returns future<bool> to return true
iff it performed offstrategy compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Move logging from run_offstrategy_compaction to do_run
so that in the next patch we can skip run_offstrategy_compaction
if the maintenance set is empty (but still log it,
for the sake of dtests.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
In order to wire-in the compaction_throughput_mb_per_sec the compaction
creation and stopping will need to be patched. Right now both places are
quite hairy, this set coroutinizes stop() for simpler adding of stopping
bits, unifies all the compaction manager constructors and adds the
compaction_manager::config for simpler future extending.
As a side effect the backlog_controller class gets an "abstract" sched
group it controlls which in turn will facilitate seastar sched groups
unification some day.
"
* 'br-compaction-manager-start-stop-cleanup' of https://github.com/xemul/scylla:
compaction_manager: Introduce compaction_manager::config
backlog_controller: Generalize scheduling groups
database: Keep compound flushing sched group
compaction_manager: Swap groups and controller
compaction_manager: Keep compaction_sg on board
compaction_manager: Unify scheduling_group structures
compaction_manager: Merge static/dynamic constructors
compaction_manager: Coroutinuze really_do_stop()
compaction_manager: Shuffle really_do_stop()
compaction_manager: Remove try-catch around logger
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.
we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
This is to make it constructible in a way most other services are -- all
the "scalar" parameters are passed via a config.
With this it will be much shorter to add compaction bandwidth throttling
option by just extending the config itself, not the list of constructor
arguments (and all its callers).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make struct scheduling_group be sub-class of the backlog controller. Its
new meaning is now -- the group under controller maintenance. Both
database and compaction manager derive their sched groups from this one.
This makes backlog controller construction simpler, prepares the ground
for sched groups unification in seastar and facilitates next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is mainly to make next patch simpler. Also this makes the backlog
controller API smaller by removing its sg() method.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only difference between those two are in the way backlog controller
is created. It's much simpler to have the controller construction logic
in compaction manager instead. Similar "trick" is used to construct
flush controller for the database.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This way it's more compact and easier to extend.
Also it's small enough to fix indentation right at once.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Make it the future-returning method and setup the _stop_future in its
only caller. Makes next patch much simpler
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased agains newer master, which contains fixes that failed the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)
Closes#10507
* github.com:scylladb/scylla:
compaction: Release compaction weight before updating history.
compaction: Inline compact_sstables_and_update_history call.
compaction: Extract compact_sstables function
compaction: Rename compact_sstables to compact_sstables_and_update_history
compaction: Extract update_history function
compaction: Extract should_update_history function.
compaction: Fetch start_size from compaction_result
compaction: Add tracking start_size in compaction_result.
The backlog definition for leveled is incorrectly built on the assumption that
the world must reach the state of zero amplification, i.e. everything in the
last level. The actual goal is space amplification of 1.1.
In reality, LCS just wants that for every level L, level L is fan_out=10 times
larger than L-1. See more in commit 9de7abdc80 which adjusts LCS to conform
to this goal.
If level 3 = 1000G, level 2 = 100G, level 1 = 10G, level 0 = 1G, that should
return zero backlog as space amplification is (1000+100+10+1)/1000 = ~1.1
But today, LCS calculates high backlog for the layout above, as it will only be
satisfied once everything is promoted to the maximum level. That's completely
disconnected from what the strategy actually wants. Therefore, a mismatch.
With today's definition, the backlog for any SSTable is:
sizeof(sstable) * (Lmax - levelof(sstable)) * fan_out
where Lmax = maximum level,
and fan_out = LCS' fan out which is 10 by default
That's essentially calculating the total cost for data in the SSTable to climb
up to the maximum level. Of course, if a SSTable is at the maximum level,
(Lmax - levelof(sstable)) returns zero, therefore backlog for it is zero.
Take a look at this example:
If L0 sstable is 0.16G, then its backlog = 0.16G * (3 - 0) * 10 = 4.8G
0.16G = LCS' default fragment size
Maximum level (Lmax in formula) can be easily 3 as:
log10 of (30G/0.16G=~187 sstables)) = ~2.27
~2.27 means that data has exceeded level 2 capacity and so needs 3 levels.
So 3 L0 sstables could add ~15G of backlog. With 1G memory per shard (30:1 disk
memory ratio), that's normalized backlog of ~15, which translates into
additional ~500 shares. That's halfway to full compaction speed.
With more files in higher levels, we can easily get to a normalized backlog
above 30, resulting in 1k shares.
The suboptimal backlog definition causes either table using LCS or coexisting
tables to run with more shares than needed, causing compaction to steal
resources, resulting in higher latency and reduced throughput.
To solve this problem, a new formula is used which will basically calculate
the amount of work needed to achieve the layout goal. We no longer want to
promote everything to the last level, but instead we'll incrementally calculate
the backlog in each level L, which is the amount of work needed such that the
next level L + 1 is at least fan_out times bigger.
Fixes#10583.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We know in advance the maximum number of
sstable generations to track, so reserve space for it
to prevent vector reallocation for large number of sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>