rwlock was added to protect iterations against concurrent updates to the map.
the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).
the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).
to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.
Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.
Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.
Fixes#18821.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c539b7c861)
Currently if task_manager::task::impl::abort preempts before children are recursively aborted and then the task gets unregistered, we hit use after free since abort uses children vector which is no longer alive.
Modify abort method so that it goes over all tasks in task manager and aborts those with the given parent.
Fixes: https://github.com/scylladb/scylladb/issues/19304.
Requires backport to all versions containing task manager
(cherry picked from commit 3463f495b1)
(cherry picked from commit 50cb797d95)
Refs https://github.com/scylladb/scylladb/pull/19305Closesscylladb/scylladb#19437
* github.com:scylladb/scylladb:
test: add test for abort while a task is being unregistered
tasks: fix tasks abort
After this, TWCS reshape procedure can be changed to limit job
to 10% of available space.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 0ce8ee03f1)
Currently if task_manager::task::impl::abort preempts before children
are recursively aborted and then the task gets unregistered, we hit
use after free since abort uses children vector which is no
longer alive.
Modify abort method so that it goes over all tasks in task manager
and aborts those with the given parent.
Fixes: #19304.
(cherry picked from commit 3463f495b1)
before this change, we rely on the default-generated fmt::formatter created from operator<<, but fmt v10 dropped the default-generated formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h` for formatting the container types, like vector, map optional and variant using {fmt} instead of the homebrew formatter based on operator<<.
with this change, the changes adding fmt::formatter and the changes using ostream formatter explicitly, we are allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Closesscylladb/scylladb#17968
* github.com:scylladb/scylladb:
treewide: do not define FMT_DEPRECATED_OSTREAM
treewide: include fmt/ranges.h and/or fmt/std.h
utils/managed_bytes: add support for fmt::to_string() to bytes and friends
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map
optional and variant using {fmt} instead of the homebrew
formatter based on operator<<.
with this change, the changes adding fmt::formatter and
the changes using ostream formatter explicitly, we are
allowed to drop `FMT_DEPRECATED_OSTREAM` macro.
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
This hourly reevaluation is there to help tablets that have very low
write activity, which can go a long time without flushing a memtable,
and it's important to reevaluate compaction as data can get expired.
Today it can happen that we reevaluate a table that is being compacted
actively, which is waste of cpu as the reevaluation will happen anyway
when there are changes to sstable set. This waste can be amplified with
a significant tablet count in a given shard.
Eventually, we could make the revaluation time per table based on
expiration histogram, but until we get there, let's avoid this waste
by only reevaluating tables that are compaction idle for more than 1h.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#18280
The inject_with_handler() method accepts a coroutine that can be called
wiht injection_handler. With such function as an argument, there's no
need in distinctive inject_with_handler() name for a method, it can be
overload of all the existing inject()-s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since commit f1bbf70, many compaction types can do cleanup work, but turns out
we forgot to invalidate cache on their completion.
So if a node regains ownership of token that had partition deleted in its previous
owner (and tombstone is already gone), data can be resurrected.
Tablet is not affected, as it explicitly invalidates cache during migration
cleanup stage.
Scylla 5.4 is affected.
Fixes#17501.
Fixes#17452.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#17502
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closesscylladb/scylladb#17428
While the cleanup is ongoing. Otherwise, a concurrent table drop might
trigger a use-after-free, as we have seen in dtests recently.
Fixes: #16770Closesscylladb/scylladb#16874
3b424e391b introduced a loop
in `perform_cleanup` that waits until all sstables that require
cleanup are cleaned up.
However, with f1bbf705f9,
an sstable that is_eligible_for_compaction (i.e. it
is not in staging, awaiting view update generation),
may already be compacted by e.g. regular compaction.
And so perform_cleanup should interrupt that
by calling try_perform_cleanup, since the latter
reevaluates `update_sstable_cleanup_state` with
compaction disabled - that stops ongoing compactions.
Refs scylladb/scylladb#15673
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If for some reason an exception is thrown in compaction_manager::remove,
it might leave behind stale table pointers in _compaction_state. Fix
that by setting up a deffered action to perform the cleanup.
Fixes#16635
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#16632
The task for splitting compaction will run until all sstables
in the main set are split. The only exceptions are shutdown
or user has explicitly asked for abort.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Set top level compaction tasks as abortable.
Compaction tasks which have no children, i.e. compaction task
executors, have abort method overriden to stop compaction data.
This reverts commit 11cafd2fc8, reversing
changes made to 2bae14f743.
Reverting because this series causes frequent CI failures, and the
proposed quickfix causes other failures of its own.
Fixes: #16113
Set top level compaction tasks as abortable.
Compaction tasks which have no children, i.e. compaction task
executors, have abort method overriden to stop compaction data.
The polling loop was intended to ignore
`condition_variable_timed_out` and check for progress
using a longer `max_idle_duration` timeout in the loop.
Fixes#15669
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#15671
This reverts commit 2860d43309, reversing
changes made to a3621dbd3e.
Reverting because rest_api.test_compaction_task started failing after
this was merged.
Fixes: #16005
Set top level compaction tasks as abortable.
Compaction tasks which have no children, i.e. compaction task
executors, have abort method overriden to stop compaction data.
Keep compaction_progress_monitor in compaction_task_executor and pass a reference
to it further, so that the compaction progress could be retrieved out of it.
Before integration with task manager the state of one shard repair
was kept in repair_info. repair_info object was destroyed immediately
after shard repair was finished.
In an integration process repair_info's fields were moved to
shard_repair_task_impl as the two served the similar purposes.
Though, shard_repair_task_impl isn't immediately destoyed, but is
kept in task manager for task_ttl seconds after it's complete.
Thus, some of repair_info's fields have their lifetime prolonged,
which makes the repair state change delayed.
Release shard_repair_task_impl resources immediately after shard
repair is finished.
Fixes: #15505.
Closesscylladb/scylladb#15506
SSTable runs work hard to keep the disjointness invariant, therefore they're
expensive to build from scratch.
For every insertion, it keeps the elements sorted by their first key in
order to reject insertion of element that would introduce overlapping.
Additionally, a sstable run can grow to dozens of elements (or hundreds)
therefore, we can also make interaction with compaction strategies more
efficient by not copying them when building a list of candidates in compaction
manager. And less fragile by filtering out any sstable runs that are not
completely eligible for compaction.
Previously, ICS had to give up on using runs managed by sstable set due to
fragility of the interface (meaning runs are being built from scratch
on every call to the strategy, which is very inefficient, but that had to
be done for correctness), but now we can restore that.
Closesscylladb/scylladb#15440
* github.com:scylladb/scylladb:
compaction: Switch to strategy_control::candidates() for regular compaction
tests: Prepare sstable_compaction_test for change in compaction_strategy interface
compaction: Allow strategy to retrieve candidates either as sstables or runs
compaction: Make get_candidates() work with frozen_sstable_run too
sstables: add sstable_run::run_identifier()
sstables: tag sstable_run::insert() with nodiscard
sstables: Make all_sstable_runs() more efficient by exposing frozen shared runs
sstables: Simplify sstable_set interface to retrieve runs
Most of the time only the roots of tasks tree should be non internal.
Change default implementation of is_internal and delete overrides
consistent with it.
Closesscylladb/scylladb#15353
Now everything is prepared for the switch, let's do it.
Now let's wait for ICS to enjoy the set of changes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's needed for upcoming changes that will allow ICS to efficiently
retrieve sstable runs.
Next patch will remove candidates from compaction_strategy's interface
to retrieve candidates using this one instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to be mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes https://github.com/scylladb/scylladb/issues/14992.
Closesscylladb/scylladb#15400
* github.com:scylladb/scylladb:
test: Verify that off-strategy can do incremental compaction
compaction: Clear pending_replacement list when tombstone GC is disabled
compaction: Enable incremental compaction on off-strategy
compaction: Extend reshape type to allow for incremental compaction
compaction: Move reshape_compaction in the source
compaction: Enable incremental compaction only if replacer callback is engaged
Off-strategy suffers with a 100% space overhead, as it adopted
a sort of all or nothing approach. Meaning all input sstables,
living in maintenance set, are kept alive until they're all
reshaped according to the strategy criteria.
Input sstables in off-strategy are very likely to mostly disjoint,
so it can greatly benefit from incremental compaction.
The incremental compaction approach is not only good for
decreasing disk usage, but also memory usage (as metadata of
input and output live in memory), and file desc count, which
takes memory away from OS.
Turns out that this approach also greatly simplifies the
off-strategy impl in compaction manager, as it no longer have
to maintain new unused sstables and mark them for
deletion on failure, and also unlink intermediary sstables
used between reshape rounds.
Fixes#14992.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, the moved-object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
The move-assignment operator has a similar problem,
therefore it is deleted in this series, and the function is
renamed to `transfer_backlog` that just doesn't deal with the
moved-from registration. This is safe since it's only used internally
by the compaction manager.
Fixes#15248Closesscylladb/scylladb#15445
* github.com:scylladb/scylladb:
compaction_state: store backlog_track in std::optional
compaction_backlog_tracker: do not allow moving registered trackers
So that replacing it will destroy the previous tracker
and unregister it before assigning the new one and
then registering it.
This is safer than assiging it in place.
With that, the move assignment operator is not longer
used and can be deleted.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, the moved-object's manager pointer is moved into the
constructed object, but without fixing the registration to
point to the moved-to object, causing #15248.
Although we could properly move the registration from
the moved-from object to the moved-to one, it is simpler
to just disallow moving a registered tracker, since it's
not needed anywhere. This way we just don't need to mess
with the trackers' registration.
With that in mind, when move-assigning a compaction_backlog_tracker
the existing tracker can remain registered.
Fixesscylladb/scylladb#15248
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before compaction_task_executor::do_run is called, the executor can
be already aborted. Check if compaction was stopped and set
_compaction_done to exceptional future.
For compaction_task_executors, unlike for all other task manager
tasks, run method does not embrace operations performed in a scope
of a task, but only waits until shared_future connected with
the operations is resolved.
Apart from breaking task manager task conventions, such a run method
must consider all corner cases, not to break task manager or
compaction manager functionality.
To fix existing and prevent further bugs related to task manager
and compaction manager coexistence, call perform_task inside
run method and wait for it in a standard way.
Executors that are not going to be reflected in task manager run call
perform_task the old way.
Make sure the compaction_state:s are idle before
they are destroyed. Although all tasks are stopped
in stop_ongoing_compactions, make sure there is
fiber holding the compaction_state gate.
compaction_manager::remove now needs to close the
compaction_state gate and to stop_ongoing_compactions
only if the gate is not closed yet.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Check if the compaction_state gate is closed
along with _state != state::enabled and return early
in this case.
At this point entering the gate is guaranteed to succeed.
So enter the gate before calling `perform_compaction`
keeping the std::optional<gate_holder> throughout
the compaction task.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Loop in shard_reshaping_compaction_task_impl::run relies on whether
sstables::compaction_stopped_exception is thrown from run_custom_job.
The exception is swallowed for each type of compaction
in compaction_manager::perform_task.
Rethrow an exception in perfrom task for reshape compaction.
Fixes: #15058.
Closes#15067