Commit Graph

620 Commits

Author SHA1 Message Date
Kefu Chai
057701299c compaction_manager: remove unnecessary include
also, remove unnecessary forward declarations.

* compaction_manager_test_task_executor is only referenced
  in the friend declaration. but this declaration does not need
  a forward declaration of the friend class
* compaction_manager_test_task_executor is not used anywhere.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14680
2023-07-13 14:59:39 +03:00
Kefu Chai
3a67c31df0 compaction_manager: pass const reference to ctor
the callers of the constructor does not move variable into this
parameter, and the constructor itself is not able to consume it.
as the parameter is a vector while `compaction_sstable_registration`
use an `unordered_set` for tracking the sstables being compacted.

so, to avoid creating a temporary copy of the vector, let's just
pass by reference.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14661
2023-07-13 11:19:44 +03:00
Botond Dénes
968421a3e0 Merge 'Stop task manager compaction module properly' from Aleksandra Martyniuk
Due to wrong order of stopping of compaction services, shutdown needs
to wait until all compactions are complete, which may take really long.

Moreover, test version of compaction manager does not abort task manager,
which is strictly bounded to it, but stops its compaction module. This results
in tests waiting for compaction task manager's tasks to be unregistered,
which never happens.

Stopping and aborting of compaction manager and task manager's compaction
module are performed in a proper order.

Closes #14461

* github.com:scylladb/scylladb:
  tasks: test: abort task manager when wrapped_compaction_manager is destructed
  compaction: swap compaction manager stopping order
  compaction: modify compaction_manager::stop()
2023-07-12 09:54:00 +03:00
Avi Kivity
1545ae2d3b Merge 'Make SSTable cleanup more efficient by fast forwarding to next owned range' from Raphael "Raph" Carvalho
Today, SSTable cleanup skips to the next partition, one at a time, when it finds that the current partition is no longer owned by this node.

That's very inefficient because when a cluster is growing in size, existing nodes lose multiple sequential tokens in its owned ranges. Another inefficiency comes from fetching index pages spanning all unowned tokens, which was described in https://github.com/scylladb/scylladb/issues/14317.

To solve both problems, cleanup will now use multi range reader, to guarantee that it will only process the owned data and as a result skip unowned data. This results in cleanup scanning an owned range and then fast forwarding to the next one, until it's done with them all. This reduces significantly the amount of data in the index caching, as index will only be invoked at each range boundary instead.

Without further ado,

before:

`INFO  2023-07-01 07:10:26,281 [shard 0] compaction - [Cleanup keyspace2.standard1 701af580-17f7-11ee-8b85-a479a1a77573] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s8o_06uww24drzrroaodpv-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.`

after:

`INFO  2023-07-01 07:07:52,354 [shard 0] compaction - [Cleanup keyspace2.standard1 199dff90-17f7-11ee-b592-b4f5d81717b9] Cleaned 1 sstables to [./tmp/1/keyspace2/standard1-b490ee20179f11ee9134afb16b3e10fd/me-3g7a_0s4m_5hehd2rejj8w15d2nt-big-Data.db:level=0]. 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.`

Fixes #12998.
Fixes #14317.

Closes #14469

* github.com:scylladb/scylladb:
  test: Extend cleanup correctness test to cover more cases
  compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
  sstables: Close SSTable reader if index exhaustion is detected in fast forward call
  sstables: Simplify sstable reader initialization
  compaction: Extend make_sstable_reader() interface to work with mutation_source
  test: Extend sstable partition skipping test to cover fast forward using token
2023-07-11 23:28:15 +03:00
Raphael S. Carvalho
8d58ff1be6 compaction: Make SSTable cleanup more efficient by fast forwarding to next owned range
Today, SSTable cleanup skips to the next partition, one at a time, when it finds
that the current partition is no longer owned by this node.

That's very inefficient because when a cluster is growing in size, existing
nodes lose multiple sequential tokens in its owned ranges. Another inefficiency
comes from fetching index pages spanning all unowned tokens, which was described
in #14317.

To solve both problems, cleanup will now use multi range reader, to guarantee
that it will only process the owned data and as a result skip unowned data.
This results in cleanup scanning an owned range and then fast forwarding to the
next one, until it's done with them all. This reduces significantly the amount
of data in the index caching, as index will only be invoked at each range
boundary instead.

Without further ado,

before:

... 2GB to 1GB (~50% of original) in 26248ms = 81MB/s. ~9443072 total partitions merged to 4750028.

after:

... 2GB to 1GB (~50% of original) in 17424ms = 123MB/s. ~9443072 total partitions merged to 4750028.

Fixes #12998.
Fixes #14317.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-11 13:56:24 -03:00
Michał Chojnowski
b511d57fc8 Revert "Merge 'Compaction resharding tasks' from Aleksandra Martyniuk"
This reverts commit 2a58b4a39a, reversing
changes made to dd63169077.

After patch 87c8d63b7a,
table_resharding_compaction_task_impl::run() performs the forbidden
action of copying a lw_shared_ptr (_owned_ranges_ptr) on a remote shard,
which is a data race that can cause a use-after-free, typically manifesting
as allocator corruption.

Note: before the bad patch, this was avoided by copying the _contents_ of the
lw_shared_ptr into a new, local lw_shared_ptr.

Fixes #14475
Fixes #14618

Closes #14641
2023-07-11 19:11:37 +03:00
Raphael S. Carvalho
3b1829f0d8 compaction: base compaction throughput on amount of data read
Today, we base compaction throughput on the amount of data written,
but it should be based on the amount of input data compacted
instead, to show the amount of data compaction had to process
during its execution.

A good example is a compaction which expire 99% of data, and
today throughput would be calculated on the 1% written, which
will mislead the reader to think that compaction was terribly
slow.

Fixes #14533.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14615
2023-07-11 15:48:05 +03:00
Raphael S. Carvalho
bd50943270 compaction: Extend make_sstable_reader() interface to work with mutation_source
As the goal is to make compaction filter to the next owned range,
make_sstable_reader() should be extended to create a reader with
parameters forwarded from mutation_source interface, which will
be used when wiring cleanup with multi range reader.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-07-10 17:19:30 -03:00
Avi Kivity
0cabf4eeb9 build: disable implicit fallthrough
Prevent switch case statements from falling through without annotation
([[fallthrough]]) proving that this was intended.

Existing intended cases were annotated.

Closes #14607
2023-07-10 19:36:06 +02:00
Aleksandra Martyniuk
529c703143 compaction: swap compaction manager stopping order
task_manager::module::stop() waits till all compactions are complete.
Thus, ongoing compactions should be aborted before stop() is called
not to prolong shutdown process.

Task manager's compaction module is stopped after
compaction_manager::do_stop(), which aborts ongoing compactions,
is called.
2023-07-09 12:05:49 +02:00
Aleksandra Martyniuk
a59485b6da compaction: modify compaction_manager::stop()
In compaction_manager::stop(), do_stop() is called unconditionally.
It relies on do_stop to return immediately when _state == none.
2023-07-09 12:04:14 +02:00
Aleksandra Martyniuk
87c8d63b7a compaction: add shard_reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction on one shard.
2023-06-28 11:43:12 +02:00
Aleksandra Martyniuk
db6e4a356b compaction: invoke resharding on sharded database
In reshard_sstables_compaction_task_impl::run() we call
sharded<sstables::sstable_directory>::invoke_on_all. In lambda passed
to that method, we use both sharded sstable_directory service
and its local instance.

To make it straightforward that sharded and local instances are
dependend, we call sharded<replica::database>::invoke_on_all
instead and access local directory through the sharded one.
2023-06-28 11:43:12 +02:00
Aleksandra Martyniuk
1acaed026a compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run() 2023-06-28 11:43:11 +02:00
Aleksandra Martyniuk
837d77ba8c compaction: add reshard_sstables_compaction_task_impl
Add task manager's task covering resharding compaction.
2023-06-28 11:41:43 +02:00
Aleksandra Martyniuk
0d6dd3eeda compaction: replica: copy struct and functions from distributed_loader.cc
As a preparation for integrating resharding compaction with task manager
a struct and some functions are copied from replica/distributed_loader.cc
to compaction/task_manager_module.cc.
2023-06-28 11:41:42 +02:00
Aleksandra Martyniuk
2b4874bbf7 compaction: create resharding_compaction_task_impl
resharding_compaction_task_impl serves as a base class of all
concrete resharding compaction task classes.
2023-06-28 11:36:53 +02:00
Benny Halevy
3ca0c6c0a5 compaction_manager: try_perform_cleanup: set owned_ranges_ptr with compaction disabled
Otherwise regular compaction can sneak in and
see !cs.sstables_requiring_cleanup.empty() with
cs.owned_ranges_ptr == nullptr and trigger
the internal error in `compaction_task_executor::compact_sstables`.

Fixes scylladb/scylladb#14296

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14297
2023-06-27 08:47:13 +03:00
Raphael S. Carvalho
83c70ac04f utils: Extract pretty printers into a header
Can be easily reused elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-06-26 21:58:20 -03:00
Botond Dénes
b23361977b Merge 'Compaction reshape tasks' from Aleksandra Martyniuk
Task manager's tasks covering resharding compaction
on top and shard level.

Closes #14112

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py to test reshaping compaction
  compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run()
  compaction: add shard_reshaping_compaction_task_impl
  replica: delete unused function
  compaction: add table_reshaping_compaction_task_impl
  compaction: copy reshape to task_manager_module.cc
  compaction: add reshaping_compaction_task_impl
2023-06-26 11:56:07 +03:00
Aleksandra Martyniuk
197635b44b compaction: delete generation of new sequence number for table tasks
Compaction tasks covering table major, cleanup, offstrategy,
and upgrade sstables compaction inherit sequence number from their
parents. Thus they do not need to have a new sequence number
generated as it will be overwritten anyway.

Closes #14379
2023-06-26 10:36:10 +03:00
Aleksandra Martyniuk
f9a527b06d compaction: move reshape function to shard_reshaping_table_compaction_task_impl::run() 2023-06-23 16:22:53 +02:00
Aleksandra Martyniuk
1960904a72 compaction: add shard_reshaping_compaction_task_impl
shard_reshaping_compaction_task_impl covers reshaping compaction
on one shard.
2023-06-23 16:22:38 +02:00
Aleksandra Martyniuk
e3e2d6b886 compaction: add table_reshaping_compaction_task_impl 2023-06-23 15:57:37 +02:00
Aleksandra Martyniuk
dace5fb004 compaction: copy reshape to task_manager_module.cc
distributed_loader::reshape is copied to compaction/task_manager_module.cc
as it will be used in reshape compaction tasks.
2023-06-23 12:53:16 +02:00
Aleksandra Martyniuk
981a50e490 compaction: add reshaping_compaction_task_impl
reshaping_compaction_task_impl serves as a base class of all
concrete reshaping compaction task classes.
2023-06-23 12:53:15 +02:00
Botond Dénes
320159c409 Merge 'Compaction group major compaction task' from Aleksandra Martyniuk
Task manager task covering compaction group major
compaction.

Uses multiple inheritance on already existing
major_compaction_task_executor to keep track of
the operation with task manager.

Closes #14271

* github.com:scylladb/scylladb:
  test: extend test_compaction_task.py
  test: use named variable for task tree depth
  compaction: turn major_compaction_task_executor into major_compaction_task_impl
  compaction: take gate holder out of task executor
  compaction: extend signature of some methods
  tasks: keep shared_ptr to impl in task
  compaction: rename compaction_task_executor methods
2023-06-22 08:15:17 +03:00
Tomasz Grabiec
36da062bcb db: Use table sharder in compaction 2023-06-21 00:58:24 +02:00
Aleksandra Martyniuk
74e5b4ebfc compaction: turn major_compaction_task_executor into major_compaction_task_impl
major_compaction_task_executor inherits both from compaction_task_executor
and major_compaction_task_impl.

Thanks to that an executed operation is represented in task manager.
2023-06-20 12:12:49 +02:00
Aleksandra Martyniuk
4922f4cf80 compaction: take gate holder out of task executor
In the following commits, classes deriving from compaction_task_executor
will be alive longer than they are kept in compaction_manager::_tasks.
Thus, the compaction_task_executor::_gate_holder would be held,
blocking other compactions.

compaction_task_executor::_gate_holder is moved outside of
compaction_task_executor object.
2023-06-20 12:12:45 +02:00
Aleksandra Martyniuk
e317ffe23a compaction: extend signature of some methods
Extend a signature of table::compact_all_sstables and
compaction_manager::perform_major_compaction so that they get
the info of a covering task.

This allows to easily create child tasks that cover compaction group
compaction.
2023-06-20 10:45:34 +02:00
Aleksandra Martyniuk
3007fbeee3 compaction: rename compaction_task_executor methods
compaction_task_executor methods are renamed to prevent name
colisions between compaction_task_executor
and tasks::task_manager::task::impl.
2023-06-20 10:45:34 +02:00
Pavel Emelyanov
5412c7947a backlog_controller: Unwrap scheduling_group
Some time ago (997a34bf8c) the backlog
controller was generalized to maintain some scheduling group. Back then
the group was the pair of seastar::scheduling_group and
seastar::io_priority_class. Now the latter is gone, so the controller's
notion of what sched group is can be relaxed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14266
2023-06-16 12:02:14 +03:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Raphael S. Carvalho
156d771101 compaction: Fix sstable cleanup after resharding on refresh
Problem can be reproduced easily:
1) wrote some sstables with smp 1
2) shut down scylla
3) moved sstables to upload
4) restarted scylla with smp 2
5) ran refresh (resharding happens, adds sstable to cleanup
set and never removes it)
6) cleanup (tries to cleanup resharded sstables which were
leaked in the cleanup set)

Bumps into assert "Assertion `!sst->is_shared()' failed", as
cleanup picks a shared sstable that was leaked and already
processed by resharding.

Fix is about not inserting shared sstables into cleanup set,
as shared sstables are restricted to resharding and cannot
be processed later by cleanup (nor it should because
resharding itself cleaned up its input files).

Dtest: https://github.com/scylladb/scylla-dtest/pull/3206

Fixes #14001.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14147
2023-06-06 12:14:03 +03:00
Benny Halevy
17795757d3 compaction_manager: compact_sstables: fix typo in log message about cleanup
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14151
2023-06-06 11:17:02 +03:00
Botond Dénes
80b944a9b8 Merge 'Table compaction tasks' from Aleksandra Martyniuk
Implementation of task_manager's tasks that cover major, cleanup,
offstrategy, and upgrade sstables compaction of one table.

Closes #13619

* github.com:scylladb/scylladb:
  test: extend compaction tasks test
  compaction: fix indentation
  compaction: create table_upgrade_sstables_compaction_task_impl
  compaction: create table_offstrategy_keyspace_compaction_task_impl
  compaction: create table_cleanup_keyspace_compaction_task_impl
  compaction: create table_major_keyspace_compaction_task_impl
  compaction: add helpers for table tasks scheduling
  compaction: add run_on_table
  compaction: pass std::string to run_on_existing_tables
2023-06-06 10:51:53 +03:00
Aleksandra Martyniuk
fecdd75cd6 compaction: fix indentation 2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
53c24c0f7d compaction: create table_upgrade_sstables_compaction_task_impl
Implementation of task_manager's task that covers upgrade sstables
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
143919cfa7 compaction: create table_offstrategy_keyspace_compaction_task_impl
Implementation of task_manager's task that covers offstrategy keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
55ef1c24e1 compaction: create table_cleanup_keyspace_compaction_task_impl
Implementation of task_manager's task that covers cleanup keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
5c7832ab59 compaction: create table_major_keyspace_compaction_task_impl
Implementation of task_manager's task that covers major keyspace
compaction of one table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
d0c4028d64 compaction: add helpers for table tasks scheduling
In shard compaction tasks per table tasks will be created all at once
and then they will wait for their turn to run.

A function that allows waking up tasks one after another and a function
that makes the task wait for its turn are added.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
6dacc45c70 compaction: add run_on_table
Extract code which runs a function on a particular table from
run_on_existing_tables to run_on_table.
2023-05-31 14:59:24 +02:00
Aleksandra Martyniuk
5c65ac00ef compaction: pass std::string to run_on_existing_tables
Keyspace argument passed to run_on_existing_tables has its type
changed from std::string_view to std::string.
2023-05-31 14:59:24 +02:00
Raphael S. Carvalho
23443e0574 compaction: Fix incremental compaction for sstable cleanup
After c7826aa910, sstable runs are cleaned up together.

The procedure which executes cleanup was holding reference to all
input sstables, such that it could later retry the same cleanup
job on failure.

Turns out it was not taking into account that incremental compaction
will exhaust the input set incrementally.

Therefore cleanup is affected by the 100% space overhead.

To fix it, cleanup will now have the input set updated, by removing
the sstables that were already cleaned up. On failure, cleanup
will retry the same job with the remaining sstables that weren't
exhausted by incremental compaction.

New unit test reproduces the failure, and passes with the fix.

Fixes #14035.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14038
2023-05-31 06:46:12 +03:00
Aleksandra Martyniuk
f48b57e7b9 compaction: use table_info in compaction tasks
Task manager compaction tasks need table names for logs.
Thus, compaction tasks store table infos instead of table ids.

get_table_ids function is deleted as it isn't used anywhere.
2023-05-30 09:58:55 +02:00
Aleksandra Martyniuk
24864e39dd compaction: delete unnecessary sequence number incrementations
Task manager's tasks that have parent task inherit sequence number
from their parents. Thus they do not need to have a new sequence number
generated as it will be overwritten anyway.

Closes #14045
2023-05-29 23:03:25 +03:00
Botond Dénes
3b424e391b Merge 'perform_cleanup: wait until all candidates are cleaned up' from Benny Halevy
cleanup_compaction should resolve only after all
sstables that require cleanup are cleaned up.

Since it is possible that some of them are in staging
and therefore cannot be cleaned up, retry once a second
until they become eligible.

Timeout if there is no progress within 5 minutes
to prevent hanging due to view building bug.

Fixes #9559

Closes #13812

* github.com:scylladb/scylladb:
  table: signal compaction_manager when staging sstables become eligible for cleanup
  compaction_manager: perform_cleanup: wait until all candidates are cleaned up
  compaction_manager: perform_cleanup: perform_offstrategy if needed
  compaction_manager: perform_cleanup: update_sstables_cleanup_state in advance
  sstable_set: add for_each_sstable_gently* helpers
2023-05-19 12:35:59 +03:00
Raphael S. Carvalho
38b226f997 Resurrect optimization to avoid bloom filter checks during compaction
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.

Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.

This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.

The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.

Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
    45.64%    45.64%  sstable_compact  sstable_compaction_test_g
        [.] utils::filter::bloom_filter::is_present

With its resurrection, the problem is gone.

This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.

Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])

[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.

After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).

With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #13908
2023-05-18 09:01:50 +03:00