Commit Graph

186 Commits

Author SHA1 Message Date
Benny Halevy
c89876c975 compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.

Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled as any other exeption, including closing
the reader.

Fixes #9766

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
2021-12-14 11:15:23 +02:00
Avi Kivity
e44a28dce4 Merge "compaction: Allow data from different buckets (e.g. windows) to be compacted together" from Raphael
"
Today, data from different buckets (e.g. windows) cannot be compacted together because
mutation compactor happens inside each consumer, where each consumer is done on behalf
of a particular bucket. To solve this problem, mutation compaction process is being
moved from consumer into producer, such that interposer consumer, which is responsible
for segregation, will be feeded with compacted data and forward it into the owner bucket.

Fixes #9662.

tests: unit(debug).
"

* 'compact_across_buckets_v2' of github.com:raphaelsc/scylla:
  tests: sstable_compaction_test: add test_twcs_compaction_across_buckets
  compaction: Move mutation compaction into producer for TWCS
  compaction: make enable_garbage_collected_sstable_writer() more precise
2021-12-12 15:07:15 +02:00
Raphael S. Carvalho
9b8aa1e9ae compaction: Move mutation compaction into producer for TWCS
If interposer is enabled, like the timestamp-based one for TWCS, data
from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer
will be belong to a different bucket.
To remove this limitation, let's move the mutation compactor from
consumer into producer, such that compacted data will be feeded into
the interposer, before it segregates data.
We're short-circuiting this logic if TWCS isn't in use as
compacting reader adds overhead to compaction, given that this reader
will pop fragments from combined sstable reader, compact them using
mutation_compactor and finally push them out to the underlying
reader.

without compacting reader (e.g. STCS + no interposer):
228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops)
224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops)
224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops)

with compacting reader (e.g. TWCS + interposer):
221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops)
216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops)
215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops)

So the cost of compacting data across buckets is ~3.5%, which happens
only with interposer enabled and GC writer disabled.

Fixes #9662.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 17:14:44 -03:00
Raphael S. Carvalho
484269cd8f compaction: make enable_garbage_collected_sstable_writer() more precise
we only want to enable GC writer if incremental compaction is required.
let's make it more precise by checking that size limit for sstable
isn't disabled, so GC writer will only be enabled for compaction
strategies that really need it. So strategies that don't need it
won't pay the penalty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 15:22:08 -03:00
Raphael S. Carvalho
e0758fded1 compaction_manager: make get_compaction_state() private
internal method that should never be directly used by the outside
world.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211210120806.19233-1-raphaelsc@scylladb.com>
2021-12-10 17:19:24 +03:00
Mikołaj Sielużycki
504efe0607 table: Prevent resurrecting data from memtable on compaction
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
2021-12-09 13:22:14 +01:00
Mikołaj Sielużycki
7ce0ca040d table: Add min_memtable_timestamp function to table 2021-12-09 13:14:38 +01:00
Botond Dénes
2e5440bdf2 Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho
Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too.

There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2.

tests: unit(debug).

Closes #9725

* github.com:scylladb/scylla:
  compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
  sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
2021-12-07 15:17:38 +02:00
Raphael S. Carvalho
2435bd14c6 compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:57 -03:00
Raphael S. Carvalho
c6399005a3 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:55 -03:00
Raphael S. Carvalho
aebbe68239 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:53 -03:00
Raphael S. Carvalho
c3c070a5ca sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:51 -03:00
Avi Kivity
395b30bca8 mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2
As part of the drive to move over to flat_mutation_reader_v2, update
make_filtering_reader(). Since it doesn't examine range tombstones
(only the partition_start, to filter the key) the entire patch
is just glue code upgrading and downgrading users in the pipeline
(or removing a conversion, in one case).

Test: unit (dev)

Closes #9723
2021-12-07 12:18:07 +02:00
Raphael S. Carvalho
6737c88045 compaction_manager: use single semaphore for serialization of maintenance compactions
We have three semaphores for serialization of maintenance ops.
1) _rewrite_sstables_sem: for scrub, cleanup and upgrade.
2) _major_compaction_sem: for major
3) _custom_job_sem: for reshape, resharding and offstrategy

scrub, cleanup and upgrade should be serialized with major,
so rewrite sem should be merged into major one.

offstrategy is also a maintenance op that should be serialized
with others, to reduce compaction aggressiveness and space
requirement.

resharding is one-off operation, so can be merged there too.
the same applies for reshape, which can take long and not
serializing it with other maintenance activity can lead to
exhaustion of resources and high space requirement.

let's have a single semaphore to guarantee their serialization.

deadlock isn't an issue because locks are always taken in same
order.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211201182046.100942-1-raphaelsc@scylladb.com>
2021-12-07 12:18:07 +02:00
Benny Halevy
cc122984d6 compaction: scrub: add quarantine_mode option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:29:04 +02:00
Benny Halevy
60ff28932c compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub
So we can pass additional options on top of the scrub mode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:21:37 +02:00
Benny Halevy
bbe275f37d compaction: scrub_sstables_validate_mode: quarantine invalid sstables
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.

Refs #7658

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:14:16 +02:00
Benny Halevy
13e7b00f2e sstables: add is_quarantined
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
07c5ddf182 sstables: add is_eligible_for_compaction
Currently compaction_manager tracks sstables
based on !requires_view_building() and similarly,
table::in_strategy_sstables picks up only sstables
that are not in staging.

is_eligible_for_compaction() generalizes this condition
in preparation for adding a quarantine subdirectory for
invalid sstables that should not be compacted as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Raphael S. Carvalho
2f9f089eda compaction_strategy: kill unused compaction_strategy_type::major
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:27:10 -03:00
Raphael S. Carvalho
0e3d388ebb compaction: Log skip of fully expired sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:25:48 -03:00
Raphael S. Carvalho
9725e5efa9 compaction_strategy: kill unused can_compact_partial_runs()
This strategy method was introduced unnecessarily. We assume it was
going to be needed, but turns out it was never needed, not even
for ICS. Also it's built on a wrong assumption as an output
sstable run being generated can never be compacted in parallel
as the non-overlapping requirement can be easily broken.
LCS for example can allow parallel compaction on different runs
(levels) but correctness cannto be guaranteed with same runs
are compacted in parallel.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:20:51 -03:00
Raphael S. Carvalho
7a7a2467fa compaction: kill useless on_skipped_expired_sstable()
It was introduced by commit 5206a97915 because fully expired sstable
wouldn't be registed and therefore could be never removed from backlog
tracker. This is no longer possible as table is now responsible for
removing all input sstables. So let's kill on_skipped_expired_sstable()
as it's now only boilerplate we don't need.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:29 -03:00
Raphael S. Carvalho
32c2534e91 compaction: merge _total_input_sstables and _ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:23 -03:00
Raphael S. Carvalho
4a02e312f6 compaction: increase disjoint tolerance in TWCS reshape
When reshaping TWCS table in relaxed mode, which is the case for
offstrategy and boot, disjoint tolerance is too strict, which can
lead those processes to do more work than needed.
Let's increase the tolerance to max threshold, which will limit the
amount of sstables opened in compaction to a reasonable amount.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130132538.56285-1-raphaelsc@scylladb.com>
2021-12-03 06:38:42 +02:00
Raphael S. Carvalho
6d750d4f59 compaction_manager: move check_for_cleanup into perform_cleanup()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00
Raphael S. Carvalho
9aed7e9d67 compaction_manager: replace get_total_size by one liner
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00
Raphael S. Carvalho
760cfd93fb compaction_manager: make consistent usage of type and name table
new code in manager adopted name and type table, whereas historical
code still uses name and type column family. let's make it consistent
for newcomers to not get confused.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:27 -03:00
Raphael S. Carvalho
e460f72250 compaction_manager: simplify rewrite_sstables()
as rewrite_sstables() switched to coroutine, it can be simplified
by not using smart pointers to handle lifetime issues.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 08:15:41 -03:00
Raphael S. Carvalho
48124fc15a compaction_manager: restore indentation
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 08:15:38 -03:00
Raphael S. Carvalho
f23e0d7f2d compaction_manager: Disconsider inactive tasks when filtering sstables
After commit 1f5b17f, overlapping can be introduced in level 1 because
procedure that filters out sstables from partial runs is considering
inactive tasks, so L1 sstables can be incorrectly filtered out from
next compaction attempt. When L0 is merged into L1, overlapping is
then introduced in L1 because old L1 sstables weren't considered in
L0 -> L1 compaction.

From now on, compaction_manager::get_candidates() will only consider
active tasks, to make sure actual partial runs are filtered out.

Fixes #9693.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129180459.125847-1-raphaelsc@scylladb.com>
2021-12-01 16:11:44 +02:00
Raphael S. Carvalho
9de7abdc80 compaction: LCS: Fix inefficiency when pushing SSTables to higher levels
To satisfy backlog controller, commit 28382cb25c changed LCS to
incrementally push sstables to highest level *when* there's nothing else
to be done.

That's overkill because controller will be satisfied with level L being
fanout times larger than L-1. No need to push everything to last level as
it's even worse than a major, because any file being promoted will
overlap with ~10 files in next level. At least, the cost is amortized by
multiple iterations, but terrible write amplification is still there.
Consequently, this reduces overall efficiency.
For example, it might happen that LCS in table A start pushing everything
to highest level, when table B needs resources for compaction to reduce its
backlog. Increased write amplification in A may prevent other tables
from reducing their backlog in a timely manner.

It's clear that LCS should stop promoting as soon as level L is 10x
larger than L-1, so strategy will still be satisfied while fixing the
inefficiency problem.

Now layout will look like as follow:
SSTables in each level: [0, 2, 15, 121]

Previously, it looked like once table stopped being written to:
SSTables in each level: [0, 0, 0, 138]

It's always good to have everything in a single run, but that comes
with a high write amplification cost which we cannot afford in steady
state. With this change, the layout will still be good enough to make
everybody happy.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129143606.71257-1-raphaelsc@scylladb.com>
2021-12-01 16:10:25 +02:00
Avi Kivity
03755b362a Merge 'compaction_manager api: stop ongoing compactions' from Benny Halevy
This series extends `compaction_manager::stop_ongoing_compaction` so it can be used from the api layer for:
- table::disable_auto_compaction
- compaction_manager::stop_compaction

Fixes #9313
Fixes #9695

Test: unit(dev)

Closes #9699

* github.com:scylladb/scylla:
  compaction_manager: stop_compaction: wait for ongoing compactions to stop
  compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
  compaction_manager: unify stop_ongoing_compactions implementations
  compaction_manager: stop_ongoing_compactions: add compaction_type option
  compaction_manager: get_compactions: get a table* parameter
  table: disable_auto_compaction: stop ongoing compactions
  compaction_manager: make stop_ongoing_compactions public
  table: futurize disable_auto_compactions
2021-11-30 19:08:14 +02:00
Benny Halevy
957003e73f compaction_manager: stop_compaction: wait for ongoing compactions to stop
Similar to #9313, stop_compaction should also reuse the
stop_ongoing_comapctions() infrastructure and wait on ongoing
compactions of the given type to stop.

Fixes #9695

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
b9ba181d3c compaction_manager: stop_ongoing_compactions: log Stopping 0 tasks at debug level
Normally, "Stopping 0 tasks for 0 ongoing compactions for table ..."
is not very interesting so demote its log_level to debug.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:11 +02:00
Benny Halevy
03e969dbef compaction_manager: unify stop_ongoing_compactions implementations
Now stop_ongoing_compactions(reason) is equivalent to
to stop_ongoing_compactions(reason, nullptr, std::nullopt)
so share the code of the latter for the former entry point.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:09:07 +02:00
Benny Halevy
94011bdcca compaction_manager: stop_ongoing_compactions: add compaction_type option
And make the table optional as well, so it can be used
by stop_compaction() to a particular compaction type on all tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:07:47 +02:00
Benny Halevy
a419759835 compaction_manager: get_compactions: get a table* parameter
Optionally get running compaction on the provided table.
This is required for stop_ongoing_compactions on a given table.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:34 +02:00
Benny Halevy
3c721eb228 compaction_manager: make stop_ongoing_compactions public
So it can be used directly by table code
in the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-30 16:06:29 +02:00
Raphael S. Carvalho
3006394312 compaction: Allow incremental compaction with interposer consumer
Until commit c94e6f8567, interposer consumer wouldn't work
with our GC writer, needed for incremental compaction correctness.
Now that the technical debt is gone, let's allow incremental
compaction with interposer consumer.

The only change needed is serialization of replacer as two
consumers cannot step on each toe, like when we have concurrent
bucket writers with TWCS.

sstable_compaction_test.test_bug_6472 passes with this change,
which was added when #6472 was fixed by not allowing incremental
compaction with interposer consumer.

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>
2021-11-30 15:24:17 +02:00
Raphael S. Carvalho
80a1ebf0f3 compaction_manager: Fix race when selecting sstables for rewrite operations
Rewrite operations are scrub, cleanup and upgrade.

Race can happen because 'selection of sstables' and 'mark sstables as
compacting' are decoupled. So any deferring point in between can lead
to a parallel compaction picking the same files. After commit 2cf0c4bbf,
files are marked as compacting before rewrite starts, but it didn't
take into account the commit c84217ad which moved retrieval of
candidates to a deferring thread, before rewrite_sstables() is even
called.

Scrub isn't affected by this because it uses a coarse grained approach
where whole operation is run with compaction disabled, which isn't good
because regular compaction cannot run until its completion.

From now on, selection of files and marking them as compacting will
be serialized by running them with compaction disabled.

Now cleanup will also retrieve sstables with compaction disabled,
meaning it will no longer leave uncleaned files behind, which is
important to avoid data resurrection if node regains ownership of
data in uncleaned files.

Fixes #8168.
Refs #8155.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211129133107.53011-1-raphaelsc@scylladb.com>
2021-11-29 16:27:29 +02:00
Benny Halevy
0a33762fb1 compaction_manager: add compaction_state when table is constructed
With that, it is always expected that _compaction_state[cf]
exists when compaction jobs are submnitted.

Otherwise, throw std::out_of_range exception.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
29dd24ab46 compaction_manager: remove: fixup indentation
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
46ac139490 compaction_manager: remove: detach compaction_state before stopping ongoing compactions
So that the compaction_state won't be found from this point on,
while stopping the ongoing compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
75a2509b07 compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
Now that compaction tasks enter the compaction_state gate there is
no point in stopping ongoing compaction in parallel to closing the gate.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
3940ffb085 compaction_manager: task: keep a reference on compaction_state
And hold its gate to make sure the compaction_state outlives
the task and can be used to wait on all tasks and functions
using it.

With that, doing access _compaction_state[cf] to acquire
shared/exclusive locks but rather get to it via
task->compaction_state so it can be detached from
_compaction_state while task is running, if needed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-23 09:40:06 +02:00
Benny Halevy
e7ab1f8581 compaction_manager: compaction_state: use counter for compaction_disabled
We'd like to use compaction_state::gate both for functions
running with compaction disabled and for and tasks referring
to the compaction_state so that stop_ongoing_compactions
could wait on all functions referring to the state structure.

This is also cleaner with respect to not relying on
gate::use_count() when re-submitting regular compaction
when compaction is re-enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:08:42 +02:00
Benny Halevy
3268c94e72 compaction_manager: task: delete move and copy constructors
We use a lw_shared_ptr<task> everywhere.
So prevent moving or copying task objects.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00
Benny Halevy
0cc6060552 compaction_manager: add per-task debug log messages
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00
Benny Halevy
1d8d472028 compaction_manager: stop_ongoing_compactions: log number of tasks to stop
get_compactions().size() may return 0 while there are
non-zero tasks to stop.

Some tasks may not be marked as `compaction_running` since
they are either:
- postponed (due to compaction manger throttling of regular compaction)
- sleeping before retry.

In both cases we still want to stop them so the log message
should reflect both the number of ongoing compactions
and the actual number of tasks we're stopping.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-11-22 22:00:18 +02:00