Commit Graph

96 Commits

Author SHA1 Message Date
Raphael S. Carvalho
5d654a6b9a compaction: don't copy owned ranges in cleanup ctor
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220119142322.39791-1-raphaelsc@scylladb.com>
2022-01-20 14:05:58 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
a7f4ab6b14 compaction/compaction: remove v1 version of validate and scrub reader factory methods 2022-01-14 10:19:56 +02:00
Botond Dénes
d57634ad46 compaction: use v2 version of mutation_writer::segregate_by_partition() 2022-01-14 08:54:26 +02:00
Botond Dénes
b315d17c2a compaction: migrate scrub and validate to v2
We add v2 version of external API but leave the old v1 in place to help
incremental migration. The implementation is migrated to v2.
2022-01-14 08:54:26 +02:00
Botond Dénes
15d8ea983e compaction: upgrade compaction::make_interposer_consumer() to v2
Almost all (except the scrub one) actual interposer consumers are v2.
2022-01-07 13:52:14 +02:00
Botond Dénes
aa3c943f4c mutation_reader: remove unecessary stable_flattened_mutations_consumer
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
2022-01-07 13:52:07 +02:00
Botond Dénes
1ba19c2aa4 compaction/compaction_strategy: convert make_interposer_consumer() to v2
The underlying timestamp-based splitter is v2 already.
2022-01-07 13:51:59 +02:00
Botond Dénes
0601a465a2 mutation_writer: migrate shard_based_splitting_writer to v2 2022-01-07 13:48:53 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Raphael S. Carvalho
e05859c3f9 compaction: kill unused code for resharding_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Raphael S. Carvalho
d1f2fd7f03 compaction: rename compacting_sstable_writer to compacted_fragments_writer
the name compacting_sstable_writer is misleading as it doesn't perform
any compaction. let's rename it to a name that reflects more what it
does.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Benny Halevy
c89876c975 compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.

Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled as any other exeption, including closing
the reader.

Fixes #9766

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
2021-12-14 11:15:23 +02:00
Raphael S. Carvalho
9b8aa1e9ae compaction: Move mutation compaction into producer for TWCS
If interposer is enabled, like the timestamp-based one for TWCS, data
from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer
will be belong to a different bucket.
To remove this limitation, let's move the mutation compactor from
consumer into producer, such that compacted data will be feeded into
the interposer, before it segregates data.
We're short-circuiting this logic if TWCS isn't in use as
compacting reader adds overhead to compaction, given that this reader
will pop fragments from combined sstable reader, compact them using
mutation_compactor and finally push them out to the underlying
reader.

without compacting reader (e.g. STCS + no interposer):
228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops)
224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops)
224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops)

with compacting reader (e.g. TWCS + interposer):
221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops)
216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops)
215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops)

So the cost of compacting data across buckets is ~3.5%, which happens
only with interposer enabled and GC writer disabled.

Fixes #9662.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 17:14:44 -03:00
Raphael S. Carvalho
484269cd8f compaction: make enable_garbage_collected_sstable_writer() more precise
we only want to enable GC writer if incremental compaction is required.
let's make it more precise by checking that size limit for sstable
isn't disabled, so GC writer will only be enabled for compaction
strategies that really need it. So strategies that don't need it
won't pay the penalty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 15:22:08 -03:00
Mikołaj Sielużycki
504efe0607 table: Prevent resurrecting data from memtable on compaction
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
2021-12-09 13:22:14 +01:00
Botond Dénes
2e5440bdf2 Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho
Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too.

There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2.

tests: unit(debug).

Closes #9725

* github.com:scylladb/scylla:
  compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
  sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
2021-12-07 15:17:38 +02:00
Raphael S. Carvalho
2435bd14c6 compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:57 -03:00
Raphael S. Carvalho
c6399005a3 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:55 -03:00
Raphael S. Carvalho
aebbe68239 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:53 -03:00
Raphael S. Carvalho
c3c070a5ca sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:51 -03:00
Avi Kivity
395b30bca8 mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2
As part of the drive to move over to flat_mutation_reader_v2, update
make_filtering_reader(). Since it doesn't examine range tombstones
(only the partition_start, to filter the key) the entire patch
is just glue code upgrading and downgrading users in the pipeline
(or removing a conversion, in one case).

Test: unit (dev)

Closes #9723
2021-12-07 12:18:07 +02:00
Benny Halevy
cc122984d6 compaction: scrub: add quarantine_mode option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:29:04 +02:00
Benny Halevy
bbe275f37d compaction: scrub_sstables_validate_mode: quarantine invalid sstables
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.

Refs #7658

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:14:16 +02:00
Benny Halevy
13e7b00f2e sstables: add is_quarantined
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
07c5ddf182 sstables: add is_eligible_for_compaction
Currently compaction_manager tracks sstables
based on !requires_view_building() and similarly,
table::in_strategy_sstables picks up only sstables
that are not in staging.

is_eligible_for_compaction() generalizes this condition
in preparation for adding a quarantine subdirectory for
invalid sstables that should not be compacted as well.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Raphael S. Carvalho
0e3d388ebb compaction: Log skip of fully expired sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:25:48 -03:00
Raphael S. Carvalho
7a7a2467fa compaction: kill useless on_skipped_expired_sstable()
It was introduced by commit 5206a97915 because fully expired sstable
wouldn't be registed and therefore could be never removed from backlog
tracker. This is no longer possible as table is now responsible for
removing all input sstables. So let's kill on_skipped_expired_sstable()
as it's now only boilerplate we don't need.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:29 -03:00
Raphael S. Carvalho
32c2534e91 compaction: merge _total_input_sstables and _ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:23 -03:00
Raphael S. Carvalho
3006394312 compaction: Allow incremental compaction with interposer consumer
Until commit c94e6f8567, interposer consumer wouldn't work
with our GC writer, needed for incremental compaction correctness.
Now that the technical debt is gone, let's allow incremental
compaction with interposer consumer.

The only change needed is serialization of replacer as two
consumers cannot step on each toe, like when we have concurrent
bucket writers with TWCS.

sstable_compaction_test.test_bug_6472 passes with this change,
which was added when #6472 was fixed by not allowing incremental
compaction with interposer consumer.

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>
2021-11-30 15:24:17 +02:00
Raphael S. Carvalho
06405729ce compaction: stop including database.hh
after switching to table_state, compaction code can finally stop
including database.hh

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:03 -03:00
Raphael S. Carvalho
69ab5c9dff compaction: switch to table_state in get_fully_expired_sstables()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:02 -03:00
Raphael S. Carvalho
d89edad9fb compaction: switch to table_state
Make compaction procedure switch to table_state. Only function in
compaction.cc still directly using table is
get_fully_expired_sstables(T,...), but subsequently we'll make it
switch to table_state and then we can finally stop including database.hh
in the compaction code.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-19 22:06:01 -03:00
Raphael S. Carvalho
c94e6f8567 compaction: Merge GC writer into regular compaction writer
Turns out most of regular writer can be reused by GC writer, so let's
merge the latter into the former. We gain a lot of simplification,
lots of duplication is removed, and additionally, GC writer can now
be enabled with interposer as it can be created on demand by
each interposer consumer (will be done in a later patch).

Refs #6472.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211119120841.164317-1-raphaelsc@scylladb.com>
2021-11-19 14:19:50 +02:00
Raphael S. Carvalho
4b1bb26d5a compaction: Make maybe_replace_exhausted_sstables_by_sst() more robust
Make it more robust by tracking both partial and sealed sstables.
This way, maybe_r__e__s__by_sst() won't pick partial sstables as
part of incremental compaction. It works today because interposer
consumer isn't enabled with incremental compaction, so there's
a single consumer which will have sealed the sstable before
the function for early replacement is called, but the story is
different if both is enabled.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211117135817.16274-1-raphaelsc@scylladb.com>
2021-11-17 17:21:53 +02:00
Avi Kivity
bc75e2c1d1 treewide: wrap runtime formats with fmt::runtime for fmt 8
fmt 8 checks format strings at compile time, and requires that
non-compile-time format strings be wrapped with fmt::runtime().

Do that, and to allow coexistence with fmt 7, supply our own
do-nothing version of fmt::runtime() if needed. Strictly speaking
we shouldn't be introducing names into the fmt namespace, but this
is transitional only.

Closes #9640
2021-11-17 15:21:36 +02:00
Raphael S. Carvalho
29df862f57 compaction: make table param of get_fully_expired_sstables() const
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-09 10:41:54 -03:00
Raphael S. Carvalho
5af9a690c1 compaction: move incremental_owned_ranges_checker into cleanup_compaction
let's move checker into cleanup as it's not needed elsewhere.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:49:44 -03:00
Raphael S. Carvalho
04ef2124c6 compaction: make owned ranges const in cleanup_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:47:12 -03:00
Raphael S. Carvalho
d86c2491d4 compaction: replace outdated comment in regular_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:45:34 -03:00
Raphael S. Carvalho
b344db1696 compaction: give a more descriptive name to compaction_data
info is no longer descriptive, as compaction now works with
compaction_data instead of compaction_info.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-11-04 09:43:08 -03:00
Raphael S. Carvalho
ab0217e30e compaction: Improve overall efficiency by not diluting it with relatively inefficient jobs
Compaction efficiency can be defined as how much backlog is reduced per
byte read or written.
We know a few facts about efficiency:
1) the more files are compacted together (the fan-in) the higher the
efficiency will be, however...
2) the bigger the size difference of input files the worse the
efficiency, i.e. higher write amplification.

so compactions with similar-sized files are the most efficient ones,
and its efficiency increases with a higher number of files.

However, in order to not have bad read amplification, number of files
cannot grow out of bounds. So we have to allow parallel compaction
on different tiers, but to avoid "dilution" of overall efficiency,
we will only allow a compaction to proceed if its efficiency is
greater than or equal to the efficiency of ongoing compactions.

By the time being, we'll assume that strategies don't pick candidates
with wildly different sizes, so efficiency is only calculated as a
function of compaction fan-in.

Now when system is under heavy load, then fan-in threshold will
automatically grow to guarantee that overall efficiency remains
stable.

Please note that fan-in is defined in number of runs. LCS compaction
on higher levels will have a fan-in of 2. Under heavy load, it
may happen that LCS will temporarily switch to size-tiered mode
for compaction to keep up with amount of data being produced.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211103215110.135633-2-raphaelsc@scylladb.com>
2021-11-03 20:03:23 +02:00
Botond Dénes
6ad0a2989c compaction/scrub: segregate input only in segregate mode
scrub_compaction assumes that `make_interposer_consumer()` is called
only when `use_interposer_consumer()` returns true. This is false, so in
effect scrub always ends up using the segregating interposer. Fix this
by short-circuiting the former method when the latter returns true,
returning the passed-in consumer unchanged.

Tests: unit(dev)

Fixes #9541

Closes #9564
2021-11-02 15:25:22 +02:00
Botond Dénes
eaf4454ac8 compaction: scrub_compaction: add bucket count to finish message
It is useful to know how many buckets (output sstables) scrub produced
in total. The end compaction message will only report those still open
when the scrub finished, but will omit those that were closed in the
middle.
2021-11-02 12:24:37 +02:00
Botond Dénes
f2f529855d compaction,test: use the new in-memory segregator for scrub 2021-11-02 09:00:44 +02:00
Benny Halevy
5483269dfb compaction_manager: pass owned_ranges via cleanup/upgrade options
So they can be easily computed using an async task
before constructing the compaction object
in a following patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-10-13 14:17:46 +03:00
Botond Dénes
cc65c9d0da compaction: scrub/segregate: adjust partition-estimate as buckets accumulate
Scrub compaction in segregate mode can split the input sstable into as
many as hundreds or even thousands of output sstables in the extreme
case. But even at a few dozen output sstables, most of these will only
have a few partitions with a few rows. These sstables however will still
have their bloom filter allocated according to the original
partition-count estimate, causing memory bloat or even OOM in the
extreme case.
This patch solves this by aggressively adjusting the partition count
downwards after the second bucket has been created. Each subsequent
bucket will halve the partition estimate, which will quickly reach 1.

Fixes: #9463

Closes #9464
2021-10-12 12:44:42 +03:00
Avi Kivity
1bac93e075 Merge "simplifications and layer violation fix for compaction manager" from Raphael
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."

* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
  compaction: split compaction info and data for control
  compaction_manager: use task when stopping a given compaction type
  compaction: remove start_size and end_size from compaction_info
  compaction_manager: introduce helpers for task
  compaction_manager: introduce explicit ctor for task
  compaction: kill sstables field in compaction_info
  compaction: kill table pointer in compaction_info
  compaction: simplify procedure to stop ongoing compactions
  compaction: move management of compaction_info to compaction_manager
  compaction: move output run id from compaction_info into task
2021-10-04 13:09:31 +03:00
Botond Dénes
61e7d3de90 Merge 'Cleanup compaction_stop_exception' from Benny Halevy
The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception`
and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown
or api event vs. compaction aborting due to scrub validation error.

While at it, cleanup the existing retry logic related to `compaction_stop_exception`.

Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)

Closes #9321

* github.com:scylladb/scylla:
  compaction: split compaction_aborted_exception from compaction_stopped_exception
  compaction_manager: maybe_stop_on_error: rely on retry=false default
  compaction_manager: maybe_stop_on_error: sync return value with error message.
  compaction: drop retry parameter from compaction_stop_exception
  compaction_manager: move errors stats accounting to maybe_stop_on_error
2021-10-04 07:27:11 +03:00
Raphael S. Carvalho
9067a13eac compaction: split compaction info and data for control
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-30 13:16:57 -03:00