209 Commits

Author SHA1 Message Date
Botond Dénes
0f60cc84f4 Merge 'replica: create a replica module' from Avi Kivity
Move the ::database, ::keyspace, and ::table classes to a new replica
namespace and replica/ directory. This designates objects that only
have meaning on a replica and should not be used on a coordinator
(but note that not all replica-only classes should be in this module,
for example compaction and sstables are lower-level objects that
deserve their own modules).

The module is imperfect - some additional classes like distributed_loader
should also be moved, but there is only one way to untie Gordian knots.

Closes #9872

* github.com:scylladb/scylla:
  replica: move ::database, ::keyspace, and ::table to replica namespace
  database: Move database, keyspace, table classes to replica/ directory
2022-01-07 13:37:40 +02:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Raphael S. Carvalho
07fba4ab5d compaction_manager: Abort reshape for tables waiting for a chance to run
Tables waiting for a chance to run reshape wouldn't trigger stop
exception, as the exception was only being triggered for ongoing
compactions. Given that stop reshape API must abort all ongoing
tasks and all pending ones, let's change run_custom_job() to
trigger the exception if it found that the pending task was
asked to stop.

Tests:
dtest: compaction_additional_test.py::TestCompactionAdditional::test_stop_reshape_with_multiple_keyspaces
unit: dev

Fixes #9836.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223002157.215571-1-raphaelsc@scylladb.com>
2022-01-06 18:04:16 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Raphael S. Carvalho
4c28c49bc7 compaction_manager: make return of maybe_stop_on_error less confusing
maybe_stop_on_error() is confusing because it returns true if the task
can be retried which goes in opposite direction of its semantics.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220106143233.459903-1-raphaelsc@scylladb.com>
2022-01-06 16:39:15 +02:00
Avi Kivity
2e958b3555 Merge "Coroutinization of compaction sstable rewrite procedure" from Raphael
"
Completes coroutinization of rewrite_sstables().

tests: UNIT(debug)
"

* 'rewrite_sstable_coroutinization' of https://github.com/raphaelsc/scylla:
  compaction_manager: coroutinize main loop in sstable rewrite procedure
  compaction_manager: coroutinize exception handling in sstable rewrite procedure
  compaction_manager: mark task::finish_compaction() as noexcept
  compaction_manager: make maybe_stop_on_error() more flexible
2022-01-05 10:15:19 +02:00
Benny Halevy
e0a351e0c6 compaction_manager: stop_compaction: disallow specific types
We can stop only specific compaction types.

Reshard should be excluded since it mustn't be stopped.

And other types of compaction types like "VALIDATION" or "INDEX_BUILD"
are valid in terms of their syntax but unsupported by scylla so we better
return an error rather than appear to support them.
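
The validation described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual Scylla code: the function name `validate_stop_type` and the contents of `STOPPABLE` are assumptions, and only RESHARD, VALIDATION and INDEX_BUILD are named in the message itself.

```python
# Hypothetical set of types that may be stopped; the commit message does not
# enumerate them, so this list is an assumption for illustration only.
STOPPABLE = frozenset({"COMPACTION", "CLEANUP", "SCRUB", "UPGRADE", "RESHAPE"})
# Named in the message: reshard must never be stopped.
UNSTOPPABLE = frozenset({"RESHARD"})
# Named in the message: syntactically valid but unsupported types.
UNSUPPORTED = frozenset({"VALIDATION", "INDEX_BUILD"})


def validate_stop_type(type_name: str) -> str:
    """Return type_name if it may be stopped, otherwise raise ValueError."""
    if type_name in STOPPABLE:
        return type_name
    if type_name in UNSTOPPABLE:
        raise ValueError(f"compaction type {type_name} cannot be stopped")
    if type_name in UNSUPPORTED:
        raise ValueError(f"compaction type {type_name} is unsupported")
    raise ValueError(f"unknown compaction type {type_name!r}")
```

Returning an explicit error for each rejected category keeps the API from appearing to support a type it silently ignores.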

Test: unit(dev)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222133449.2177746-1-bhalevy@scylladb.com>
2022-01-05 09:32:20 +02:00
Raphael S. Carvalho
f0b816d8e8 compaction_manager: coroutinize main loop in sstable rewrite procedure
with this patch, rewrite_sstables() is now fully coroutinized.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 16:03:23 -03:00
Raphael S. Carvalho
c85ba1e694 compaction_manager: coroutinize exception handling in sstable rewrite procedure
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:39:54 -03:00
Raphael S. Carvalho
59a65742f9 compaction_manager: mark task::finish_compaction() as noexcept
As it's intended to be used in a deferred action.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:30:04 -03:00
Raphael S. Carvalho
3fe4c2e517 compaction_manager: make maybe_stop_on_error() more flexible
It's hard to integrate maybe_stop_on_error() with coroutines as it
accepts a resolved future, not an exception pointer. Let's adjust
its interface, making it more flexible to work with.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-01-04 15:28:30 -03:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of keeping
the database consistent onto the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
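
The four modes and the propagation delay can be pictured with the following Python sketch. It is a simplified model, not the patch's implementation: the names `TombstoneGcMode` and `can_gc_tombstone` are hypothetical, and the rule used for 'repair' mode (the tombstone must be older than the last repair time minus the propagation delay) is an assumption derived from the description above.

```python
from enum import Enum
from typing import Optional


class TombstoneGcMode(Enum):
    TIMEOUT = "timeout"      # GC after gc_grace_seconds (the default)
    DISABLED = "disabled"    # never GC
    IMMEDIATE = "immediate"  # GC right away
    REPAIR = "repair"        # GC only after the covering range was repaired


def can_gc_tombstone(mode: TombstoneGcMode,
                     deletion_time: int,
                     now: int,
                     gc_grace_seconds: int,
                     last_repair_time: Optional[int] = None,
                     propagation_delay_in_seconds: int = 0) -> bool:
    """Decide whether a tombstone written at deletion_time may be purged.

    All times are seconds since an arbitrary epoch.
    """
    if mode is TombstoneGcMode.DISABLED:
        return False
    if mode is TombstoneGcMode.IMMEDIATE:
        return True
    if mode is TombstoneGcMode.TIMEOUT:
        return deletion_time + gc_grace_seconds <= now
    # REPAIR mode (assumed rule): a range that was never repaired keeps its
    # tombstones; otherwise leave room for writes that were still in flight
    # when the repair ran.
    if last_repair_time is None:
        return False
    return deletion_time < last_repair_time - propagation_delay_in_seconds
```

In this model, a tombstone in 'repair' mode survives indefinitely until a repair covers its range, which is what removes the dependency on running repair within gc_grace_seconds.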

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Raphael S. Carvalho
ad82ede5f3 compaction: simplify rewrite_sstables() with coroutine
rewrite_sstables() is terribly nested, making it hard to read.
as usual, can be nicely simplified with coroutines.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211223135012.56277-1-raphaelsc@scylladb.com>
2021-12-26 14:10:52 +02:00
Raphael S. Carvalho
e05859c3f9 compaction: kill unused code for resharding_compaction
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Raphael S. Carvalho
d1f2fd7f03 compaction: rename compacting_sstable_writer to compacted_fragments_writer
the name compacting_sstable_writer is misleading as it doesn't perform
any compaction. let's rename it to a name that reflects more what it
does.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>
2021-12-20 18:21:31 +02:00
Botond Dénes
55bb70a878 Merge "Make sure TWCS per-window major includes all files" from Raphael
"
TWCS performs STCS on a window as long as it's the most recent one.
From there on, TWCS will compact all files in the past window into
a single file. With some moderate write load, it could happen that
there's still some compaction activity in that past window, meaning
that per-window major may miss some files being currently compacted.
As a result, a past window may contain more than 1 file after all
compaction activity is done on its behalf, which may increase read
amplification. To avoid that, TWCS will now make sure that per-window
major is serialized, to make sure no files are missed.

Fixes #9553.

tests: unit(dev).
"

* 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla:
  TWCS: Make sure major on past window is done on all its sstables
  TWCS: remove needless param for STCS options
  TWCS: kill unused param in newest_bucket()
  compaction: Implement strategy control and wire it
  compaction: Add interface to control strategy behavior.
2021-12-20 17:12:50 +02:00
Nadav Har'El
252ce8afd4 Merge 'Extend stop compaction api' from Benny Halevy
Allow stopping compaction by type on a given keyspace and list of tables.

Also add api unit test suite that tests the existing `stop_compaction` api
and the new `stop_keyspace_compaction` api.

Fixes #9700

Closes #9746

* github.com:scylladb/scylla:
  api: storage_service: validate_keyspace: improve exception error message
  api: compaction_manager: add stop_keyspace_compaction
  api: storage_service: expose validate_keyspace and parse_tables
  api: compaction_manager: stop_compaction: fix type description
  compaction_manager: stop_compaction: expose optional table*
  test: api: add basic compaction_manager test
2021-12-20 00:18:46 +02:00
Benny Halevy
c89876c975 compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.

Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled as any other exception, including closing
the reader.

Fixes #9766

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
2021-12-14 11:15:23 +02:00
Raphael S. Carvalho
8eace8fc49 TWCS: Make sure major on past window is done on all its sstables
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.

Fixes #9553.

Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:10:43 -03:00
Raphael S. Carvalho
2dc890d8e6 TWCS: remove needless param for STCS options
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:40 -03:00
Raphael S. Carvalho
41a5736aaf TWCS: kill unused param in newest_bucket()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:36 -03:00
Raphael S. Carvalho
49f40c8791 compaction: Implement strategy control and wire it
This implements the strategy control interface for both manager and
tests, and wires it.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 16:05:23 -03:00
Raphael S. Carvalho
6d9466052e compaction: Add interface to control strategy behavior.
This interface is akin to table_state, but compaction manager's
representative instead.
It will allow compaction manager to set goals and constraints on
compaction strategies. It will start by allowing strategy to know
if there's ongoing compaction, which is useful for virtually all
strategies. For example, LCS may want to compact L0 in parallel
with higher levels, to avoid L0 falling behind.
This interface can be easily extended to allow manager to switch
to a reclaim mode, if running out of space, etc.
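
A rough Python sketch of such an interface, under stated assumptions: the names `StrategyControl`, `ManagerControl` and `should_run_past_window_major` are hypothetical, and the TWCS-style decision is only one illustration of how a strategy might consume the "is there ongoing compaction" signal.

```python
from typing import Protocol


class StrategyControl(Protocol):
    """What a compaction strategy is allowed to ask the manager (sketch)."""

    def has_ongoing_compaction(self) -> bool:
        ...


class ManagerControl:
    """Hypothetical manager-side implementation backed by a simple counter."""

    def __init__(self, ongoing_compactions: int) -> None:
        self._ongoing = ongoing_compactions

    def has_ongoing_compaction(self) -> bool:
        return self._ongoing > 0


def should_run_past_window_major(control: StrategyControl,
                                 sstables_in_window: int) -> bool:
    """Illustrative strategy decision: postpone a per-window major while
    other compactions are running, so no sstable of that window is missed."""
    if sstables_in_window <= 1:
        return False  # nothing to merge
    return not control.has_ongoing_compaction()
```

Keeping the interface this narrow lets tests supply their own implementation, and leaves room for later additions such as a reclaim mode.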

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-13 15:55:37 -03:00
Avi Kivity
e44a28dce4 Merge "compaction: Allow data from different buckets (e.g. windows) to be compacted together" from Raphael
"
Today, data from different buckets (e.g. windows) cannot be compacted together because
mutation compactor happens inside each consumer, where each consumer is done on behalf
of a particular bucket. To solve this problem, mutation compaction process is being
moved from consumer into producer, such that interposer consumer, which is responsible
for segregation, will be fed compacted data and forward it into the owner bucket.

Fixes #9662.

tests: unit(debug).
"

* 'compact_across_buckets_v2' of github.com:raphaelsc/scylla:
  tests: sstable_compaction_test: add test_twcs_compaction_across_buckets
  compaction: Move mutation compaction into producer for TWCS
  compaction: make enable_garbage_collected_sstable_writer() more precise
2021-12-12 15:07:15 +02:00
Raphael S. Carvalho
9b8aa1e9ae compaction: Move mutation compaction into producer for TWCS
If interposer is enabled, like the timestamp-based one for TWCS, data
from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer
will belong to a different bucket.
To remove this limitation, let's move the mutation compactor from
consumer into producer, such that compacted data will be fed into
the interposer, before it segregates data.
We're short-circuiting this logic if TWCS isn't in use as
compacting reader adds overhead to compaction, given that this reader
will pop fragments from combined sstable reader, compact them using
mutation_compactor and finally push them out to the underlying
reader.
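
The difference between compacting in each consumer and compacting in the producer can be illustrated with a toy Python model. Everything here is a simplification for illustration: fragments are `(key, timestamp, value)` tuples, "compaction" just keeps the newest write per key, and the bucket is derived from the timestamp. None of these names come from the patch.

```python
from collections import defaultdict


def compact(fragments):
    """Toy compaction: keep only the newest write for each key."""
    newest = {}
    for key, ts, value in fragments:
        if key not in newest or ts > newest[key][0]:
            newest[key] = (ts, value)
    return [(key, ts, value) for key, (ts, value) in sorted(newest.items())]


def segregate(fragments, window_size):
    """Toy interposer: route each fragment to its time-window bucket."""
    buckets = defaultdict(list)
    for key, ts, value in fragments:
        buckets[ts // window_size].append((key, ts, value))
    return dict(buckets)


def compact_in_consumers(fragments, window_size):
    """Old flow: segregate first, then compact inside each bucket."""
    return {w: compact(f) for w, f in segregate(fragments, window_size).items()}


def compact_in_producer(fragments, window_size):
    """New flow: compact first, then feed compacted data to the interposer."""
    return segregate(compact(fragments), window_size)
```

With `[("a", 5, "old"), ("a", 15, "new")]` and a window size of 10, the consumer-side flow keeps both versions of `a` (one per window), while the producer-side flow keeps only the newer one.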

without compacting reader (e.g. STCS + no interposer):
228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops)
224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops)
224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops)

with compacting reader (e.g. TWCS + interposer):
221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops)
216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops)
215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops)

So the cost of compacting data across buckets is ~3.5%, which happens
only with interposer enabled and GC writer disabled.
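
The ~3.5% figure can be reproduced from the numbers above. The short Python check below pairs the runs by position (first with first, and so on), which is an assumption about how the measurements correspond.

```python
# Throughput in partitions/sec, copied from the measurements above.
without_reader = [228255.92, 224636.13, 224582.38]
with_reader = [221376.19, 216611.65, 215975.51]


def slowdown_percent(baseline: float, measured: float) -> float:
    """Relative throughput loss of `measured` versus `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0


per_run = [slowdown_percent(b, m) for b, m in zip(without_reader, with_reader)]
average = sum(per_run) / len(per_run)

for pct in per_run:
    print(f"slowdown: {pct:.2f}%")
print(f"average:  {average:.2f}%")
```

The three pairs come out at roughly 3.0%, 3.6% and 3.8%, which averages to about 3.5%.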

Fixes #9662.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 17:14:44 -03:00
Raphael S. Carvalho
484269cd8f compaction: make enable_garbage_collected_sstable_writer() more precise
we only want to enable GC writer if incremental compaction is required.
let's make it more precise by checking that size limit for sstable
isn't disabled, so GC writer will only be enabled for compaction
strategies that really need it. So strategies that don't need it
won't pay the penalty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-10 15:22:08 -03:00
Raphael S. Carvalho
e0758fded1 compaction_manager: make get_compaction_state() private
internal method that should never be directly used by the outside
world.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211210120806.19233-1-raphaelsc@scylladb.com>
2021-12-10 17:19:24 +03:00
Mikołaj Sielużycki
504efe0607 table: Prevent resurrecting data from memtable on compaction
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
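
A minimal Python sketch of the rule, assuming integer write timestamps: `can_purge_tombstone` is a hypothetical name, and treating a tombstone whose timestamp equals the minimum memtable timestamp as not purgeable is a conservative assumption rather than something the message states.

```python
def can_purge_tombstone(tombstone_timestamp: int,
                        is_expired: bool,
                        min_memtable_timestamp: int) -> bool:
    """Return True if compaction may drop the tombstone.

    min_memtable_timestamp is the smallest write timestamp currently held in
    the memtable; data at or below the tombstone's timestamp may still need
    the tombstone to shadow it.
    """
    if not is_expired:
        return False
    # Only purge if every write in the memtable is newer than the tombstone.
    return tombstone_timestamp < min_memtable_timestamp
```

For example, an expired tombstone with timestamp 100 is kept while the memtable holds a repaired write with timestamp 50, because that older write would otherwise be resurrected.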
2021-12-09 13:22:14 +01:00
Benny Halevy
fed7319698 compaction_manager: stop_compaction: expose optional table*
To be used by api layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-09 14:14:49 +02:00
Mikołaj Sielużycki
7ce0ca040d table: Add min_memtable_timestamp function to table
2021-12-09 13:14:38 +01:00
Botond Dénes
2e5440bdf2 Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho
Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too.

There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2.

tests: unit(debug).

Closes #9725

* github.com:scylladb/scylla:
  compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
  sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
  sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
2021-12-07 15:17:38 +02:00
Raphael S. Carvalho
2435bd14c6 compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:57 -03:00
Raphael S. Carvalho
c6399005a3 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:55 -03:00
Raphael S. Carvalho
aebbe68239 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:53 -03:00
Raphael S. Carvalho
c3c070a5ca sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-07 09:37:51 -03:00
Avi Kivity
395b30bca8 mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2
As part of the drive to move over to flat_mutation_reader_v2, update
make_filtering_reader(). Since it doesn't examine range tombstones
(only the partition_start, to filter the key) the entire patch
is just glue code upgrading and downgrading users in the pipeline
(or removing a conversion, in one case).

Test: unit (dev)

Closes #9723
2021-12-07 12:18:07 +02:00
Raphael S. Carvalho
6737c88045 compaction_manager: use single semaphore for serialization of maintenance compactions
We have three semaphores for serialization of maintenance ops.
1) _rewrite_sstables_sem: for scrub, cleanup and upgrade.
2) _major_compaction_sem: for major
3) _custom_job_sem: for reshape, resharding and offstrategy

scrub, cleanup and upgrade should be serialized with major,
so rewrite sem should be merged into major one.

offstrategy is also a maintenance op that should be serialized
with others, to reduce compaction aggressiveness and space
requirement.

resharding is one-off operation, so can be merged there too.
the same applies for reshape, which can take long and not
serializing it with other maintenance activity can lead to
exhaustion of resources and high space requirement.

let's have a single semaphore to guarantee their serialization.

deadlock isn't an issue because locks are always taken in same
order.
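
The effect of a single shared semaphore can be sketched with Python's asyncio. This is an analogy only: Scylla uses Seastar semaphores in C++, and the class `MaintenanceSerializer` and the job names below are hypothetical.

```python
import asyncio


class MaintenanceSerializer:
    """One semaphore with a single unit shared by all maintenance jobs."""

    def __init__(self) -> None:
        self._sem = asyncio.Semaphore(1)
        self.running = 0
        self.max_running = 0  # highest concurrency ever observed
        self.finished = []

    async def run(self, name, job) -> None:
        async with self._sem:  # every maintenance op waits on the same lock
            self.running += 1
            self.max_running = max(self.max_running, self.running)
            try:
                await job()
            finally:
                self.running -= 1
            self.finished.append(name)


async def demo() -> MaintenanceSerializer:
    serializer = MaintenanceSerializer()

    async def fake_job() -> None:
        await asyncio.sleep(0)  # yield so other tasks get a chance to run

    names = ("major", "cleanup", "reshape", "offstrategy")
    await asyncio.gather(*(serializer.run(n, fake_job) for n in names))
    return serializer


if __name__ == "__main__":
    result = asyncio.run(demo())
    print("max concurrent maintenance jobs:", result.max_running)
```

Because every job type takes the same semaphore, at most one maintenance operation runs at a time, regardless of which kind it is.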

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211201182046.100942-1-raphaelsc@scylladb.com>
2021-12-07 12:18:07 +02:00
Benny Halevy
cc122984d6 compaction: scrub: add quarantine_mode option
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:29:04 +02:00
Benny Halevy
60ff28932c compaction_manager: perform_sstable_scrub: get the whole compaction_type_options::scrub
So we can pass additional options on top of the scrub mode.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:21:37 +02:00
Benny Halevy
bbe275f37d compaction: scrub_sstables_validate_mode: quarantine invalid sstables
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.

Refs #7658

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:14:16 +02:00
Benny Halevy
13e7b00f2e sstables: add is_quarantined
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Benny Halevy
07c5ddf182 sstables: add is_eligible_for_compaction
Currently compaction_manager tracks sstables
based on !requires_view_building() and similarly,
table::in_strategy_sstables picks up only sstables
that are not in staging.

is_eligible_for_compaction() generalizes this condition
in preparation for adding a quarantine subdirectory for
invalid sstables that should not be compacted as well.
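
As a sketch of the generalized condition, the following Python model combines the existing staging check with the quarantine check that later patches add. The `SSTable` dataclass and its field names are illustrative stand-ins, not the real C++ class.

```python
from dataclasses import dataclass


@dataclass
class SSTable:
    name: str
    requires_view_building: bool = False  # True while the sstable is in staging
    is_quarantined: bool = False          # True once moved to quarantine

    def is_eligible_for_compaction(self) -> bool:
        """Single predicate replacing scattered staging-only checks."""
        return not self.requires_view_building and not self.is_quarantined


def in_strategy_sstables(sstables):
    """Hypothetical filter: keep only sstables the strategy may pick."""
    return [s for s in sstables if s.is_eligible_for_compaction()]
```

Centralizing the condition means a new exclusion reason only has to be added in one place.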

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-12-05 18:00:44 +02:00
Raphael S. Carvalho
2f9f089eda compaction_strategy: kill unused compaction_strategy_type::major
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:27:10 -03:00
Raphael S. Carvalho
0e3d388ebb compaction: Log skip of fully expired sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:25:48 -03:00
Raphael S. Carvalho
9725e5efa9 compaction_strategy: kill unused can_compact_partial_runs()
This strategy method was introduced unnecessarily. We assumed it was
going to be needed, but it turns out it was never needed, not even
for ICS. Also it's built on a wrong assumption, as an output
sstable run being generated can never be compacted in parallel
because the non-overlapping requirement can be easily broken.
LCS for example can allow parallel compaction on different runs
(levels) but correctness cannot be guaranteed when the same runs
are compacted in parallel.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:20:51 -03:00
Raphael S. Carvalho
7a7a2467fa compaction: kill useless on_skipped_expired_sstable()
It was introduced by commit 5206a97915 because fully expired sstable
wouldn't be registered and therefore could never be removed from backlog
tracker. This is no longer possible as table is now responsible for
removing all input sstables. So let's kill on_skipped_expired_sstable()
as it's now only boilerplate we don't need.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:29 -03:00
Raphael S. Carvalho
32c2534e91 compaction: merge _total_input_sstables and _ancestors
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-03 12:19:23 -03:00
Raphael S. Carvalho
4a02e312f6 compaction: increase disjoint tolerance in TWCS reshape
When reshaping TWCS table in relaxed mode, which is the case for
offstrategy and boot, disjoint tolerance is too strict, which can
lead those processes to do more work than needed.
Let's increase the tolerance to max threshold, which will limit the
amount of sstables opened in compaction to a reasonable amount.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211130132538.56285-1-raphaelsc@scylladb.com>
2021-12-03 06:38:42 +02:00
Raphael S. Carvalho
6d750d4f59 compaction_manager: move check_for_cleanup into perform_cleanup()
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00
Raphael S. Carvalho
9aed7e9d67 compaction_manager: replace get_total_size by one liner
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-12-02 14:39:31 -03:00