With compaction_manager switching to table_state, we'll need to
introduce a method in table_state to return maintenance set.
So better to have a descriptive name for main set.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.
we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
update_history can take a long time compared to compaction, as a call
issued on shard S1 can be handled on shard S2. If the other shard is
under heavy load, we may unnecessarily block kicking off a new
compaction. Normally it isn't a problem, as compactions aren't super
frequent, but there were edge cases where the described behaviour caused
compaction to fail to keep up with excessive flushing, leading to too
many sstables on disk and OOM during a read.
There is no need to wait with next compaction until history is updated,
so release the weight earlier to remove unnecessary serialization.
Changelog:
v3:
- explicitly call deregister instead of moving the weight RAII object to release weight
- mark compaction as finished when sstables are compacted, without waiting for history to update
v2:
- Split the patches differently for easier review
- Rebased agains newer master, which contains fixes that failed the debug version of the test
- Removed the test, as it will be provided by [PR#10717](https://github.com/scylladb/scylla/pull/10717)
Closes#10507
* github.com:scylladb/scylla:
compaction: Release compaction weight before updating history.
compaction: Inline compact_sstables_and_update_history call.
compaction: Extract compact_sstables function
compaction: Rename compact_sstables to compact_sstables_and_update_history
compaction: Extract update_history function
compaction: Extract should_update_history function.
compaction: Fetch start_size from compaction_result
compaction: Add tracking start_size in compaction_result.
We know in advance the maximum number of
sstable generations to track, so reserve space for it
to prevent vector reallocation for large number of sstables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Let's also use generation_type for compaction ancestors, so once we
support something other than integer for SSTable generation, we
won't have discrepancy about what the generation type is.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The messages only dumps the last sealed fragment, but it should dump
all the output fragments replacing the exhausted input ones.
Let's print origin of output fragments, so we can differ between
files with compaction and garbage-collection origin.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220524232232.119520-1-raphaelsc@scylladb.com>
No functional changes intended - this series is quite verbose,
but after it's in, it should be considerably easier to change
the type of SSTable generations to something else - e.g. a string
or timeUUID.
Closes#10533
flat_mutation_reader make_scrubbing_reader no longer exists
and there is no need to include flat_mutation_reader.hh
nor forward declare the class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We filter only on the parittion key, so it doesn't matter,
but we want to get rid of flat_mutation_reader v1.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but its probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
interrupt() makes it sound like it's interrupting the compaction, but it's
actually called *on* interrupt, to handle the interrupt scenario.
Let's rename it to on_interrupt().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311000128.189840-1-raphaelsc@scylladb.com>
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.
While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.
Two more stats counter were exposed: `completed_tasks` and a new `postponed_tasks`.
Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py
Fixes#9974Closes#10122
* github.com:scylladb/scylla:
compaction_manager: use coroutine::switch_to
compaction_manager::task: drop _compaction_running
compaction_manager: move per-type logic to derived task
compaction_manager: task: add state enum
compaction_manager: task: add maybe_retry
compaction_manager: reevaluate_postponed_compactions: mark as noexcept
compaction_manager: define derived task types
compaction_manager: register_metrics: expose postponed_compactions
compaction_manager: register_metrics: expose failed_compactions
compaction_manager: register_metrics: expose _stats.completed_tasks
compaction: add documentation for compaction_type to string conversions
compaction: expose to_string(compaction_type)
compaction_manager: task: standardize task description in log messages
compaction_manager: refactor can_proceed
compaction_manager: pass compaction_manager& to task ctor
compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
To be used in the next patch to generate
a string dscription from the compaction_type.
In theory, we could use compaction_name()
btu the latter returns the compaction type
in all-upper case and that is very different from
what we print to the log today. The all-upper
strings are used for the api layer, e.g. to
stop tasks of a particular compaction type.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);
(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
Info messages are logged when compaction jobs start and finish
but there is no message logged when the job is interrupted, e.g.
when stopped by the compaction_manager.
Refs scylladb/scylla-dtest#2468
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we add a v2 output option to the compactor, we want to get rid of
all the v1 inputs to make it simpler. This means that for a while the
compacting reader will be in a strange place of having a v2 input and a
v1 output. Hopefully, not for long.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
Currently when scrub/validate is stopped (e.g. via the api),
scrub_validate_mode_validate_reader co_return:s without
closing the reader passed to it - causing a crash due
to internal error check, see #9766.
Throwing a compaction_stopped_exception rather than co_return:ing
an exception will be handled as any other exeption, including closing
the reader.
Fixes#9766
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>
If interposer is enabled, like the timestamp-based one for TWCS, data
from different buckets (e.g. windows) cannot be compacted together because
mutation compaction happens inside each consumer, where each consumer
will be belong to a different bucket.
To remove this limitation, let's move the mutation compactor from
consumer into producer, such that compacted data will be feeded into
the interposer, before it segregates data.
We're short-circuiting this logic if TWCS isn't in use as
compacting reader adds overhead to compaction, given that this reader
will pop fragments from combined sstable reader, compact them using
mutation_compactor and finally push them out to the underlying
reader.
without compacting reader (e.g. STCS + no interposer):
228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops)
224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops)
224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops)
with compacting reader (e.g. TWCS + interposer):
221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops)
216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops)
215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops)
So the cost of compacting data across buckets is ~3.5%, which happens
only with interposer enabled and GC writer disabled.
Fixes#9662.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
we only want to enable GC writer if incremental compaction is required.
let's make it more precise by checking that size limit for sstable
isn't disabled, so GC writer will only be enabled for compaction
strategies that really need it. So strategies that don't need it
won't pay the penalty.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too.
There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2.
tests: unit(debug).
Closes#9725
* github.com:scylladb/scylla:
compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2
sstable_set: update make_crawling_reader() to flat_mutation_reader_v2
sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2
sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2
sstable_set: update incremental_reader_selector to flat_mutation_reader_v2
As part of the drive to move over to flat_mutation_reader_v2, update
make_filtering_reader(). Since it doesn't examine range tombstones
(only the partition_start, to filter the key) the entire patch
is just glue code upgrading and downgrading users in the pipeline
(or removing a conversion, in one case).
Test: unit (dev)
Closes#9723
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.
Refs #7658
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently compaction_manager tracks sstables
based on !requires_view_building() and similarly,
table::in_strategy_sstables picks up only sstables
that are not in staging.
is_eligible_for_compaction() generalizes this condition
in preparation for adding a quarantine subdirectory for
invalid sstables that should not be compacted as well.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>