That's done by picking the ideal level for the input, such
that LCS won't have to either promote or demote data, because
the output level is not the best candidate for having the
size of the output data.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There's a public call on replica::table to get back the compaction
manager reference. It's not needed, actually. The users of the call are
distributed loader which already has database at hand, and a test that
creates itw own instance of compaction manager for its testing tables
and thus also has it available.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220406171351.3050-1-xemul@scylladb.com>
Bucket awareness in cleanup was introduced in a69d98c3d0.
STCS and TWCS already support it, and now LCS will receive it.
The goal of bucket awareness is to reduce writeamp in cleanup,
therefore reducing operation time. Additionally, garbage collection
becomes more efficient as shadowed data can now be potentially
compacted with the data that shadows it, assuming they're on
the same level.
The implementation for LCS is simple. Will reuse the procedure
for STCS for returning jobs in level 0. And one job will be
returned for each non-empty level > 0. What allows us to do it
is our incremental selection approach used in compaction,
that sets a limit on memory usage and disk space requirement.
Fixes#10097.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220331173417.211257-1-raphaelsc@scylladb.com>
For compaction to be able to purge expired data, like tombstones, a
sstable set snapshot is set in the compaction descriptor.
That's a decision that belongs to task type. For example, all regular
compaction enable GC, whereas scrub for example doesn't for safety
reasons.
The problem is that the decision is being made by every instantiation
of compaction_descriptor in the strategies, which is both unnecessary
and also adds lots of boilerplate to the code, making it hard to
understand and work with.
As sstable set snapshot is an implementation detail, a new method
is being added to compaction_descriptor to make the intention
clearer, making the interface easier to understand.
can_purge_tombstones, used previously by rewrite task only, is being
reused for communicating GC intention into task::compact_sstables().
The boilerplate was a pain when adding a new strategy method for
the ongoing work on cleanup, described by issue #10097.
Another benefit is that we'll now only create a set snapshot when
compaction will really run. Before, it could happen that the snapshot
would be discarded if the compaction attempt had to be postponed,
which is a waste of cpu cycles.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Compaction manager is calling back the table to run off-strategy compaction,
but the logic clearly belongs to manager which should perform the
operation independently and only call table to update its state with the
result.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220315174504.107926-2-raphaelsc@scylladb.com>
Table submits compaction request into manager, which in turn calls
back table to run the compaction when the time has come, i.e.:
table -> compaction manager -> table -> execute compaction
But manager should not rely on table to run compaction, as compaction
execution procedure sits one layer below the manager and should be
accessed directly by it, i.e:
table -> compaction manager -> execute compaction
This makes code easier to understand and update_compaction_history()
can now be noop for unit tests using table_state.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220311023410.250149-1-raphaelsc@scylladb.com>
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes#10194
The series overhauls the compaction_manager::task design and implementation
by properly layering the functionality between the compaction_manager
that deals with generic task execution, and the per-task business logic that is defined
in a set of classes derived from the generic task class.
While at it, the series introduces `task::state` and a set of helper functions to manage it
to prevent leaks in the statistics, fixing #9974.
Two more stats counter were exposed: `completed_tasks` and a new `postponed_tasks`.
Test: sstable_compaction_test
Dtest: compaction_test.py compaction_additional_test.py
Fixes#9974Closes#10122
* github.com:scylladb/scylla:
compaction_manager: use coroutine::switch_to
compaction_manager::task: drop _compaction_running
compaction_manager: move per-type logic to derived task
compaction_manager: task: add state enum
compaction_manager: task: add maybe_retry
compaction_manager: reevaluate_postponed_compactions: mark as noexcept
compaction_manager: define derived task types
compaction_manager: register_metrics: expose postponed_compactions
compaction_manager: register_metrics: expose failed_compactions
compaction_manager: register_metrics: expose _stats.completed_tasks
compaction: add documentation for compaction_type to string conversions
compaction: expose to_string(compaction_type)
compaction_manager: task: standardize task description in log messages
compaction_manager: refactor can_proceed
compaction_manager: pass compaction_manager& to task ctor
compaction_manager: use shared_ptr<task> rather than lw_shared_ptr
compaction_manager: rewrite_sstables: acquire _maintenance_ops_sem once
compaction_manager: use compaction_state::lock only to synchronize major and regular compaction
Move the business logic into the task specific classes.
Separating initialization during task construction,
from the compaction_done task, moved into
a do_run() method, and in some cases moving
a lambda function that was called per table (as in
rewrite_sstables) into a private method of the
derived class.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);
(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
Introduced in commit: ddd693c6d7
We're not emplacing newer windows in the tracker, causing
std::out_of_range when replacing sstables for windows.
Let's fix the logic and add an unit test to cover this.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220301194944.95096-1-raphaelsc@scylladb.com>
There's no automated test for controller, it's time to have one.
Let's start with a basic one that verifies the assumption that
perfectly compacted tiers should produce 0 backlog.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This new interface allows table to communicate multiple changes in the
SSTable set with a single call, which is useful on compaction completion
for example.
With this new interface, the size tiered backlog tracker will be able to
know when compaction completed, which will allow it to recompute tiers
and their backlog contribution, if any. Without it, tiered tracker
would have to recompute tiers for every change, which would be terribly
expensive.
The old remove/add interface are being removed in favor of the new one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.
Memtables are also used outside the replica, in two places:
- in some virtual tables; this is also in some way inside the replica,
(virtual readers are installed at the replica level, not the
cooordinator), so I don't consider it a layering violation
- in many sstable unit tests, as a convenient way to create sstables
with known input. This is a layering violation.
We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).
Test: unit (dev)
Closes#10120
"
Currently the mutation compactor supports v1 and v2 output and has a v1
output. The next step is to add a v2 output but this would lead to a
full conversion matrix which we want to avoid. So in preparation we drop
the v1 input support. Most inputs were already v2, but there were some
notable exceptions: tests, the compacting reader and the multishard
query code. The former two was a simple mechanical update but the latter
required some further work because it turned out the v2 version of
evictable reader wasn't used yet and thus it managed to hide some bugs
and dropped features. While at it, we migrate all evictable and
multishard reader users to the v2 variant of the respective readers and
drop the v1 variant completely.
With this the road is open to a v2 compactor output and therefore to a
v2 sstable writer.
Tests: unit(dev, release), dtest(paging_additional_test.py)
"
* 'compact-mutation-v2-only-input/v5' of https://github.com/denesb/scylla:
test/lib/test_utils: return OK from check() variants
repair/row_level: use evictable reader v2
db/view/view_updating_consumer: migrate to v2
test/boost/mutation_reader_test: add v2 specific evictable reader tests
test: migrate to evictable reader v2 and multishard combining reader v2
compact_mutation: drop support for v1 input
test: pass v2 input to mutation_compaction
test/boost/mutation_test: simplify test_compaction_data_stream_split test
mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones
multishard_mutation_query: migrate to v2
mutation_fragment_v2: range_tombstone_change: add memory_usage()
evictable_reader_v2: terminate active range tombstones on reader recreation
evictable_reader_v2: restore handling of non-monotonically increasing positions
evictable_reader_v2: simplify handling of reader recreation
mutation: counter_write_query: use v2 reader
mutation: migrate consume() to v2
mutation_fragment_v2,flat_mutation_reader_v2: mirror v1 concept organization
mutation_reader: compacting_reader: require a v2 input reader
db/view/view_builder: use v2 reader
test/lib/flat_mutation_reader_assertions: adjust has_monotonic_positions() to v2 spec
Take into account that get_reshaping_job selects only
buckets that have more than min_threashold sstables in them.
Therefore, with 256 disjoint sstables in different windows,
allow first or last windows to not be selected by get_reshaping_job
that will return at least disjoint_sstable_count - min_threshold + 1
sstables, and not more than disjoint_sstable_count.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220123090044.38449-2-bhalevy@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Currently, when TWCS reshape finds a bucket containing more than 32
files, it will blindly resize that bucket to 32.
That's very bad because it doesn't take into consideration that
compaction efficiency depends on relative sizes of files being
compacted together, meaning that a huge file can be compacted with
a tiny one, producing lots of write amplification.
To solve this problem, STCS reshape logic will now be reused in
each time bucket. So only similar-sized files are compacted together
and the time bucket will be considered reshaped once its size tiers
are properly compacted, according to the reshape mode.
Fixes#9938.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220117205000.121614-1-raphaelsc@scylladb.com>
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
seastar::later() was recently deprecated and replaced with two
alternatives: a cheap seastar::yield() and an expensive (but more
powerful) seastar::check_for_io_immediately(), that corresponds to
the original later().
This patch replaces all later() calls with the weaker yield(). In
all cases except one, it's unambiguously correct. In one case
(test/perf scheduling_latency_measurer::stop()) it's not so ambiguous,
since check_for_io_immediately() will additionally force a poll and
so will cause more work to be done (but no additional tasks to be
executed). However, I think that any measurement that relies on
the measuring the work on the last tick to be inaccurate (you need
thousands of ticks to get any amount of confidence in the
measurement) that in the end it doesn't matter what we pick.
Tests: unit (dev)
Closes#9904
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes#3560
[avi: resolve conflicts vs data_dictionary]
Make sure that major will compact data in all sstables and memtable,
as tombstones sitting in memtable could shadow data in sstables.
For example, a tombstone in memtable deleting a large partition could
be missed in major, so space wouldn't be saved as expected.
Additionally, write amplification is reduced as data in memtable
won't have to travel through tiers once flushed.
Fixes#9514.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211217160055.96693-2-raphaelsc@scylladb.com>
"
TWCS perform STCS on a window as long as it's the most recent one.
From there on, TWCS will compact all files in the past window into
a single file. With some moderate write load, it could happen that
there's still some compaction activity in that past window, meaning
that per-window major may miss some files being currently compacted.
As a result, a past window may contain more than 1 file after all
compaction activity is done on its behalf, which may increase read
amplification. To avoid that, TWCS will now make sure that per-window
major is serialized, to make sure no files are missed.
Fixes#9553.
tests: unit(dev).
"
* 'fix_twcs_per_window_major_v3' of https://github.com/raphaelsc/scylla:
TWCS: Make sure major on past window is done on all its sstables
TWCS: remove needless param for STCS options
TWCS: kill unused param in newest_bucket()
compaction: Implement strategy control and wire it
compaction: Add interface to control strategy behavior.
"
Users are adjusted by sprinkling `upgrade_to_v2()` and
`downgrade_to_v1()` where necessary (or removing any of these where
possible). No attempt was made to optimize and reduce the amount of
v1<->v2 conversions. This is left for follow-up patches to keep this set
small.
The combined reader is composed of 3 layers:
1. fragment producer - pop fragments from readers, return them in batches
(each fragment in a batch having the same type and pos).
2. fragment merger - merge fragment batches into single fragments
3. reader implementation glue-code
Converting layers (1) and (3) was mostly mechanical. The logic of
merging range tombstone changes is implemented at layer (2), so the two
different producer (layer 1) implementations we have share this logic.
Tests: unit(dev)
"
* 'combined-reader-v2/v4' of https://github.com/denesb/scylla:
test/boost/mutation_reader_test: add test_combined_reader_range_tombstone_change_merging
mutation_reader: convert make_clustering_combined_reader() to v2
mutation_reader: convert position_reader_queue to v2
mutation_reader: convert make_combined_reader() overloads to v2
mutation_reader: combined_reader: convert reader_selector to v2
mutation_reader: convert combined reader to v2
mutation_reader: combined_reader: attach stream_id to mutation_fragments
flat_mutation_reader_v2: add v2 version of empty reader
test/boost/mutation_reader_test: clustering_combined_reader_mutation_source_test: fix end bound calculation
Once current window is sealed, TWCS is supposed to compact all its
sstables into one. If there's ongoing compaction, it can happen that
sstables are missed and therefore past windows will contain more than
one sstable. Additionally, it could happen that major doesn't happen
at all if under heavy load. All these problems are fixed by serializing
major on past window and also postponing it if manager refuses to run
the job now.
Fixes#9553.
Reviewed-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
STCS option can be retrieved from class member, as newest_bucket()
is no longer a static function. let's get rid of it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Verify that TWCS compaction can now compact data across time windows,
like a tombstone which will cause all shadowed data to be purged
once they're all compacted together.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For each quarantine mode:
Validate sstables to quarantine one of them
and then scrub with the given quarantine mode
and verify the output whwther the quarantined
sstable was scrubbed or not.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.
Refs #7658
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This series hardens compaction_manager::remove by:
- add debug logging around task execution and stopping.
- access compaction_state as lw_shared_ptr rather than via a raw pointer.
- with that, detach it from `_compaction_state` in `compaction_manager::remove` right away, to prevent further use of it while compactions are stopped.
- added write_lock in `remove` to make sure the lock is not held by any stray task.
Test: unit(dev), sstable_compaction_test(debug)
Dtest: alternator_tests.py:AlternatorTest.test_slow_query_logging (debug)
Closes#9636
* github.com:scylladb/scylla:
compaction_manager: add compaction_state when table is constructed
compaction_manager: remove: fixup indentation
compaction_manager: remove: detach compaction_state before stopping ongoing compactions
compaction_manager: remove: serialize stop_ongoing_compactions and gate.close
compaction_manager: task: keep a reference on compaction_state
test: sstable_compaction_test: incremental_compaction_data_resurrection_test: stop table before it's destroyed.
test: sstable_utils: compact_sstables: deregister compaction also on error path
test: sstable_compaction_test: partial_sstable_run_filtered_out_test: deregiser_compaction also on error path
test: compaction_manager_test: add debug logging to register/deregister compaction
test: compaction_manager_test: deregister_compaction: erase by iterator
test: compaction_manager_test: move methods out of line
compaction_manager: compaction_state: use counter for compaction_disabled
compaction_manager: task: delete move and copy constructors
compaction_manager: add per-task debug log messages
compaction_manager: stop_ongoing_compactions: log number of tasks to stop