It is confusing. For query compaction, it initialized to `always_gc`,
for sstable compaction it is initialized to a lambda calling into
`can_gc()`. This makes understanding the purpose of this member very
confusing.
The real use of this member is to bridge
mutation_partition::compact_and_expire() with can_gc(). This patch
ditches the member and creates the lambda near the call sites instead,
just like the other params to `compact_and_expire()` already are.
can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just
blindly returning true when in query mode.
With this patch, whether tombstones are collected or not in query mode
is now consistent and controlled by the tombstone_gc_state.
Into total and live. Currently only live (those with live content) are
counted. Report live and total seprately, just like we do for rows. This
allows deducing the count of dead partitions as well, which is
particularly interesting for scans.
Closesscylladb/scylladb#26548
So tombstones can be purged correctly based on the tombstone gc mode.
Currently if repair-mode is used, tombstones are not purged at all,
which can lead to purged tombstone being re-replicated to replicas which
already purged them via read-repair.
This is not a correctness problem, tombstones are not included in data
query resutl or digest, these purgable tombstone are only a nuissance
for read repair, where they can create extra differences between
replicas. Note that for the read repair to trigger, some difference
other than in purgable tombstones has to exist, because as mentioned
above, these are not included in digets.
Fixes: scylladb/scylladb#24332Closesscylladb/scylladb#26351
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.
Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.
This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:
The repaired_at on disk field of a SSTable is used.
- A 64-bit number increases sequentially
The sstables_repaired_at is added to the system.tablets table.
- repaired_at <= sstables_repaired_at means the sstable is repaired
The being_repaired in memory field of a SSTable is added.
- A repair UUID tells which sstable has participated in the repair
Initial test results:
1) Medium dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~500GB
Cluster pre-populated with ~500GB of data before starting repairs job.
Results for Repair Timings:
The regular repair run took 210 mins.
Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
The speedup is: 183 mins / 48s = 228X
2) Small dataset results
Node amount: 3
Instance type: i4i.2xlarge
Disk usage per node: ~167GB
Cluster pre-populated with ~167GB of data before starting the repairs job.
Regular repair 1st run took 110s, 2nd and 3rd runs took 110s.
Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
The speedup is: 110s / 1.5s = 73X
3) Large dataset results
Node amount: 6
Instance type: i4i.2xlarge, 3 racks
50% of base load, 50% read/write
Dataset == Sum of data on each node
Dataset Non-incremental repair (minutes)
1.3 TiB 31:07
3.5 TiB 25:10
5.0 TiB 19:03
6.3 TiB 31:42
Dataset Incremental repair (minutes)
1.3 TiB 24:32
3.0 TiB 13:06
4.0 TiB 5:23
4.8 TiB 7:14
5.6 TiB 3:58
6.3 TiB 7:33
7.0 TiB 6:55
Fixes#22472Closesscylladb/scylladb#24291
* github.com:scylladb/scylladb:
replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
compaction: Move compaction_reenabler to compaction_reenabler.hh
topology_coordinator: Make rpc::remote_verb_error to warning level
repair: Add metrics for sstable bytes read and skipped from sstables
test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
test.py: Add tests for tablet incremental repair
repair: Add tablet incremental repair support
compaction: Add tablet incremental repair support
feature_service: Add TABLET_INCREMENTAL_REPAIR feature
tablet_allocator: Add tablet_force_tablet_count_increase and decrease
repair: Add incremental helpers
sstable: Add being_repaired to sstable
sstables: Add set_repaired_at to metadata_collector
mutation_compactor: Introduce add operator to compaction_stats
tablet: Add sstables_repaired_at to system.tablets table
test: Fix drain api in task_manager_client.py
As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system.
Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh
Fixes: #22102Fixes: #22103Fixes: #22105Closesscylladb/scylladb#25082
Max purgeable has two possible values for each partition: one for
regular tombstones and one for shadowable ones. Yet currently a single
member is used to cache the max-purgeable value for the partition, so
whichever kind of tombstone is checked first, its max-purgeable will
become sticky and apply to the other kind of tombstones too. E.g. if the
first can_gc() check is for a regular tombstone, its max-purgeable will
apply to shadowable tombstones in the partition too, meaning they might
not be purged, even though they are purgeable, as the shadowable
max-purgeable is expected to be more lenient. The other way around is
worse, as it will result in regular tombstone being incorrectly purged,
permitted by the more lenient shadowable tombstone max-purgeable.
Fix this by caching the two possible values in two separate members.
A reproducer unit test is also added.
Fixes: scylladb/scylladb#23272Closesscylladb/scylladb#24171
Let compact_mutation_state collect all tombstone purge attempts
and failures. For this purpose a new statistic structure is created
(tombstone_purge_stats) and the relative stats are collected in
the can_purge_tombstone method.
The statistics are collect only for sstables compaction.
An optional statistics structure can be passed in via compact_mutation_state
constructor.
Currently, when a max purgeable timestamp is computed, there is no
information where it comes from and how the value was obtained.
Take compaction, if there are memtables or other uncompacting sstables
possibly shadowing data, the timestamp is decreased to ensure a
tombstone is not purged but the caller does not know what that the
timestamp has its value.
In this patch, we extend the return type of max_purgeable_fn to
contain not only a timestamp but also an information on how it was
computed. This information will be required to collect statistics
on tombstone purge failures due to overlapping memtables/uncompacting
sstables that come later in the series.
This doesn't introduce additional work for single-partition queries: the
key is copied anyway on consume_end_of_stream().
Multi-partition reads and compaction are not that sensitive to
additional copy added.
This change fixes a bug in the compacting_reader: currently the reader
passes _last_uncompacted_partition_start.key() to the compactor's
consume_new_partition(). When the compactor emits enough content for this
partition, _last_uncompacted_partition_start is moved from to emit the
partition start, this makes the key reference passed to the compaction
corrupt (refer to moved-from value). This in turn means that subsequent
GC checks done by the compactor will be done with a corrupt key and
therefore can result in tombstone being garbage-collected while they
still cover data elsewhere (data resurrection).
The compacting reader is violating the API contract and normally the bug
should be fixed there. We make an exception here because doing the fix
in the mutation compactor better aligns with our future plans:
* The fix simplifies the compactor (gets rid of _last_dk).
* Prepares the way to get rid of the consume API used by the compactor.
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.
in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
'static inline' is always wrong in headers - if the same header is
included multiple times, and the function happens not to be inlined,
then multiple copies of it will be generated.
Fix by mechanically changing '^static inline' to 'inline'.
When purging regular tombstone consult the min_live_timestamp,
if available.
For shadowable_tombstones, consult the
min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.
If both are missing, fallback to the legacy
(and inaccurate) min_timestamp.
Fixesscylladb/scylladb#20423Fixesscylladb/scylladb#20424
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Before we add a new, is_shadowable, parameter to it.
And define global `can_always_purge` and `can_never_purge`
functions, a-la `always_gc` and `never_gc`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
in theory, std::result_of_t should have been removed in C++20. and
std::invoke_result_t is available since C++17. thanks to libstdc++,
the tree is compiling. but we should not rely on this.
so, in this change, we replace all `std::result_of_t` with
`std::invoke_result_t`. actually, clang + libstdc++ is already warning
us like:
```
In file included from /home/runner/work/scylladb/scylladb/multishard_mutation_query.cc:9:
In file included from /home/runner/work/scylladb/scylladb/schema/schema_registry.hh:11:
In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/unordered_map:38:
Warning: /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/type_traits:2624:5: warning: 'result_of<void (noop_compacted_fragments_consumer::*(noop_compacted_fragments_consumer &))()>' is deprecated: use 'std::invoke_result' instead [-Wdeprecated-declarations]
2624 | using result_of_t = typename result_of<_Tp>::type;
| ^
/home/runner/work/scylladb/scylladb/mutation/mutation_compactor.hh:518:43: note: in instantiation of template type alias 'result_of_t' requested here
518 | if constexpr (std::is_same_v<std::result_of_t<decltype(&GCConsumer::consume_end_of_stream)(GCConsumer&)>, void>) {
|
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18835
codespell reports that "statics" could be the misspelling of
"statistics". but "static" here means the static column(s). so
replace "static" with more specific wording.
Refs #589
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#17216
Currently, the compactor requires a valid stream and thus abandoning a
partition in the middle was not possible. This causes some complications
for the compacting reader, which implements methods such as
`next_partition()` which is possibly called in the middle of a
partition. In this case the compacting reader attempts to close the
partition properly by inserting a synthetic partition-end fragment into
the stream. This is not enough however as it doesn't close any range
tombstone changes that might be active. Instead of piling on more
complexity, add an API to the compactor which allows abandoning the
current partition.
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
Commit 8c4b5e4283 introduced an optimization which only
calculates max purgeable timestamp when a tombstone satisfy the
grace period.
Commit 'repair: Get rid of the gc_grace_seconds' inverted the order,
probably under the assumption that getting grace period can be
more expensive than calculating max purgeable, as repair-mode GC
will look up into history data in order to calculate gc_before.
This caused a significant regression on tombstone heavy compactions,
where most of tombstones are still newer than grace period.
A compaction which used to take 5s, now takes 35s. 7x slower.
The reason is simple, now calculation of max purgeable happens
for every single tombstone (once for each key), even the ones that
cannot be GC'ed yet. And each calculation has to iterate through
(i.e. check the bloom filter of) every single sstable that doesn't
participate in compaction.
Flame graph makes it very clear that bloom filter is a heavy path
without the optimization:
45.64% 45.64% sstable_compact sstable_compaction_test_g
[.] utils::filter::bloom_filter::is_present
With its resurrection, the problem is gone.
This scenario can easily happen, e.g. after a deletion burst, and
tombstones becoming only GC'able after they reach upper tiers in
the LSM tree.
Before this patch, a compaction can be estimated to have this # of
filter checks:
(# of keys containing *any* tombstone) * (# of uncompacting sstable
runs[1])
[1] It's # of *runs*, as each key tend to overlap with only one
fragment of each run.
After this patch, the estimation becomes:
(# of keys containing a GC'able tombstone) * (# of uncompacting
runs).
With repair mode for tombstone GC, the assumption, that retrieval
of gc_before is more expensive than calculating max purgeable,
is kept. We can revisit it later. But the default mode, which
is the "timeout" (i.e. gc_grace_seconds) one, we still benefit
from the optimization of deferring the calculation until
needed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#13908
Allowing the validation level to be customized by whoever creates the
compaction state. Add a default value (the previous hardcoded level) to
avoid the churn of updating all call sites.
e2c9cdb576 moved the validation of the
range tombstone change to the place where it is actually consumed, so we
don't attempt to pass purged or discarded range tombstones to the
validator. In doing so however, the validate pass was moved after the
consume call, which moves the range tombstone change, the validator
having been passed a moved-from range tombstone. Fix this by moving he
validation to before the consume call.
Refs: #12575
The purpose of `_stop` is to remember whether the consumption of the
last partition was interrupted or it was consumed fully. In the former
case, the compactor allows retreiving the compaction state for the given
partition, so that its compaction can be resumed at a later point in
time.
Currently, `_stop` is set to `stop_iteration::yes` whenever the return
value of any of the `consume()` methods is also `stop_iteration::yes`.
Meaning, if the consuming of the partition is interrupted, this is
remembered in `_stop`.
However, a partition whose consumption was interrupted is not always
continued later. Sometimes consumption of a partitions is interrputed
because the partition is not interesting and the downstream consumer
wants to stop it. In these cases the compactor should not return an
engagned optional from `detach_state()`, because there is not state to
detach, the state should be thrown away. This was incorrectly handled so
far and is fixed in this patch, but overwriting `_stop` in
`consume_partition_end()` with whatever the downstream consumer returns.
Meaning if they want to skip the partition, then `_stop` is reset to
`stop_partition::no` and `detach_state()` will return a disengaged
optional as it should in this case.
Fixes: #12629Closes#13365
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788