Scrub compaction in segregate mode can split the input sstable into as
many as hundreds or even thousands of output sstables in the extreme
case. But even at a few dozen output sstables, most of these will only
have a few partitions with a few rows. These sstables however will still
have their bloom filter allocated according to the original
partition-count estimate, causing memory bloat or even OOM in the
extreme case.
This patch solves this by aggressively adjusting the partition count
downwards after the second bucket has been created. Each subsequent
bucket will halve the partition estimate, which will quickly reach 1.
Fixes: #9463Closes#9464
rewrite_sstables() can be asked to stop either on shutdown or on an
user-triggered comapction which forces all ongoing compaction to stop,
like scrub.
turns out we weren't actually bailing out from do_until() when task
cannot proceed. So rewrite_sstables() potentially runs into an infinite
loop which in turn causes shutdown or something else waiting on it
to hang forever.
found this while auditting code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211005233601.155442-1-raphaelsc@scylladb.com>
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."
* 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla:
compaction: split compaction info and data for control
compaction_manager: use task when stopping a given compaction type
compaction: remove start_size and end_size from compaction_info
compaction_manager: introduce helpers for task
compaction_manager: introduce explicit ctor for task
compaction: kill sstables field in compaction_info
compaction: kill table pointer in compaction_info
compaction: simplify procedure to stop ongoing compactions
compaction: move management of compaction_info to compaction_manager
compaction: move output run id from compaction_info into task
The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception`
and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown
or api event vs. compaction aborting due to scrub validation error.
While at it, cleanup the existing retry logic related to `compaction_stop_exception`.
Test: unit(dev)
Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267)
Closes#9321
* github.com:scylladb/scylla:
compaction: split compaction_aborted_exception from compaction_stopped_exception
compaction_manager: maybe_stop_on_error: rely on retry=false default
compaction_manager: maybe_stop_on_error: sync return value with error message.
compaction: drop retry parameter from compaction_stop_exception
compaction_manager: move errors stats accounting to maybe_stop_on_error
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refer to actual ongoing compaction tasks,
whereas _tasks refer to manager tasks which is responsible for
spawning new compactions, retry them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.
So stopping a task becomes simpler, and duplication is naturally
gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
There will be unbounded growth of pending tasks if they are submitted
faster than retiring them. That can potentially happen if memtables
are frequently flushed too early. It was observed that this unbounded
growth caused task queue violations as the queue will be filled
with tons of tasks being reevaluated. By avoiding duplication in
pending task list for a given table T, growth is no longer unbounded
and consequently reevaluation is no longer aggressive.
Refs #9331.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210930125718.41243-1-raphaelsc@scylladb.com>
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reach this number, the largest bucket is closed and replaced by
a bucket. This will end up creating more output sstables during scrub
overall, but now they won't all be written at the same time causing
insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit the TWCS's
timestamp based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
Fixes: #9400
Tests: unit(dev)
Closes#9401
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes#9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic backlog formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
by switching to unordered_map, removal of generated monitors is
made easier. this is a preparatory change for patch which will
remove monitor for all exhausted sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info must only contain info data to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, also allowing for
future improvements like removing direct references to table
in compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
compaction_info will eventually only be used for exporting data about
ongoing compactions, so task must be used instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
those stats aren't used in compaction stats API and therefore they
can be removed. end_size is added to compaction_result (needed for
updating history) and start_size can be calculated in advance.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refer to actual ongoing compaction tasks,
whereas _tasks refer to manager tasks which is responsible for
spawning new compactions, retry them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move compaction into task, such that manager won't have to
look at both when deciding to do something like stopping a task.
So stopping a task becomes simpler, and duplication is naturally
gone.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Today, compaction is calling compaction manager to register / deregister
the compaction_info created by it.
This is a layer violation because manager sits one layer above
compaction, so manager should be responsible for managing compaction
info.
From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.
This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
this run id is used to track partial runs that are being written to.
let's move it from info into task, as this is not an external info,
but rather one that belongs to compaction_manager.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes#9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic back formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
by switching to unordered_map, removal of generated monitors is
made easier. this is a preparatory change for patch which will
remove monitor for all exhausted sstables
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Indicate whether the compaction job should be aborted
due to an error using a new, compaction_aborted_exception type,
vs. compaction_stopped_exception that indicates
the task should be stopped due to some external event that
doesn't indicate an error (like shutdown or api call).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
message.
It is misleading to set retry to true in the following statement
and return it later on when the `will_stop` parameter is true.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently, _stats.errors is not accounted for non-retryable errors
like storage_io_error.
Fixes#9354
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This warning can catch a virtual function that thinks it
overrides another, but doesn't, because the two functions
have different signatures. This isn't very likely since most
of our virtual functions override pure virtuals, but it's
still worth having.
Enable the warning and fix numerous violations.
Closes#9347
the name compaction_options is confusing as it overlaps in meaning
with compaction_descriptor. hard to reason what are the exact
difference between them, without digging into the implementation.
compaction_options is intended to only carry options specific to
a give compaction type, like a mode for scrub, so let's rename
it to compaction_type_options to make it clearer for the
readers.
[avi: adjust for scrub changes]
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210908003934.152054-1-raphaelsc@scylladb.com>
Corrupt keys might be printed as non-utf8 strings to the log,
and that, in turn, may break applications reading the logs,
such as Python (3.7)
For example:
```
Traceback (most recent call last):
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1148, in tearDown
self.cleanUpCluster()
File "/home/bhalevy/dev/scylla-dtest/dtest.py", line 1184, in cleanUpCluster
matches = node.grep_log(expr)
File "/home/bhalevy/dev/scylla-ccm/ccmlib/node.py", line 367, in grep_log
for line in f:
File "/usr/lib64/python3.7/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 5577: invalid start byte
```
Test: unit(dev)
DTest: scrub_with_one_node_expect_data_loss_test
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210730105428.2844668-1-bhalevy@scylladb.com>
Take write lock for cf to serialize cleanup/upgrade sstables/scrub
with major compaction/reshape/reshard.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
since we might potentially have ongoing compactions, and we
must ensure that all sstables created before we run are scrubbed,
we need to barrier out any previously running compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
even sstables being compacted must be validated. otherwise scrub
validate may return false negative.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Scrub is supposed to not remove anything from input, write it as is
while fixing any corruption it might have. It shouldn't have any
assumption on the input. Additionally, a data shadowed by a tombstone
might be in another corrupted sstable, so expired tombstones should
not be purged in order to prevent data ressurection from occurring.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210904165908.135044-1-raphaelsc@scylladb.com>
"
A special-purpose reader which doesn't use the index at all, designed to
be used in circumstances where the index is not reliable. The use-case
is scrub and validate which often have to work with corrupt indexes and
it is especially important that they don't further any existing
corruption.
Tests: unit(dev)
"
* 'crawling-sstable-reader/v2' of https://github.com/denesb/scylla:
compaction: scrub/validate: use the crawling sstable reader
sstables: wire in crawling reader
sstables: mx/reader: add crawling reader
sstables: kl/reader: add crawling reader