The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.
Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.
When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:
CREATE TABLE system.batchlog_v2 (
version int,
stage int,
shard int,
written_at timestamp,
id uuid,
data blob,
PRIMARY KEY ((version, stage, shard), written_at, id));
The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.
Fixes: https://github.com/scylladb/scylladb/issues/23358
This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.
Closesscylladb/scylladb#26671
* github.com:scylladb/scylladb:
db/config: change batchlog_replay_cleanup_after_replays default to 1
test/boost/batchlog_manager_test: add test for batchlog cleanup
replica/mutation_dump: always set position weight for clustering positions
service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
test/lib: introduce error_injection.hh
utils/error_injection: add debug log to disable() and disable_all()
test/lib/cql_test_env: forward config to batchlog
test/lib/cql_test_env: add batch type to execute_batch()
test/lib/cql_assertions: add with_size(predicate) overload
test/lib/cql_assertions: add source location to fail messages
test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
db/batchlog_manager: config: s/write_timeout/reply_timeot/
db,service: switch to system.batchlog_v2
db/system_keyspace: introduce system.batchlog_v2
service,db: extract generation of batchlog delete mutation
service,db: extract get_batchlog_mutation_for() from storage-proxy
db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
data_dictionary: table: add get_truncation_time()
db/batchlog_manager: batch(): replace map_reduce() with simple loop
db/batchlog_manager: finish coroutinizing replay_all_failed_batches
db/batchlog_manager: improve replayAllFailedBatches logs
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.
To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.
Fixes#27531
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closesscylladb/scylladb#27532
The std::optional<T> inject_parameter(...) method is a template, and in
dev/debug modes this parameter is defaulted to std::string_view, but for
release mode it's not. This patch makes it symmetrical.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26706
With this, now it is possible to have two-way communication between
the error injection point and its enabler. The test can enable the error
injection point, then wait until it is hit, before proceedin.
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Replace boost with a standard facility; this reduces dependencies
as lexical_cast depends on boost ranges.
As a side effect the exception error message is improved.
The overload was introduced by a8b14b0227 (utils: add timeout error
injection with lambda), but is only used by the test nowadays.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21377
The wrapper object denotes that injection should run a handler and
wait_for_message() on it. Wrapper carries the timeout used to call the
mentioned method. It's currently unused, next patches will start enjoing
it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of inject() overloads check if the injection is enabled, then
optionally clear the one-shot one, then do the injection. Everything
but doing the injection is implemented in the enter() method, it's
perfectly worth re-using one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#21285
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.
Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.
To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
[1] 66ef711d68Closesscylladb/scylladb#20006
Allow injection points to write values into the parameter map, which
external code can then examine. This allows exfiltrating the values if
internal variables, to be examined by tests, without exposing these
variables via an "official" path.
in theory, std::result_of_t should have been removed in C++20. and
std::invoke_result_t is available since C++17. thanks to libstdc++,
the tree is compiling. but we should not rely on this.
so, in this change, we replace all `std::result_of_t` with
`std::invoke_result_t`. actually, clang + libstdc++ is already warning
us like:
```
In file included from /home/runner/work/scylladb/scylladb/multishard_mutation_query.cc:9:
In file included from /home/runner/work/scylladb/scylladb/schema/schema_registry.hh:11:
In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/unordered_map:38:
Warning: /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/type_traits:2624:5: warning: 'result_of<void (noop_compacted_fragments_consumer::*(noop_compacted_fragments_consumer &))()>' is deprecated: use 'std::invoke_result' instead [-Wdeprecated-declarations]
2624 | using result_of_t = typename result_of<_Tp>::type;
| ^
/home/runner/work/scylladb/scylladb/mutation/mutation_compactor.hh:518:43: note: in instantiation of template type alias 'result_of_t' requested here
518 | if constexpr (std::is_same_v<std::result_of_t<decltype(&GCConsumer::consume_end_of_stream)(GCConsumer&)>, void>) {
|
```
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#18835
In this PR, we ensure unpublished CDC generation's data is
never removed, which was theoretically possible. If it happened,
it could cause problems. CDC generation publisher would then try
to publish the generation with its data removed. In particular, the
precondition of calling `_sys_ks.read_cdc_generation` wouldn't be
satisfied.
We also add a test that passes only after the fix. However, this test
needs to block execution of the CDC generation publisher's loop
twice. Currently, error injections with handlers do not allow it
because handlers always share received messages. Apart from the
first created handler, all handlers would be instantly unblocked by
a message from the past that has already unblocked the first
handler. This seems like a general limitation that could cause
problems in the future, so in this PR, we extend injections with
handlers to solve it once and for all. We add the `share_messages`
parameter to the `inject` (with handler) function. Depending on its
value, handlers will share messages (as before) or not.
Fixesscylladb/scylladb#17497Closesscylladb/scylladb#17934
* github.com:scylladb/scylladb:
topology_coordinator: clean_obsolete_cdc_generations: fix log
topology_coordinator: do not clear unpublished CDC generation's data
topology_coordinator: cdc_generation_publisher_fiber injection: make handlers share messages
error_injection: allow injection handlers to not share messages
For a single injection, all created injection handlers share all
received messages. In particular, it means that one received message
unblocks all handlers waiting for the first message. This behavior
is often desired, for example, if multiple fibers execute the
injected code and we want to unblock them all with a single message.
However, there is a problem if we want to block every execution
of the injected code. Apart from the first created handler, all
handlers will be instantly unblocked by messages from the past that
have already unblocked the first handler.
In one of the following commits, we add a test that needs to block
the CDC generation publisher's loop twice. Since it looks like there
are no good workarounds for this arguably general problem, we extend
injections with handlers in a way that solves it. We introduce the
new `share_messages` parameter. Depending on its value, handlers
will share messages or not. The details are described in the new
comments in `error_injection.hh`.
We also add some basic unit tests for the new funcionality.
In this commit we extend the error_injector
with a new method inject_parameter. It allows
to pass parameters from tests to scylla, e.g. to
lower timeouts or limits. A typical use cases is
described in scylladb/scylladb#15571.
It's logically the same as inject_with_handler,
whose lambda reads the parameter named 'value'.
The only difference is that the inject_parameter
doesn't return future, it just read the
parameter from the injection shared_data.
In subsequent commit we'll need the injection_name from inside
injection_shared_data, so in this commit we move it there.
Additionally, we fix the todo about switching the injections dictionary
from map to unordered_set, now unordered_map contains
string_views, pointing to injection_name inside
injection_shared_data.
Injection parameters can be used in the lambda passed to
inject_with_handler method to take some values from
the test. However, there was no way to set values to these
parameters on node startup, only through
the error injection REST api. Therefore, we couldn't rely
on this when inject_with_handler is used during
node startup, it could trigger before we call the api
from the test.
In this commit with solve this problem by allowing these
parameters to be assigned through scylla.yaml config.
The defer.hh header was added to error_injection.hh to fix
compilation after adding error_injection.hh to config.hh,
defer function is used in error_injection.hh.
The recently renamed inject_with_handler() was a template, but it can be
symmetrical to its peer that accepts void function as a callback, and
use std::function as its argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The inject_with_handler() method accepts a coroutine that can be called
wiht injection_handler. With such function as an argument, there's no
need in distinctive inject_with_handler() name for a method, it can be
overload of all the existing inject()-s
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we always cast the wait duration to millisecond,
even if it could be using a higher resolution. actually
`std::chrono::steady_clock` is using `nanosecond` for its duration,
so if we inject a deadline using `steady_clock`, we could be awaken
earlier due to the narrowing of the duration type caused by the
duration_cast.
in this change, we just use the duration as it is. this should allow
the caller to use the resolution provided by Seastar without losing
the precision.
Fixes#15902
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
For now, `received_messages_counter` have only data for messaging the injection.
In future, there will be more data to keep, for example, a string-to-string map of
injection's parameters.
Rename this class and its attributes.
Currently, it is hard for injected code to wait for some events, for example,
requests on some REST endpoint.
This commit adds the `inject_with_handler` method that executes injected function
and passes `injection_handler` as its argument.
The `injection_handler` class is used to wait for events inside the injected code.
The `error_injection` class can notify the injection's handler or handlers
associated with the injection on all shards about the received message.
There is a counter of received messages in `received_messages_counter`; it is shared
between the injection_data, which is created once when enabling an injection on
a given shard, and all `injection_handler`s, that are created separately for each
firing of this injection. The `counter` is incremented when receiving a message from
the REST endpoint and the condition variable is signaled.
Each `injection_handler` (separate for each firing) stores its own private counter,
`_read_messages_counter` that private counter is incremented whenever we wait for a
message, and compared to the received counter. We sleep on the condition variable
if not enough messages were received.
Seastar is an external library from Scylla's point of view so
we should use the angle bracket #include style. Most of the source
follows this, this patch fixes a few stragglers.
Also fix cases of #include which reached out to seastar's directory
tree directly, via #include "seastar/include/sesatar/..." to
just refer to <seastar/...>.
Closes#10433
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Constrain inject() with a requires clause rather than enable_if,
simplifying the code and compiler diagnostics.
Note that the second instance could not have been called, since
the template argument does not appear in the function parameter
list and thus could not be deduced. This is corrected here.
Closes#8322
Clang dislikes forward-declared functions returning auto, so declare the
type up front. Functions returning auto are a readability problem
anyway.
To solve a circular dependency problem (get_local_injector() ->
error_injection<> -> get_local_injector()), which is further compounded
by problems in using template specializations before they are defined
(which is forbidden), the storage for get_local_injector() was moved
to error_injection<>, and get_local_injector() is just an accessor.
After this, error_injection<> does not depend on get_local_injector().
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intend of the code
more clearly.
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
Even though calling then() on a ready future does not allocate a
continuation, calling then on the result of it will allocate.
This error injection only adds a continuation in the dependency
chain if error injections are enabled at compile timeand this particular
error injection is enabled.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
For control flow (i.e. return) and simplicity add enter() method.
For disabled injections, this method is const returning false,
therefore it has no overhead.
Add boost test.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Fix disabled injection templates to match enabled ones.
Fix corresponding test to not be a continuation.
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
This removes the need to include reactor.hh, a source of compile
time bloat.
In some places, the call is qualified with seastar:: in order
to resolve ambiguities with a local name.
Includes are adjusted to make everything compile. We end up
having 14 translation units including reactor.hh, primarily for
deprecated things like reactor::at_exit().
Ref #1