Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.
Update _last_replay only if all batches were replayed.
Fixes: https://github.com/scylladb/scylladb/issues/24415.
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.
To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout + propagation_delay)
repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.
Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
Batches that fail on the initial send are retired later, until they
succeed. These retires happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.
Fixes: scylladb/scylladb#25432Closesscylladb/scylladb#26304
If there are pending mutations in the batchlog for a table that
has been dropped, we'll keep attempting to replay them but with
no success -- `db::no_such_column_family` exceptions will be thrown,
and we'll keep trying again and again.
To prevent that, we drop the batch in that case just like we do
in the case of a non-existing keyspace.
A reproducer test has been included in the commit. It fails without
the changes in `db/batchlog_manager.cc`, and it succeeds with them.
Fixesscylladb/scylladb#24806Closesscylladb/scylladb#26057
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.
The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.
It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.
Set a timeout on writes of replayed batches by the batchlog manager.
We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.
The timeout is set to be high enough to allow any reasonable write to
complete.
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.
To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.
To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.
Blobs can be large, and unfragmented blobs can easily exceed 128k
(as seen in #23903). Rename get_blob() to get_blob_unfragmented()
to warn users.
Note that most uses are fine as the blobs are really short strings.
Closesscylladb/scylladb#24102
This commit eliminates unused boost header includes from the tree.
Removing these unnecessary includes reduces dependencies on the
external Boost.Adapters library, leading to faster compile times
and a slightly cleaner codebase.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#22857
Recently, seastar rpc started accepting std::type_identity in addition
to boost::type as a type marker (while labeling the latter with an
ominous deprecation warning). Reduce our depedendency on boost
by switching to std::type_identity.
After the specified amount of replays, trigger a cleanup: flush batchlog
table memtables. This allows the cleanup to happen on a configurable
interval, instead of on every batchlog replay attempt, which might be
too much.
Add a flag controlling whether cleanup (memtable flush) will be done
after the replay. This is to allow repair to opt out from cleanup --
when many concurrenty repairs are running, there can be storms of calles
to do_batch_log_replay(), which will be mostly no-op, but they will all
attempt to flush the memtable to clean-up after themselves. This is
unnecessary and introduces latency to repairs, best to leave the cleanup
to the periodic batch-log replay.
the log.hh under the root of the tree was created keep the backward
compatibility when seastar was extracted into a separate library.
so log.hh should belong to `utils` directory, as it is based solely
on seastar, and can be used all subsystems.
in this change, we move log.hh into utils/log.hh to that it is more
modularized. and this also improves the readability, when one see
`#include "utils/log.hh"`, it is obvious that this source file
needs the logging system, instead of its own log facility -- please
note, we do have two other `log.hh` in the tree.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Scans should not pollute the cache with cold data, in general. In the
case of the batchlog table, there is another reason to bypass the cache:
this table can have a lot of partition tombstones, which currently are
not purged from the cache. So in certain cases, using the cache can make
batch replay very slow, because it has to scan past tombstones of
already replayed batches.
We have a commented code snippet from Origin with cleanup and a FIXME to
implement it. Origin flushes the memtables and kicks a compaction. We
only implement the flush here -- the flush will trigger a compaction
check and we leave it up to the compaction manager to decide when a
compaction is worthwhile.
This method used to be called only from unbootstrap, so a cleanup was
not really needed. Now it is also called at the end of repair, if the
table is using repair-based tombstone-gc. If the memtable is filled with
tombstones, this can add a lot of time to the runtime of each repair. So
flush the memtable at the end, so the tombstones can be purged (they
aren't purged from memtables yet).
Now when function context creation is encapsulated in lang::manager,
some .cc files can stop using wasm-specific headers and just go with the
lang/manager.hh one.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The only usage is in batchlog_manager, and it
can be replaced with cf.get_truncation_time().
std::optional<std::reference_wrapper<canonical_mutation>>
is replaced with canonical_mutation* since it is
semantically the same but with less type boilerplate.
... and sanitize the future used on stop.
The loop in question is now started in .start(), but all callers now
construct the manager late enough, so the loop spawning can be moved.
This also calls for renaming the future member of the class and allows
to make it regular, not shared, future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently the only caller of it is the batchlog manager itself. It
checks for the shard-id to be zero, calls the method, then the method
asserts that it's run on shard-0.
Moving the check into the method removes the need for assertion and
makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently drain() is called twise -- first time from
storage_service::drain() (on shutdown), second via
batchlog_manager::stop(). The routine is unintentinally re-entrable,
because:
- explicit check for not aborting the abort source twise
- breaking semaphore can be done multiple times
- co-await-ing of the _started future works because the future is shared
That's not extremely elegant, better to make the drain() bail out early
if it was already called.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The legacy failure_detector is now unused and can be removed.
TODO: integare direct_failure_detector with failure_detector api.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The manager will need system ks to get truncation record from, so add it
explicitly. Start-stop sequence no allows that
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.
Closes#12858
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.
mutation_reader remains in the readers/ module.
mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.
This is a step forward towards librarization or modularization of the
source base.
Closes#12788
Add include statements to satisfy dependencies.
Delete, now unneeded, include directives from the upper level
source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This is not identical change, if drain() resolves with exception we end
up skipping the gate closing, but since it's stop why bother
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The .drain() method can be called from several places, each needs to
wait for its completion. Now this is achieved with the help of a gate,
but there's a simpler way
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now, mutate/mutate_result accept a flag which decides whether the write
should be rate limited or not.
The new parameter is mandatory and all call sites were updated.
After fcb8d040 ("treewide: use Software Package Data Exchange
(SPDX) license identifiers"), many dual-licensed files were
left with empty comments on top. Remove them to avoid visual
noise.
Closes#10562
Only for reasons other than "no such KS", i.e. when the failure is
presumed transient and the batch in question is not deleted from
batchlog and will be retried in the future.
(Would info be more appropriate here than warning?)
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Closes#10556
When executing internal queries, it is important that the developer
will decide if to cache the query internally or not since internal
queries are cached indefinitely. Also important is that the programmer
will be aware if caching is going to happen or not.
The code contained two "groups" of `query_processor::execute_internal`,
one group has caching by default and the other doesn't.
Here we add overloads to eliminate default values for caching behaviour,
forcing an explicit parameter for the caching values.
All the call sites were changed to reflect the original caching default
that was there.
Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>