Permits have to wait for re-admission after having been evicted. This
happens via `reader_permit::maybe_wait_readmission()`. The user of this
method -- the evictable reader -- uses it to re-wait admission when the
underlying reader was evicted. There is one tricky scenario however,
when the underlying reader is created for the first time. When the
evictable reader is part of a multishard query stack, the created reader
might in fact be a resumed, saved one. These readers are kept in an
inactive state until actually resumed. The evictable reader shares it
permit with the to-be-resumed reader so it can check whether it has been
evicted while saved and needs to wait readmission before being resumed.
In this flow it is critical that there is no preemption point between
this check and actually resuming the reader, because if there is, the
reader might end up actually recreated, without having waited for
readmission first.
To help avoid this situation, the existing `maybe_wait_readmission()` is
split into two methods:
* `bool reader_permit::needs_readmission()`
* `future<> reader_permit::wait_for_readmission()`
The evictable reader can now ensure there is no preemption point between
`needs_readmission()` and resuming the reader.
Fixes: #10187
Tests: unit(release)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20220315105851.170364-1-bdenes@scylladb.com>
"
Namely the query result writer and the reconcilable result builder, used
for building results for regular queries and mutation queries (used in
read repair) respectively.
With this, there are no users left for the v1 output of the compactor,
so we remove that, making the compactor v2 all-the-way (and simpler).
This means that for regular queries, a downgrade phase is eliminated
completely, as regular queries don't store range tombstone in their
result, so no need to convert them.
Tests: unit(dev, release, debug)
"
* 'result-builders-v2/v1' of https://github.com/denesb/scylla:
reconcilable_result_builder: remove v1 support
query_result_builder: remove v1 support
mutation_compactor: drop v1 related code-paths
mutation_compactor: drop v1 support altogether from the API
tree: migrate to the v2 consumer APIs
test/boost/mutation_test: remove v1 specific test code
querier: switch to v2 compactor output
reconcilable_result_builder: add v2 support
query_result_writer: add v2 support
query_result_builder: make consume(range_tombstone) noop
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes#10194
Reader recreation messes with the continuity of the mutation fragment
stream because it breaks snapshot isolation. We cannot guarantee that a
range tombstone or even the partition started before will continue after
too. So we have to make sure to wrap up all loose threads when
recreating the reader. We already close uncontinued partitions. This
commit also takes care of closing any range tombstone started by
unconditionally emitting a null range tombstone. This is legal to do,
even if no range tombstone was in effect.
We thought that unlike v1, v2 will not need this. But it does.
Handled similarly to how v1 did it: we ensure each buffer represents
forward progress, when the last fragment in the buffer is a range
tombstone change:
* Ensure the content of the buffer represents progress w.r.t.
_next_position_in_partition, thus ensuring the next time we recreate
the reader it will continue from a later position.
* Continue reading until the next (peeked) fragment has a strictly
larger position.
The code is just much nicer because it uses coroutines.
The evictable reader has a handful of flags dictating what to do after
the reader is recreated: what to validate, what to drop, etc. We
actually need a single flag telling us if the reader was recreated or
not, all other things can be derived from existing fields.
This patch does exactly that. Furthermore it folds do_fill_buffer() into
fill_buffer() and replaces the awkward to use `should_drop_fragment()`
with `examine_first_fragments()`, which does a much better job of
encapsulating all validation and fragment dropping logic.
This code reorganization also fixes two bugs introduced by the v2
conversion:
* The loop in `do_fill_buffer()` could become infinite in certain
circumstances due to a difference between the v1 and v2 versions of
`is_end_of_stream()`.
* The position of the first non-dropped fragment is was not validated
(this was integrated into the range tombstone trimming which was
thrown out by the conversion).
Before we add a v2 output option to the compactor, we want to get rid of
all the v1 inputs to make it simpler. This means that for a while the
compacting reader will be in a strange place of having a v2 input and a
v1 output. Hopefully, not for long.
1. There's nothing we can do about this error.
2. It doesn't affect any query
3. No need to reprort timeout errors here.
Refs #10029
Note that in 4.6.rc4-0.20220203.34d470967a0 (where the issue above was opened against)
the error is likely to be related to read_ahead failure which
is already reported as a warning in master since fc729a804b.
When backported, this patch should be applied after:
fc729a804bd7a993043d
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220207080041.174934-1-bhalevy@scylladb.com>
Fast forwarding is delegated to the underlying reader and assumes the
it's supported. The only corner case requiring special handling that has
shown up in the tests is producing partition start mutation in the
forwarding case if there are no other fragments.
compacting state keeps track of uncompacted partition start, but doesn't
emit it by default. If end of stream is reached without producing a
mutation fragment, partition start is not emitted. This is invalid
behaviour in the forwarding case, so I've added a public method to
compacting state to force marking partition as non-empty. I don't like
this solution, as it feels like breaking an abstraction, but I didn't
come across a better idea.
Tests: unit(dev, debug, release)
Message-Id: <20220128131021.93743-1-mikolaj.sieluzycki@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
Almost all mechanical: not passing a `reader` parameter around when we
know it's the `_reader` member, folding a short one-use method into
its caller.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Cloning instead of converting because there is at least one
downstream (via multishard_combining_reader) use that is not
straightforward to convert (multishard_mutation_query).
The clone is mostly mechanical and much simpler than the original,
because it does not have to deal with range tombstones when deciding
if it is safe to pause the wrapped reader, and also does not have to
trim any range tombstones.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
As this reader is used in a wide variety of places, it would be a
nightmare to upgrade all such sites in one go. So create a v2 clone and
migrate users incrementally.
The meat of the change is on the fragment merger level, which is now
also responsible for merging range tombstone changes. The fragment
producers are just mechanically converted to v2 by appending `_v2` to
the appropriate type names.
The beauty of this approach is that range tombstone merging happens in a
single place, shared by all fragment producers (there is 2 of them).
Selectors and factory functions are left as v1 for now, they will be
converted incrementally by the next patches.
The fragment producer component of the combined reader returns a batch
of fragments on each call to operator()(). These fragments are merged
into a single one by the fragment merger. This patch adds a stream id to
each fragment in the batch which identifies the stream (reader) it
originates from. This will be used in the next patches to associate
range-tombstone-changes originating from the same stream with each other.
The reader_lifecycle_policy API was created around the idea of shard
readers (optionally) being saved and reused on the next page. To do
this, the lifecycle policy has to also be able to control the lifecycle
of by-reference parameters of readers: the slice and the range. This was
possible from day 1, as the readers are created through the lifecycle
policy, which can intercept and replace the said parameters with copies
that are created in stable storage. There was one whole in the design
though: fast-forwarding, which can change the range of the read, without
the lifecycle policy knowing about this. In practice this results in
fast-forwarded readers being saved together with the wrong range, their
range reference becoming stale. The only lifecycle implementation prone
to this is the one in `multishard_mutation_query.cc`, as it is the only
one actually saving readers. It will fast-forward its reader when the
query happens over multiple ranges. There were no problems related to
this so far because no one passes more than one range to said functions,
but this is incidental.
This patch solves this by adding an `update_read_range()` method to the
lifecycle policy, allowing the shard reader to update the read range
when being fast forwarded. To allow the shard reader to also have
control over the lifecycle of this range, a shared pointer is used. This
control is required because when an `evictable_reader` is the top-level
reader on the shard, it can invoke `create_reader()` with an edited
range after `update_read_range()`, replacing the fast-forwarded-to
range with a new one, yanking it out from under the feet of the
evictable reader itself. By using a shared pointer here, we can ensure
the range stays alive while it is the current one.
Allocating the exception might fail, terminating the application as
`abandoned()` is called in a noexcept context. Handle this case by
catching the bad-alloc and aborting the reader with that instead when
this happens.
Evictable reader has to be made aware of reverse reads as it checks/edits the
slice. This shouldn't require reverse awareness normally, it is only
required because we still use the half-reversed (legacy) slice format
for reversed reads. Once we switch to the native format this commit can
be reverted.
Now that the timeout is stored in the reader
permit use it for admission rather than a timeout
parameter.
Note that evictable_reader::next_partition
currently passes db::no_timeout to
resume_or_create_reader, which propagated to
maybe_wait_readmission, but it seems to be
an oversight of the f_m_r api that doesn't
pass a timeout to next_partition().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.
(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
requires something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).
We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.
Another change made to accommodate this switch is the replacement of the
replica side read execution stages which the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other any-more. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
When recreating the underlying reader, the evictable reader validates
that the first partition key it emits is what it expects to be. If the
read stopped at the end of a partition, it expects the first partition
to be a larger one. If the read stopped in the middle of a certain
partition it expects the first partition to be the same it stopped in
the middle of. This latter assumption doesn't hold in all circumstances
however. Namely, the partition it stopped in the middle of might get
compacted away in the time the read was paused, in which case the read
will resume from a greater partition. This perfectly valid cases however
currently triggers the evictable reader's self validation, leading to
the abortion of the read and a scary error to be logged. Relax this
check to accept any partition that is >= compared to the one the read
stopped in the middle of.
When the evictable reader recreates in underlying reader, it does it
such that it continues from the exact mutation fragment the read was
left off from. There are however two special mutation fragments, the
partition start and static row that are unconditionally re-emitted at
the start of a new read. To work around this, when stopping at either of
these fragments the evictable reader sets two flags
_drop_partition_start and _drop_static_row to drop the unneeded
fragments (that were already emitted by it) from the underlying reader.
These flags are then reset when the respective fragment is dropped.
_drop_static_row has a vulnerability though: the partition doesn't
necessarily have a static row and if it doesn't the flag is currently
not cleared and can stay set until the next fill buffer call causing the
static row to be dropped from another partition.
To fix, always reset the _drop_static_row flag, even if no static row
was dropped (because it doesn't exist).
"
This series prevents broken_promise or dangling reader errors
when (resharding) compaction is stopped, e.g. during shutdown.
At the moment compaction just closes the reader unilaterally
and this yanks the reader from under the queue_reader_handle feet,
causing dangling queue reader and broken_promise errors
as seen in #8755.
Instead, fix queue_reader::close to set value on the
_full/_not_full promises and detach from the handle,
and return _consume_fut from bucket_writer::consume
if handle is terminated.
Fixes#8755
Test: unit(dev)
DTest: materialized_views_test.py:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
"
* tag 'propagate-reader-abort-v3' of github.com:bhalevy/scylla:
mutation_writer: bucket_writer: consume: propagate _consume_fut if queue_reader_handle is_terminated
queue_reader_handle: add get_exception method
queue_reader: close: set value on promises on detach from handle
"
The close path of the multishard combining reader is riddled with
workarounds the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed after it doesn't need it, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy, the
semaphores created by it can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.
Tests: unit(dev, release:v2, debug:v2,
mutation_reader_test:debug -t test_multishard,
multishard_mutation_query_test:debug,
multishard_combining_reader_as_mutation_source:debug)
"
* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
mutation_reader: reader_lifecycle_policy: remove convenience methods
mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
test/lib/reader_lifcecycle_policy: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
reader_lifecycle_policy implementations: fix indentation
mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
mutation_reader: shard_reader::close(): wait on the remote reader
multishard_mutation_query: destroy remote parts in the foreground
mutation_reader: shard_reader::close(): close _reader
mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
To prevent broken_promise exception.
Since close() is manadatory the queue_reader destructor,
that just detaches the reader from the handle, is not
needed anymore, so remove it.
Adjust the test_queue_reader unit test accordingly.
Test: test_queue_reader(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without loosing readability.