lookup_readers might fail after populating some readers
and those better be closed before returning the exception.
Fixes#10351
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes#10425
The change is mostly mechanical: update all compactor instances to the
_v2 variant and update all call-sites, of which there is not that many.
As a consequence of this patch, queries -- both single-partition and
range-scans -- now do the v2->v1 conversion in the consumers, instead of
in the compactor.
Mostly mechanical transformation. The main difference is in the detached
compaction state, from which we now get the range tombstone change,
instead of the range tombstone list. The code around this is a bit
awkward, will become simpler when compactor drops v1 support.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
"
With this series the mutation compactor can now consume a v2 stream. On
the output side it still uses v1, so it can now act as an online
v2->v1 converter. This allows us to push out v2->v1 conversion to as far
as the compactor, usually the next to last component in a read pipeline,
just before the final consumer. For reads this is as far as we can go,
as the intra-node ABI and hence the result-sets built are v1. For
compaction we could go further and eliminate conversion altogether, but
this requires some further work on both the compactor and the sstable
writer and so it is left to be done later.
To summarize, this patchset enables a v2 input for the compactor and it
updates compaction and single partition reads to use it.
"
* 'mutation-compactor-consume-v2/v1' of https://github.com/denesb/scylla:
table: add make_reader_v2()
querier: convert querier_cache and {data,mutation}_querier to v2
compaction: upgrade compaction::make_interposer_consumer() to v2
mutation_reader: remove unecessary stable_flattened_mutations_consumer
compaction/compaction_strategy: convert make_interposer_consumer() to v2
mutation_writer: migrate timestamp_based_splitting_writer to v2
mutation_writer: migrate shard_based_splitting_writer to v2
mutation_writer: add v2 clone of feed_writer and bucket_writer
flat_mutation_reader_v2: add reader_consumer_v2 typedef
mutation_reader: add v2 clone of queue_reader
compact_mutation: make start_new_page() independent of mutation_fragment version
compact_mutation: add support for consuming a v2 stream
compact_mutation: extract range tombstone consumption into own method
range_tombstone_assembler: add get_range_tombstone_change()
range_tombstone_assembler: add get_current_tombstone()
Consuming either a v1 or v2 stream is supported now, but compacted
fragments are still emitted in the v1 format, thus the compactor acts an
online downgrader when consuming a v2 stream. This allows pushing out
downgrade to v1 on the input side all the way into the compactor. This
means that reads for example can now use an all v2 reader pipeline, the
still mandatory downgrade to v1 happening at the last possible place:
just before creating the result-set. Mandatory because our intra-node
ABI is still v1.
There are consumers who are ready for v2 in principle (e.g. compaction),
they have to wait a little bit more.
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
if fill_buffer() is called after EOS, underlying reader will
be fast forwarded to a range pointed to by an invalid iterator,
so producing incorrect results.
fill_buffer() is changed to return early if EOS was found,
meaning that underlying reader already fast forwarded to
all ranges managed by multi_range_reader.
Usually, consume facilities check for EOS, before calling
fill_buffer() but most reader impl check for EOS to avoid
correctness issues. Let's do the same here.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211208131423.31612-1-raphaelsc@scylladb.com>
When multiple ranges are passed to `multishard_{mutation,data}_query()`,
it wraps the multishard reader with a multi-range one. This interferes
with the disassembly of the multishard reader's buffer at the end of the
page, because the multi-range reader becomes the top-level reader,
denying direct access to the multishard reader itself, whose buffer is
then dropped. This confuses the reading logic, causing data corruption
on the next page(s). A further complication is that the multi-range
reader can include data from more then one range in its buffer when
filling it. To solve this, a special-purpose multi-range is introduced
and used instead of the generic one, which solves both these problems by
guaranteeing that:
* Upon calling fill_buffer(), the entire content of the underlying
multishard reader is moved to that of the top-level multi-range
reader. So calling `detach_buffer()` guarantees to remove all
unconsumed fragments from the top-level readers.
* fill_buffer() will never mix data from more than one ranges. It will
always stop on range boundaries and will only cross if the last range
was consumed entirely.
With this, multi-range reads finally work with reader-saving.
The reader_lifecycle_policy API was created around the idea of shard
readers (optionally) being saved and reused on the next page. To do
this, the lifecycle policy has to also be able to control the lifecycle
of by-reference parameters of readers: the slice and the range. This was
possible from day 1, as the readers are created through the lifecycle
policy, which can intercept and replace the said parameters with copies
that are created in stable storage. There was one whole in the design
though: fast-forwarding, which can change the range of the read, without
the lifecycle policy knowing about this. In practice this results in
fast-forwarded readers being saved together with the wrong range, their
range reference becoming stale. The only lifecycle implementation prone
to this is the one in `multishard_mutation_query.cc`, as it is the only
one actually saving readers. It will fast-forward its reader when the
query happens over multiple ranges. There were no problems related to
this so far because no one passes more than one range to said functions,
but this is incidental.
This patch solves this by adding an `update_read_range()` method to the
lifecycle policy, allowing the shard reader to update the read range
when being fast forwarded. To allow the shard reader to also have
control over the lifecycle of this range, a shared pointer is used. This
control is required because when an `evictable_reader` is the top-level
reader on the shard, it can invoke `create_reader()` with an edited
range after `update_read_range()`, replacing the fast-forwarded-to
range with a new one, yanking it out from under the feet of the
evictable reader itself. By using a shared pointer here, we can ensure
the range stays alive while it is the current one.
The read itself has to be done with the reversed schema (query schema)
but the result building has to be done with the table schema. For data
queries this doesn't matter, but replicate the distinction for
consistency (and because this might change).
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused but is not
removed in this patch, it will be removed in a follow-up to reduce
churn.
08042c1688 added the query max result size
to the permit but only set it for single partition queries. This patch
does the same for range-scans in preparation of `query::consume_page()`
not propagating max size soon.
Add the content of the compaction stats introduced in the previous patch
to the tracing data. This will help diagnose query performance related
problems caused by tombstones.
In order to avoid needless throwing, exceptions are passed
directly wherever possible. Two mechanisms which help with that are:
1. make_exception_future<> for futures
2. co_return coroutine::exception(...) for coroutines
which return future<T> (the mechanism does not work for future<>
without parameters, unfortunately)
This method is both a convenience method to obtain the permit, as well
as an abstraction to allow different implementations to get creative.
For example, the main implementation, the one in multishard mutation
query returns the permit of the saved reader one was successful. This
ensures that on a multi-paged read the same permit is used across as
much pages as possible. Much more importantly it ensures the evictable
reader wrapping the actual reader both use the same permit.
Ensure that when the reader has to be created anew the passed-in permit
is used to create it, instead of the one left over in remote-parts,
which is that of the already evicted reader.
This lays the groundwork to ensure the same permit is used across all
pages of a read, by a future patch which creates the wrapping reader
with the existing permit.
These convenience methods are not used as much anymore and they are not
even really necessary as the register/unregister inactive read API got
streamlined a lot to the point where all of these "convenience methods"
are just one-liners, which we can just inline into their few callers
without loosing readability.
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s) predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping if readers.
A consequence of this move is that handling errors that might happen
during the stopping of the reader is now handled in the shard reader,
not all lifecycle policy implementations.
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
Currently unregister_inactive_read for other shards is moved
to the background with nothing keep the respective
reader_concurrency_semaphore around.
This change runs the loop in parallel_for_each
so that we don't have to serially wait on all of them
but rather they can run in parallel on all shards, but
all are waited on via the returned future<>.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently only if the reader_meta is in the saved state
we unregister its inactive_read, yet it is possible
that it will hold an inactive_read also in the lookup state.
To cover all cases, rather than testing the reader_state,
unregister if the inactive_read_handle is engaged.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
As it can't throw. This is needed to simplify the following
patch that will always close the reader in read_page.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
034cb81323 and 0f0c3be disallowed reverse partition-range scans based on
the observation that the CQL frontend disallows them, assuming that
other client APIs also disallow them. As it turns out this is not true
and there it at least one client API (Thrift) which does allows reverse
range scans. So re-enable them.
Fixes: #8211
Tests: unit(release), dtest(thrift_tests.py)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210304142249.164247-1-bdenes@scylladb.com>
Refuse reverse queries just like in the new
`query_data_on_all_shards()`. The reason is the same, reverse range
scans are not supported on the client API level and hence they are
underspecified and more importantly: not tested.
A data query variant of the existing `query_mutations_on_all_shards()`.
This variant builds a `query::result`, instead of `reconcilable_result`.
This is actually the result format coordinators want when executing
range scans, the reason for using the reconcilable result for these
queries is historic, and it just introduces an unnecessary intermediate
format.
This new method allows the storage proxy to skip this intermediate
format and the associated conversion to `query::result`, just like we do
for single partition queries.
Reverse queries are refused because they are not supported on the client
API (CQL) level anyway and hence it is unspecified how they should work
and more importantly: they are not tested.
We want to add support to building `query::result` directly and reuse
the code path we use to build reconcilable result currently for it.
So templatize said code path on the result builder used. Since the
different result builders don't have a source level compatible interface
an adaptor class is used.
In the next patches we are going to generalize the query logic w.r.t.
the result builder used, so query_mutations_on_all_shards() will be just
a facade parametrizing the actual query code with the right result
builder.