Currently the new reversing sstable algorithms do not support fast
forwarding and the cache does not yet handle reversed results. This
forced us to disable the cache for reversed queries if we want to
guarantee bounded memory. We introduce an option that does this
automatically (without specifying `bypass cache` in the query) and turn
it on by default.
If the user decides that they prefer to keep the cache at the
cost of fetching entire partitions into memory (which may be viable
if their partitions are small) during reversed queries, the option can
be turned off. It is live-updateable.
The test generates a random set of mutations and creates two readers:
- one by reversing the mutations, creating an sstable out of the result,
and querying it in reverse,
- one by creating an sstable directly from the mutations and querying it
in forward mode.
It checks that the readers give equal results.
The test already managed to find a bug where offsets returned by the
sstable index were interpreted incorrectly as absolute instead of
relative. It also helped find another bug unrelated to reversing (#9352).
Surprisingly few tests use the random schema and random mutation
utilities which seem to be quite powerful.
We use partition_reversing_data_source and the new `index_reader` methods
to implement single-partition reads in `mx_sstable_mutation_reader`.
The parsing logic does not need to change: the buffers returned by the
source already contain rows in reversed clustering order.
Some changes were required in `mp_row_consumer_m` which processes the
parsed rows and emits appropriate mutation fragments. The consumer uses
`mutation_fragment_filter` underneath to decide whether a fragment
should be ignored or not (e.g. the parsed fragment may come from outside
the requested clustering range), among other things. Previously
`mutation_fragment_filter` was provided a `partition_slice`. If the
slice was reversed, the filter would use
`clustering_key_filter_ranges::get_ranges` to obtain the clustering
ranges from the slice in unreversed order (they were reversed in the
slice) since we didn't perform any reversing in the reader. Now the
reader provides the ranges directly instead of the slice; furthermore,
the ranges are provided in native-reversed format (the order of ranges
is reversed and the ranges themselves are also reversed), and the schema
provided to the filter is also reversed. Thus to the filter everything
appears as if it was used during a non-reversed query but on a table
with reversed schema, which works correctly given the fact that the
reader is feeding parsed rows into the consumer in reversed order.
During reversed queries the reader uses alternative logic for skipping
to a later range (or, speaking in non-reversed terms, to an earlier range),
which happens in `advance_context`. It asks the index to advance its
upper bound in reverse so that the reversing_data_source notices the
change of the index end position and returns following buffers with rows
from the new range.
There is a slight difference in behavior of the reader from
`mp_row_consumer_m`'s point of view. For non-reversed reads, after
the consumer obtains the beginning of a row (`consume_row_start`)
- which contains the row's position but not the columns - and tells the
reader that the row won't be emitted because we need to skip to a later
range, the reader would tell the data source (the 'context') immediately
to skip to a later range by calling `skip_to`. This caused the source
not to return the rest of the row, and the rest of the row would not
be fed to the consumer (`consume_row_end`). However, for reversed reads,
the data source performs skipping 'on its own', after it notices that
the index end position has changed. This may happen 'too late', causing
the rest of the row to be returned anyway. We are prepared for this
situation inside `mp_row_consumer` by consulting the mutation fragment
filter again when the rest of the row arrives.
Fast forwarding is not supported at this point, which is fine given that
the cache is disabled for reversed queries for now (and the cache is the
only user of fast forwarding).
The `partition_slice` provided by callers is provided in 'half-reversed'
format for reversed queries, where the order of clustering ranges is
reversed, but the ranges themselves are not. This means we need to modify
the slice sometimes: for non-single-partition queries the mx reader must
use a non-reversed slice, and for single-partition queries the mx reader
must use a native-reversed slice (where the clustering ranges themselves
are reversed as well). The modified slice must be stored somewhere; we
store it inside the mx reader itself so we don't need to allocate more
intermediate readers at the call sites. This causes the interface of
`mx::make_reader` to be a bit weird: for non-single-partition queries
where the provided slice is reversed the reader will actually return a
non-reversed stream of fragments, telling the user to reverse the stream
on their own. The interface has been documented in detail with
appropriate comments.
This patch adds an implementation of a data source that wraps an sstable
data file and returns data buffers with contents of one partition in the
sstable as if the rows of the partition were present in a reversed
order. In other words, to the user of the source the partition appears
to be reversed. We shall call this an 'intermediary' data source.
As part of the interface of the intermediary source the user is also
given read access to the source's current position over the data file,
and the constructor of the source takes a reference to `index_reader`.
This is necessary because the index operates directly on data file
offsets and we want the user to be able to use the index to skip
sequences of rows.
In order to ask the source to skip a sequence of rows - e.g. when jumping
between clustering ranges - the user must advance the index' upper bound
in reverse (to an earlier position). The source will then notice that
the end position of the index has changed and take appropriate action.
An alternative would be to translate the data positions of
`index_reader` to 'reversed positions' of the intermediary and then use
`skip_to` for skipping, as we do for forward reads. However this
solution would introduce more complexity to `index_reader` and the
intermediary source. One reason for the complexity in the input stream
is that we would have two kinds of skips: a single row skip,
and a skip to a clustering range. We know the offset of the next row,
so we could check that to differentiate them. We would also need to add
an information about the position of first clustering row and end of
the last one in the index_reader. Skipping by checking the index seems
to be overall simpler.
For simplicity, the intermediary stream always starts with
parsing the partition header and (if present) the static row,
and returning the corresponding bytes as a result of the first
read.
After partition header and static row we must find the last row entry of
the requested range. If the range ends before the partition end (i.e.
there are more row entries after the range) we can use the 'previous
unfiltered size' of the row following the range; otherwise we must scan
the last promoted index block and take its last row.
After finding the data range of the last row, we parse rows
consecutively in reversed order. We must parse the rows partially
to learn their lengths and the positions of previous rows. We're
using similar constructs as in the sstable parser, but it only
contains a small part of the parsing coroutine and doesn't perform
any correctness checks. The parser for rows still turned out rather
big mostly because we can't always deduce the size of the clustering
blocks without reading the block header.
The parser allows reading rows while skipping their bodies also in
non-reversed order, which we are making use of while reading the
last promoted index block.
The intermediary data source has one more utility: reversing range
tombstones. When we read a tombstone bound/boundary, we modify
the data buffer so that the resulting bound/boundary has the reversed
kind (so we don't read ends before starts) and the boundaries have their
before/after timestamps swapped.
In the sstable reader, we iterate over clustering ranges using the
index_reader, which normally only accepts advancing to increasing
positions. In this patch we add methods for advancing the index
reader in reverse.
To simplify our job we restrict our attention to a single implementation
of the promoted index block cursor: `bsearch_clustered_cursor`. The
`index_reader` methods for advancing in reverse will thus assume that
this implementation is used. The assumption is correct given that we're
working only with sstables of versions >= mc, which is indeed the
intended use case. We add some documentation in appropriate places to
make this obvious.
We extend `bsearch_clustered_cursor` with two methods:
`advance_past(pos)`, which advances the cursor to the first block after
`pos` (or to the end if there is no such block), and
`last_block_offset()`, which returns the data file offset of the first
row from the last promoted index block.
To efficiently find the position in the data file of the last row
of the partition (which we need when performing a reversed query)
the sstable reader may need to read the span of the entire last promoted
index block in the data file. To learn where the block starts it can use
`index_reader::last_block_offset()`, which is implemented in terms of
`bsearch_clustered_cursor::last_block_offset()`.
When performing a single partition read in forward order, the reader
asks the index to position its lower bound at the start of the partition
and its upper bound after the end of the slice. It starts by reading the
first range. After exhausting a range it jumps to the next one by asking
the index to advance the lower bound.
For reverse single partition reads we'll take a similar approach: the
initial bound positions are as in the forward case. However, we start
with the last range and after exhausting a range we want to jump to a
previous one; we will do it by advancing the upper bound in reverse
(i.e. moving it closer to the beginning of the partition). For this
we introduce the `index_reader::advance_reverse` function.
The cql_config_updater is a sharded<> service that exists in main and
whose goal is to make sure some db::config's values are propagated into
cql_config. There's a more handy updateable_value<> glue for that.
tests: unit(dev)
refs: #2795
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210927090402.25980-1-xemul@scylladb.com>
Fixed some errors in .github/CODEOWNERS (which is used by Github to
recommend who should review which pull request), and also add a few
additional ownerships I thought of.
This file could still use more work - if you can think of specific
files or directories you'd like to review changes in, please send a
patch for this file to add yourself to the appropriate paths.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929141118.378930-1-nyh@scylladb.com>
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reach this number, the largest bucket is closed and replaced by
a bucket. This will end up creating more output sstables during scrub
overall, but now they won't all be written at the same time causing
insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit the TWCS's
timestamp based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
Fixes: #9400
Tests: unit(dev)
Closes#9401
There are now 231 translation units that indirectly include commitlog.hh
due to the need to have access to db::commitlog::force_sync.
Move that type to a new file commitlog_types.hh and make it available
without access to the commitlog class.
This reduces the number of translation units that depend on commitlog.hh
to 84, improving compile time.
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.
In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.
Fixes#9406
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
Currently no mutation-source supports reading in reverse natively but
we are working on changing that, adding native reverse read support to
memtable, cache and sstable readers. To ensure that all mutation
sources work in a correct and uniform manner when reading in reverse,
we add a reverse test to the mutation source test suite. This test
reverses the data that it passes to `populate()`, then reads in
forward order (in reverse compared to the data order). For this we use
the currently established reverse read API: reverse schema (schema
order == query order) and half-reversed (legacy) slice. All mutation
sources are prepared to work with reversed reads, using the
`make_reversing_reader()` adapter. As we progress with our native
reverse support, we will replace these adapters with native reversing
support. As part of this, we push down the reversing reader adapter
currently existing on the `query::consume_page()` level, to the
individual mutation sources.
Closes#9384
* github.com:scylladb/scylla:
test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set
querier: consume_page(): remove now unused max_size parameter
test/lib: mutation_source_test: test reading in reverse
test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
treewide: move reversing to the mutation sources
mutation_query: reconcilable_result_builder: document reverse query preconditions
sstable_set: time_series_sstable_set: reverse mode
mutlishard_mutation_query: set max result size on used permits
db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
flat_mutation_reader: make_reversing_reader(): add convenience stored slice
mutation_reader: evictable_reader: add reverse read support
flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions
query-request: introduce `half_reverse_slice`
flat_mutation_reader_assertions: log what's expected
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak a SSTable if compaction strategy is changed, which causes
backlog to be computed incorrectly. Most of these problems happen because
sstable set and tracker are updated independently, so it could happen
that tracker lose track (pun intended) of changes applied to set.
The first patch will fix the leak when strategy is changed, and the third
patch will make sure that tracker is updated atomically with sstable set,
so these kind of problems will not happen anymore.
Fixes#9157
"
* 'fixes_to_backlog_tracker_v4' of github.com:raphaelsc/scylla:
compaction: Update backlog tracker correctly when schema is updated
compaction: Don't leak backlog of input sstable when compaction strategy is changed
compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
compaction: simplify removal of monitors
To ensure all mutation sources uniformly support the current API of
reverse reading: reversed schema and half-reversed slice. This test will
also ensure that once we switch to native-reverse slice, all
mutation-sources will keep on working.
For reversed reads we must adjust the lower/upper bounds used by the
`position_reader_queue` and `clustering_combined_reader`. The bounds are
calculated using the mutation schema, but we need bounds calculated
using the query schema which is reversed.
The mutation source test suite will soon test reads in reverse. Prepare
for this by checking the reversed flag on the slice and not reversing
the data when set. The test will have two modes effectively:
* Forward mode: data is reversed before read, the reversed again during
read.
* Reverse mode: data is already reversed and it is reversed back during
read.
The two might not be the same in case the schema was upgraded or if we
are reading in reverse. It is important to use the passed-in query
schema consistently during a read.
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused but is not
removed in this patch, it will be removed in a follow-up to reduce
churn.
DynamoDB limits the number of items that a BatchWriteItem call can write
to 25. As noted in issue #5057, in Alternator we don't have this limit
or any limit on the number of items in a BatchWriteItem - which probably
isn't wise.
This patch adds a simple xfailing test for this.
Refs #5057
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912140736.76995-1-nyh@scylladb.com>
The `expression` type (an std::variant) suffers from bad ergonomics:
- std::variant has poor/no constraints, so compiler error messages are long and uninformative
- it cannot be forward-declared (since std::variant does not support incomplete types)
- the type name is long, polluting compiler error messages and debug symbols
- it requires an artificial `nested_expression` when one expression is nested inside another
This series fixes those drawbacks by wrapping the variant in a class, adding constraints, and adding an extra indirection.
Test: unit (dev)
Closes#9402
* github.com:scylladb/scylla:
cql3: expr: drop nested_expression
cql3: expr: make expression forward declarable, easier to use
cql3: expr: construct column_value explicitly
cql3: expr: introduce as/as_if/is
cql3: expr: introduce expr::visit, replacing std::visit
On Ubuntu, scaling_governor becomes powersave after rebooted, even we configured cpufrequtils.
This is because ondemand.service, it unconditionally change scaling_governor to ondemand or powersave.
cpufrequtils will start before ondemand.service, scaling_governor overwrite by ondemand.service.
To configure scaling_governor correctly, we have to disable this service.
Fixes#9324Closes#9325
Now that expression can be nested in its component types
directly, we can remove nested_expression. Most of the patch
adjusts uses to drop the dereference that was needed for
nested_expression.
Make expression a class, holding a unique_ptr to a variant,
instead of just a variant.
This has some advantages:
- the constructor can be properly constrained
- the type can be forward-declared
- the type name is just "expression", rather than
a huge variant. This makes compiler error messages easier
to read.
- the internal indirection allows removal of nested_expression
(later in the series)
We have a few cases where a column_definition* is converted
directly to an expression without an explicit call to column_value{}.
The new expression implementation will not allow this, so make
these cases explicit. IMO this is better form than to rely
on the compiler picking the right expression subtype.
Simple wrappers for std::get, std::get_if, std::holds_alternative.
The new names are shorter and IMO more readable.
Call sites are updated.
We will later replace the implementation.
The new expr::visit() is just a wrapper around std::visit(),
but has better constraints. A call to expr::visit() with a
visitor that misses an overload will produce an error message
that points at the missing type. This is done using the new
invocable_on_expression concept. Note it lists the expression
types one by one rather than using template magic, since
otherwise we won't get the nice messages.
Later, we will change the implementation when expression becomes
our own type rather than std::variant.
Call sites are updated.
`time_series_sstable_set` uses `clustering_combined_reader` to implement
efficient single-partition reads. It provides a `position_reader_queue`
to the reader. This queue returns readers to the sstables from the set
in order of the sstables' lower bounds, and with each reader it provides
an upper bound for the positions-in-partition returned by the reader.
Until now we would assume non-reversed queries only. Reversed queries
were implemented by performing forward query in the lower layers
and reversing the results at the upper-most layer of the reader stack.
Before pushing the reversing down to the sources (in particular,
to sstable readers), we need to support the reverse mode in
`time_series_sstable_set` and the queue it provides to
`clustering_combined_reader`.
This requires using different lower and upper bounds in the queue.
For non-reversed reads we used `sstable::min_position()` as the lower
bound and `sstable::max_position()` as the upper bound. For reversed
reads all comparisons performed by `clustering_combined_reader` will be
reversed, as it will use a reversed schema. We can then use
`sstable::max_position().reversed()` for the lower bound and
`sstable::min_position().reversed()` for the upper bound.
08042c1688 added the query max result size
to the permit but only set it for single partition queries. This patch
does the same for range-scans in preparation of `query::consume_page()`
not propagating max size soon.
The two might not be the same in case the schema was upgraded (unlikely
for virtual tables) or if we are reading in reverse. It is important to
use the passed-in query schema consistently during a read.
This serves as a convenience slice storage for reads that have to
store an edited slice somewhere. This is common for reads that work
with a native-reversed slice and so have to convert the one used in the
query -- which is in half-reversed format.
Evictable reader has to be made aware of reverse reads as it checks/edits the
slice. This shouldn't require reverse awareness normally, it is only
required because we still use the half-reversed (legacy) slice format
for reversed reads. Once we switch to the native format this commit can
be reverted.
* scylla-dev/raft-misc-v4-docedit:
raft: document pre-voting and protection against disruptive leaders
raft: style edits of README.md.
raft: document snapshot API
Currently the following can happen:
1) there's ongoing compaction with input sstable A, so sstable set
and backlog tracker both contains A.
2) ongoing compaction replaces input sstable A by B, so sstable set
contains only B now.
3) schema is updated, so a new backlog tracker is built without A
because sstable set now contains only B.
4) ongoing compaction tries to remove A from tracker, but it was
excluded in step 3.
5) tracker can now have a negative value if table is decreasing in
size, which leads to log(<negative number>) == -NaN
This problem happens because backlog tracker updates are decoupled
from sstable set updates. Given that the essential content of
backlog tracker should be the same as one of sstable set, let's move
tracker management to table.
Whenever sstable set is updated, backlog tracker will be updated with
the same changes, making their management less error prone.
Fixes#9157
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The generic backlog formula is: ALL + PARTIAL - COMPACTING
With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.
But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.
With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
By removing this tracking mechanism, pointless ongoing compaction
will be ignored as expected and the leaks will be fixed.
Later, the intention is to force a stop on ongoing compactions if
strategy has changed as they're pointless anyway.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>