We will use this type as the input to the BTI row index writer.
Since it will be implemented in other translation units,
the definition of the type has to be moved to a header.
A recent commit a0c29055e5 added
some trace printouts which print an std::reference_wrapper<>.
Apparently a formatter for this type was only added to fmt
in version 11.1.0, and it doesn't exist on earlier versions,
such as fmt 11.0.2 on Fedora 41.
Let's avoid requiring shiny-new versions of fmt. The workaround
is easy: just unwrap the reference_wrapper - print pr.get()
instead of just pr, and Scylla returns to building correctly on
Fedora 41.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25228
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.
However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.
For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)
Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.
Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome.
Preparation for new functionality, no backporting needed.
Closesscylladb/scylladb#25093
* github.com:scylladb/scylladb:
sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
test/boost: add a test for inexact index lookups
sstables/mx/reader: allow passing a custom index reader to the constructor
sstables/index_reader: remove advance_to
sstables/mx/reader: handle inexact lookups in `advance_context()`
sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
sstables/index_reader: make the return value of `get_partition_key` optional
sstables/mx/reader: handle "backward jumps" in forward_to
sstables/mx/reader: filter out partitions outside the queried range
sstables/mx/reader: update _pr after `fast_forward_to`
`advance_context()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`
But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.
E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).
And this is enough for `advance_context()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
This makes `advance_to` unused, so we remove it.
`advance_to_next_partition()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`
But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.
E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).
And this is enough for `advance_to_next_partition()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
A bunch of code assumes that the Data.db stream can only go forward.
But with BTI indexes, if we perform an advance_to, the index can point to a position
which the data reader has already passed, since the index is inexact.
The logic of the data reader ensures that it has stopped
within the last partition range, or just immediately
after it, after reading the next partition key and
noticing that it doesn't belong to the range.
But forward_to can only be used with increasing ranges.
The start of the next range must be greater or equal to the
end of the previous range.
This means that the exact start of the next partition range
must be no earlier than:
1. Before the partition key just read by the data reader,
if the data reader is positioned immediately after a partition key.
2. The start of the first partition after the current data reader
position, if the data reader isn't positioned immediately after a
partition key.
So, if the index returns a position smaller than the current data
reader position, then:
1. If the reader is immediately after a partition key,
we have to reuse this partition key (since we can't go back
in the stream to read it again), and keep reading from
the current position.
2. Otherwise we can safely walk the index to the first partition
that lies no earlier than the current position.
The current index format is exact: it always returns the position of the
first partition in the queried partition range.
But we are about the add an index format where that doesn't have to be the case.
In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares
the reader for that, by skipping the partitions which were read by the
data reader but don't belong to the queried range.
Note: as of this patch, only the "normal path" is ever used.
We add tests exercising these code paths later.
Also note that, as of this patch, actually stepping outside
the queried range would cause the reader to end up in a
state where the underlying parser is positioned right after
partition key immediately following the queried range.
If the reader was forwarded to that key in this state,
it would trip an assert, because the parser can't handle backward
jumps. We will add logic to handle this case in the next patch.
In later patches, we will prepare the reader for inexact index
implementations (ones which can return a Data file range that
includes some partitions before or after the queried range).
For that, we will need to filter out the partitions outside of the
range, and for that we need to remember the range. This is the
goal of this patch.
Note that we are storing a reference to an argument of
`fast_forward_to`. This is okay, because the contract
of `mutation_reader` specifies that the caller must
keep `pr` alive until the next `fast_forward_to`
or until the reader is destroyed.
As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system.
Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh
Fixes: #22102Fixes: #22103Fixes: #22105Closesscylladb/scylladb#25082
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).
In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_readers` interface.
Later, we will add BTI indexes which will also implement this interface.
This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.
Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.
No backports needed, this is a preparation for new functionality.
Closesscylladb/scylladb#25000
* github.com:scylladb/scylladb:
sstables: add sstable::make_index_reader() and use where appropriate
sstables/mx: in readers, use abstract_index_reader instead of index_reader
sstables: in validate(), use abstract_index_reader instead of index_reader where possible
test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
sstables/index_reader: introduce abstract_index_reader
sstables/index_reader: extract a prefetch_lower_bound() method
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.
We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.
So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
After we add a second index implementation, we will probably want to
adjust validate() to work with either implementation.
Some validations will be format-specific, but some will be common.
For now, let's use abstract_index_reader for the validations which
can be done through that interface, and let's have downcast-specific
codepaths for the others.
Note: we change a `get_data_file_position()` call to `data_file_positions().start`.
The call happens at the beginning of a partition, and at this points
these two expressions are supposed to be equivalent.
The sstable reader reaches directly for a `clustered_index_cursor`.
But a BTI index reader won't be able to implement
`clustered_index_cursor`, because a BTI index doesn't store
full clustering keys, only some trie-encoded prefixes.
So we want to weaken the dependency. Instead of reaching
for `clustered_index_cursor`, we add a method which expresses
our intent, and we let `index_reader` touch the cursor internally.
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written, consequently if the key is empty, it is omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full prefix.
This mis-match results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions.
Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery.
Add a full-stack test which checks that rows with bad keys are correctly handled.
Fixes: https://github.com/scylladb/scylladb/issues/24489
The bug is present in all versions, has to be backported to all supported versions.
Closesscylladb/scylladb#24492
* github.com:scylladb/scylladb:
test/boost/sstable_datafile_test: add test for corrupt data
sstables/mx/writer: handler rows with empty keys
test/lib/cql_assertions: introduce columns_assertions
sstables: add corrupt_data_handler to sstables::sstables
tools/scylla-sstable: make large_data_handler a local
db: introduce corrupt_data_handler
mutation: introduce frozen_mutation_fragment_v2
mutation/mutation_partition_view: read_{clustering,static}_row(): return row type
mutation/mutation_partition_view: extract de-ser of {clustering,static} row
idl-compiler.py: generate skip() definition for enums serializers
idl: extract full_position.idl from position_in_partition.idl
db/system_keyspace: add apply_mutation()
db/system_keyspace: introduce the corrupt_data table
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.
Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.
Remove `compressor::create()`. This enforces that compressors
are only created through the `sstable_compressor_factory`.
Unlike the synchronous `compressor::create()`, the factory will be able
to create dict-aware compressors.
SSTable readers and writers use `compressor` objects to compress and
decompress chunks of SSTable data files.
`compressor` objects are read-only, so only one of them is needed
for each SSTable. Before this commit, each reader and writer has
its own `compressor` object. This isn't necessary, but it's okay.
But later in this series it will stop being okay, because the creation
of a `compressor` will become an expensive cross-shard
operation (because it might require sharing a compression dictionary
from another shard). So we have to adjust the code so that there is
only once `compressor` per sstable, not one per reader/writer.
We stuff the ownership of this compressor into `sstable::compression`.
To make the ownership clear, we remove `compression_ptr` shared
pointers from readers and writers, and make them access the
compressor via the `sstable::compression` instead.
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in a read sstables.
Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).
However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.
Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.
Fixes#23484Closesscylladb/scylladb#23485
The class in question is a wrapper around output_stream that writes,
flushes and closes the stream in async context. For logging it also
keeps the component filename on board, and now it's good time to patch
it and keep the component_filename instead.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include
- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests
these are all good to explicitly convert component_names to strings.
There are few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly, partially it
will be fixed later, mostly -- as future follow-ups.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Most of the method callers use it as log parameter. There are few more
places that push it to malformed_sstable_exception, which immediately
converts it to string, so this patch makes the exception be constructed
with the component_name either.
And there's one more place that passes this string to file_writer
constructor. For now, convert it to string explicitly, but next patches
will fix that place to use pure component_name too.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the former.
It's shorter, nicer and makes further patching simpler.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Drop it from files that obviously don't need it. Also kill some forward
declarations while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#22979
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::stable_partition`.
in this change, we:
- replace `boost::range::stable_parition()` to
`std::ranges::stable_parition()`
- since `std::ranges::stable_parition()` returns a subrange instead of
an iterator, change the names of variables which were previously used
for holding the return value of `boost::range::stable_partition()`
accordingly for better readability.
- remove unused `#include` of boost headers
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21911
Reads which need sstable index were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1-2% of time.
Computing it involves type name parsing.
Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the least-recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
Also, fixes a potential use-after-free on schema in column_translation.
Closesscylladb/scylladb#21347
* github.com:scylladb/scylladb:
sstables: Fix potential use-after-free on column_translation::column_info::name
sstables: Avoid computing column_values_fixed_lengths on each read
Replace manual subrange advancement with the more concise and readable
`subrange.advance()` method. This change:
- Eliminates unnecessary subrange instance creation
- Improves code readability
- Reduces potential for unnecessary object allocation
- Leverages the built-in `advance()` method for cleaner iterator handling
The modification simplifies the iteration logic while maintaining the
same functional behavior.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21865
Replace boost::make_iterator_range() with std::ranges::subrange.
This change improves code modernization and reduces external dependencies:
- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>
This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21787
Reads which need clustering index cursor were computing
column_values_fixed_lengths each time. This showed up in perf profile
for a sstable-read heavy workload, and amounted to about 1%.
Avoid by using cached per-sstable mapping. There is already
sstable::_column_translation which can be used for this. It caches the
mapping for the most recently used schema. Since the cursor uses the
mapping only for primary key columns, which are stable, any schema
will do, so we can use the last _column_translation. We only need to
make sure that it's always armed, so sstable loading is augmented with
arming with sstable's schema.
now that we are allowed to use C++23. we now have the luxury of using
`std::views::transform`.
in this change, we:
- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
is not applicable to a range view returned by `std::view::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
`std::view::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
range returned by `std::view::transform`
- use `std::ranges::min()` to get the minimal element in the range
returned by `std::view::transform`
- use `std::ranges::equal()` to compare the range views returned
by `std::view::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
to feed `std::views::transform()` a view range.
to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.
this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.
limitations:
there are still a couple places where we are still using
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21700
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and std::string.
This alias in unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurances of sstring_view (the request in #4062), but
begins to do it by renaming one commonly-used function, to_sstring_view(bytes)
to to_string_view() and of course changes all its uses to the new name.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This PR enables compaction tasks to verify the integrity of the input data through checksum and digest checks. The mechanism for integrity checking was introduced in previous PRs (#20207, #20720) as a built-in functionality of the input streams. This PR integrates this mechanism with compaction. The change applies to all compaction types and covers both compressed and uncompressed SSTables adhering to the 3.x format. If a compaction task reads only part of an SSTable, then only the per-chunk checksums are verified, not the digest.
The PR consists of:
* Changes to mx readers to support integrity checking. The kl readers, considered as compatibility-only, were left unchanged. Also, integrity checking on single-partition reversed reads (`data_consume_reversed_partition()`) remains unsupported by mx readers as this is not used in compaction.
* Changes to `sstable` and `sstable_set` APIs to allow toggling integrity checks for mx readers.
* Activation of integrity checking for all compaction types.
* Tests for all compaction types with corrupted SSTables.
Integrity checks come at a cost. For uncompressed SSTables, the cost is the loading of the CRC and Digest components from disk, and the calculation of checksums and digest from the actual data. For compressed SSTables, checksums are stored in-place and they are being checked already on all reads, so the only extra cost is the loading and calculation of the digest. The measurements show a ~5% regression in compaction performance for uncompressed SSTables, and a negligible regression for compressed SSTables.
Command: `perf-sstable --smp=1 --cpuset=1 --poll-mode --mode=compaction --iterations=1000 --partitions 10000 --sstables=1 --key_size=4096 --num_columns=15 --column_size={32, 1024, 3500, 7000, 14500}`
Uncompressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 65175.59 +- 80.82 | 61814.63 +- 72.88 | 5.16% |
| 200 MiB | 41795.10 +- 60.39 | 39686.28 +- 45.05 | 5.05% |
| 500 MiB | 21087.41 +- 30.72 | 20092.93 +- 25.05 | 4.72% |
| 1 GiB | 12781.64 +- 21.77 | 12233.94 +- 21.71 | 4.29% |
| 2 GiB | 6629.99 +- 9.40 | 6377.13 +- 8.28 | 3.81% |
+--------------+-----------------------+----------------------+------------+
```
Compressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec) | Integrity (p/sec) | Regression |
+--------------+-----------------------+----------------------+------------+
| 50 MiB | 53975.05 +- 63.18 | 53825.93 +- 62.28 | 0.28% |
| 200 MiB | 28687.94 +- 26.58 | 28689.41 +- 26.91 | 0% |
| 500 MiB | 13865.35 +- 15.50 | 13790.41 +- 14.88 | 0.54% |
| 1 GiB | 7858.10 +- 7.71 | 7829.75 +- 9.66 | 0.36% |
| 2 GiB | 4023.11 +- 2.43 | 4010.54 +- 2.55 | 0.31% |
+--------------+-----------------------+----------------------+------------+
(p/sec = partitions/sec)
```
Refs #19071.
New feature, no backport is needed.
Closesscylladb/scylladb#21153
* github.com:scylladb/scylladb:
test: Add test for compaction with corrupted SSTables
compaction: Enable integrity checks for all compaction types
sstables: Add integrity option to factories for sstable_set readers
sstables: Add integrity option to sstable::make_reader()
sstables: Add integrity option to mx::make_reader()
sstables: Load checksums and digests in mx full-scan reader
sstables: Add integrity option to data_consume_single_partition()
sstables: Disengage integrity_check from sstable class
sstables: Allow data sources to disable digest check
data_consume_rows_context_m has a _column_value buffer it uses to read
key and column values into, preparing for parsing and consuming them.
This buffer is reset (released) in a few different cases:
* When using it for key - after consuming its content
* When using it for column value - when a colum has no value
However, the buffer is not released when used for a column value and the
column is consumed. This means that if a large column is read from the
sstable, this buffer can potentially linger and keep consuming memory
until either one of the other release scenarios is hit, or the reader is
destroyed.
Add a third release scenario, releasing the buffer after the row end was
consumed. This allows the buffer to be re-used between columns of the
same row, at the same time ensuring that a large buffer will not linger.
This patch can almost halve the memory consumption of reads in certain
circumstances. Point in case: the test
test_reader_concurrency_semaphore_memory_limit_engages starts to fail
after this fix, because the read doesn't trigger the OOM limit anymore
and needs doubling of the concurrency to keep passing.
This issue was found in a dtest
(`test_ics_refresh_with_big_sstable_files`), which writes some large
cells of up to 7MiB. After reading the row containing this large cell,
the reader holds on to the 7MiB buffer causing the semaphore's OOM
protection to kick in down the line.
Fixes: https://github.com/scylladb/scylladb/issues/21160Closesscylladb/scylladb#21132
In previous patch we added support for integrity checking in the mx
full-scan reader.
Do the same for the mx reader, which is the one used by all compaction
types except for scrub compaction. The mx reader should now support
integrity checking for single-partition and multi-partition reads.
Single-partition reversed reads were excluded from this patch because
they are not used in compaction.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In 716fc487fd we introduced integrity checking in the mx crawling reader
(later renamed to full-scan reader in 6250ff18eb).
When integrity checking is enabled, the full-scan reader expects that
the checksum and digest components have been loaded from disk by the
caller. This is true for the validation path, in which
`sstable::validate()` loads the components before creating the full-scan
reader, but it doesn't hold if a full-scan reader is created directly by
a higher-level function through `sstable::make_full_scan_reader()`.
As part of the effort to enable integrity checking for compaction, this
becomes a blocker for scrub compaction, which relies solely on full-scan
readers.
Solve this by allowing the mx full-scan reader to load the checksum and
digest components internally. The loading is an asynchronous operation,
so it has to be deferred until the first buffer fill.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.
However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.
Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
To reduce the dependency load, replace use of boost ranges
with the std equivalent.
Files that lost the indirect boost dependency have it added as a
direct dependency.
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.
There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.
The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.
We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately. This is especially important
for single-row reads, where the bounds are around the same key. In
this case we want to read the data file range which belongs to a
single promoted index block. It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache. Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.
Fixes#10030.
The change was tested with perf-fast-forward.
I populated the data set with `column_index_size_in_kb` set to 1
scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1
Test run:
build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0
This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.
Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.
Also, the first read which misses in cache is still faster after the change.
Before:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009802 1 1 102 0 102 102 21.0 21 196 2 1 0 1 1 0 0 0 568 269 4716050 53.4%
500001 1 0.000321 1 1 3113 0 3113 3113 2.0 2 64 1 0 1 0 0 0 0 0 116 26 555110 45.0%
```
After:
```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride rows time (s) iterations frags frag/s mad f/s max f/s min f/s avg aio aio (KiB) blocked dropped idx hit idx miss idx blk c hit c miss c blk allocs tasks insns/f cpu
500000 1 0.009609 1 1 104 0 104 104 20.0 20 137 2 1 0 1 1 0 0 0 561 268 4633407 43.1%
500001 1 0.000217 1 1 4602 0 4602 4602 1.0 1 2 1 0 1 0 0 0 0 0 110 26 313882 64.1%
```
Backports: none, not a regression
Closesscylladb/scylladb#20522
* github.com:scylladb/scylladb:
perf: perf_fast_forward: Add test case for querying missing rows
perf-fast-forward: Allow overriding promoted index block size
perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
sstables: bsearch_clustered_cursor: Add more tracing points
sstables: reader: Log data file range
sstables: bsearch_clustered_cursor: Unify skip_info logging
sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
sstables: bsearch_clustered_cursor: Skip even to the first block
test: sstables: sstable_3_x_test: Improve failure message
sstables: mx: writer: Never include partition_end marker in promoted index block width
sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
sstables: clustered_cursor: Track current block
To reduce dependency load, use std ranges instead of boost ranges.
The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.
Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.