131 Commits

Author SHA1 Message Date
Michał Chojnowski
4ca215abbc sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
Partitions.db uses a piece of the murmur hash of the partition key
internally. The same hash is used to query the bloom filter.
So to avoid computing the hash twice (which involves converting the
key into a hashable linearized form) it would make sense to use
the same `hashed_key` for both purposes.

This is what we do in this patch. We extract the computation
of the `hashed_key` from `make_pk_filter` up to its parent
`sstable_set_impl::create_single_key_sstable_reader`,
and we pass this hash down both to `make_pk_filter` and
to the sstable reader. (And we add a pointer to the `hashed_key`
as a parameter to all functions along the way, to propagate it).

The number of parameters to `mx::make_reader` is getting uncomfortable.
Maybe they should be packed into some structs.
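
A minimal sketch of the idea, not the actual ScyllaDB code: the hash is computed once by the caller and then handed to both consumers. Every name here (`hashed_key_sketch`, `filter_sketch`, `index_sketch`, `single_key_lookup_sketch`) is hypothetical, and `std::hash` merely stands in for the murmur hash.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string_view>
#include <unordered_set>

// Hypothetical stand-in for the real `hashed_key`: computed once, then shared.
struct hashed_key_sketch {
    std::uint64_t hash;
};

// Counts hash computations so the saving is observable.
inline int hash_computations = 0;

inline hashed_key_sketch make_hashed_key_sketch(std::string_view partition_key) {
    ++hash_computations;
    // Assumption: std::hash stands in for murmur over the linearized key.
    return {std::hash<std::string_view>{}(partition_key)};
}

// Toy "bloom filter": an exact set of full hashes.
struct filter_sketch {
    std::unordered_set<std::uint64_t> hashes;
    bool may_contain(const hashed_key_sketch& hk) const {
        return hashes.count(hk.hash) != 0;
    }
};

// Toy "index": only looks at a piece (the low byte) of the same hash.
struct index_sketch {
    std::unordered_set<std::uint8_t> hash_bytes;
    bool may_contain(const hashed_key_sketch& hk) const {
        return hash_bytes.count(static_cast<std::uint8_t>(hk.hash & 0xff)) != 0;
    }
};

// The caller hashes once and passes the result down to both consumers.
inline bool single_key_lookup_sketch(const filter_sketch& filter,
                                     const index_sketch& index,
                                     std::string_view partition_key) {
    const hashed_key_sketch hk = make_hashed_key_sketch(partition_key);
    return filter.may_contain(hk) && index.may_contain(hk);
}
```

In this sketch a lookup increments `hash_computations` exactly once, regardless of how many components consult the hash.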
2025-09-29 13:01:22 +02:00
Avi Kivity
f6b6312cf4 Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: introducing the new components, Partitions.db and Rows.db

This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller.

This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (e.g. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR; refer to the individual commits for details.

The changes are for the sake of new and unreleased code. No backporting should be done.

Closes scylladb/scylladb#26075

* github.com:scylladb/scylladb:
  sstables/mx/reader: remove mx::make_reader_with_index_reader
  test/boost/bti_index_test: fix indentation
  sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
  sstables/trie: support reader_permit and trace_state properly
  sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached
  sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
  sstables/trie/bti_index_reader: support BYPASS CACHE
  test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate
  sstables/trie: change the signature of bti_partition_index_writer::finish
  sstables/bti_index: improve signatures of special member functions in index writers
  streaming/stream_transfer_task: coroutinize `estimate_partitions()`
  types/comparable_bytes: add a missing implementation for date_type_impl
  sstables: remove an outdated FIXME
  storage_service: delete `get_splits()`
  sstables/trie: fix some comment typos in bti_index_reader.cc
  sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
2025-09-18 12:10:27 +03:00
Ernest Zaslavsky
54aa552af7 treewide: Move type related files to a type directory
As requested in #22110, moved the files and fixed other includes and build system.
Moved files:
- duration.hh
- duration.cc
- concrete_types.hh

Fixes: #22110

This is a cleanup, no need to backport

Closes scylladb/scylladb#25088
2025-09-17 17:32:19 +03:00
Michał Chojnowski
b7afda5030 sstables/mx/reader: remove mx::make_reader_with_index_reader
When `mx::make_reader` is used to construct an sstable reader,
it constructs its own index reader internally.

`mx::make_reader_with_index_reader` was originally added
as a variant of `mx::make_reader` which can be used to inject
a custom `index_reader` for testing that the mx Data reader
tolerates inexact indexes.

But now we want the ability to choose between BIG index readers
and BTI index readers if both are present. And at this point,
it seems to me that it makes sense to just construct the index
reader in the caller and pass it via argument to `mx::make_reader`
instead of putting the index selection inside it.

So that's what we do in this patch. And we remove `mx::make_reader_with_index_reader`
because it's no longer different from `mx::make_reader`.
2025-09-17 12:22:41 +02:00
Nadav Har'El
f6a3e6fbf0 sstables: don't depend on fmt 11.1 to build
A recent commit a0c29055e5 added
some trace printouts which print an std::reference_wrapper<>.
Apparently a formatter for this type was only added to fmt
in version 11.1.0, and it doesn't exist on earlier versions,
such as fmt 11.0.2 on Fedora 41.

Let's avoid requiring shiny-new versions of fmt. The workaround
is easy: just unwrap the reference_wrapper - print pr.get()
instead of just pr, and Scylla returns to building correctly on
Fedora 41.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25228
2025-07-29 11:32:06 +02:00
Michał Chojnowski
810eb93ff0 sstables/mx/reader: allow passing a custom index reader to the constructor
For tests.
Will be used for testing how the data reader reacts to various
combinations of inexact index lookup results.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
03bf6347e2 sstables/mx/reader: handle inexact lookups in advance_context()
`advance_context()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_context()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.

This makes `advance_to` unused, so we remove it.
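
The `a`, `b`, `c` example above can be sketched as a toy model. This is an illustration under assumptions, not the BTI implementation: the index is modeled as a sorted, prefix-free list of strings, and `advance_to_sketch` and `candidates_sketch` are hypothetical names. The function returns the inclusive range of entries that could be the first partition at or after the key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Inclusive range of index entries that might be the first partition >= key.
// An index equal to prefixes.size() means "past the last entry".
struct candidates_sketch {
    std::size_t first;
    std::size_t last;
};

// Assumption: `prefixes` is sorted and no stored prefix is a prefix of another.
inline candidates_sketch advance_to_sketch(const std::vector<std::string>& prefixes,
                                           std::string_view key,
                                           bool key_known_present) {
    // First stored prefix that does not sort before the key.
    auto it = std::lower_bound(prefixes.begin(), prefixes.end(), key,
        [](const std::string& p, std::string_view k) { return std::string_view(p) < k; });
    const std::size_t i = static_cast<std::size_t>(it - prefixes.begin());
    if (i > 0) {
        const std::string_view prev = prefixes[i - 1];
        const bool prev_is_prefix = key.substr(0, prev.size()) == prev;
        if (prev_is_prefix) {
            // The real key behind `prev` may sort before or after `key`,
            // unless `key` is known to be in the sstable, in which case
            // `prev` must be the entry for `key` itself.
            return key_known_present ? candidates_sketch{i - 1, i - 1}
                                     : candidates_sketch{i - 1, i};
        }
    }
    return {i, i};
}
```

For the prefixes `a`, `b`, `c` and the key `bb`, the sketch yields two candidates (`b` or `c`) when presence is unknown and exactly one (`b`) when the key is known to be present.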
2025-07-25 11:00:18 +02:00
Michał Chojnowski
11792850dd sstables/mx/reader: handle inexact lookups in advance_to_next_partition()
`advance_to_next_partition()` needs an ability to advance the index to
the partition immediately following the reader's current partition.
For this, it uses `abstract_index_reader::advance_to(dht::ring_position_view)`

But BTI (and any index format which stores only the prefixes of keys
instead of whole keys) can't implement `advance_to` with its current
semantics. The Data position returned by the index for a generic
`advance_to` might be off by one partition.

E.g. if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first entry after `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.

However, BTI can be used exactly if the partition is known to
be present in the sstable. (In the above example, if `bb` is known
to be present in the sstable, then it must correspond to `b`.
So the index can reliably advance to `bb` or the first partition after it).

And this is enough for `advance_to_next_partition()`, because the
current partition is known to be present.
So we can replace the usage of `advance_to` with an equivalent API call
which only works with present keys, but in exchange is implementable
by BTI.
2025-07-25 11:00:18 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
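
A sketch of what a weakened interface could look like. The types and the fallback below are hypothetical; the commit message does not say how callers were adapted, so reading the key from the Data file is only one plausible adaptation.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <utility>

// Hypothetical interface: the key is optional because some index formats
// cannot reconstruct it.
struct index_reader_sketch {
    virtual ~index_reader_sketch() = default;
    virtual std::optional<std::string> get_partition_key() const = 0;
};

// An index that stores whole keys can always answer.
struct full_key_index_sketch final : index_reader_sketch {
    std::string key;
    explicit full_key_index_sketch(std::string k) : key(std::move(k)) {}
    std::optional<std::string> get_partition_key() const override { return key; }
};

// An index that stores only prefixes has nothing reliable to return.
struct prefix_only_index_sketch final : index_reader_sketch {
    std::optional<std::string> get_partition_key() const override { return std::nullopt; }
};

// Caller-side adaptation (assumed): fall back to another source of the key.
inline std::string resolve_partition_key_sketch(
        const index_reader_sketch& index,
        const std::function<std::string()>& read_key_from_data) {
    if (auto key = index.get_partition_key()) {
        return *key;
    }
    return read_key_from_data();
}
```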
2025-07-25 11:00:18 +02:00
Michał Chojnowski
a0c29055e5 sstables/mx/reader: handle "backward jumps" in forward_to
A bunch of code assumes that the Data.db stream can only go forward.
But with BTI indexes, if we perform an advance_to, the index can point to a position
which the data reader has already passed, since the index is inexact.

The logic of the data reader ensures that it has stopped
within the last partition range, or just immediately
after it, after reading the next partition key and
noticing that it doesn't belong to the range.

But forward_to can only be used with increasing ranges.
The start of the next range must be greater than or equal to the
end of the previous range.

This means that the exact start of the next partition range
can be no earlier than:
1. The position just before the partition key just read by the data reader,
if the data reader is positioned immediately after a partition key.
2. The start of the first partition after the current data reader
position, if the data reader isn't positioned immediately after a
partition key.

So, if the index returns a position smaller than the current data
reader position, then:
1. If the reader is immediately after a partition key,
we have to reuse this partition key (since we can't go back
in the stream to read it again), and keep reading from
the current position.
2. Otherwise we can safely walk the index to the first partition
that lies no earlier than the current position.
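
The two cases above reduce to a small decision function. This is a sketch with hypothetical names (`forward_action_sketch`, `on_forward_to_sketch`); positions are plain byte offsets into the Data file, and the real reader tracks considerably more state.

```cpp
#include <cassert>
#include <cstdint>

enum class forward_action_sketch {
    seek_to_index_position,   // normal case: the index points at or ahead of the reader
    reuse_buffered_key,       // case 1: keep the key already read, continue from here
    walk_index_forward        // case 2: advance the index up to the current position
};

inline forward_action_sketch on_forward_to_sketch(std::uint64_t index_position,
                                                  std::uint64_t reader_position,
                                                  bool reader_just_after_partition_key) {
    if (index_position >= reader_position) {
        return forward_action_sketch::seek_to_index_position;
    }
    // The (inexact) index points behind the reader: a "backward jump".
    if (reader_just_after_partition_key) {
        // The stream cannot be rewound to re-read the key, so reuse it.
        return forward_action_sketch::reuse_buffered_key;
    }
    return forward_action_sketch::walk_index_forward;
}
```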
2025-07-25 10:49:58 +02:00
Michał Chojnowski
218b2dffff sstables/mx/reader: filter out partitions outside the queried range
The current index format is exact: it always returns the position of the
first partition in the queried partition range.

But we are about to add an index format where that doesn't have to be the case.
In BTI indexes, the lookup can be off by one partition sometimes. This patch prepares
the reader for that, by skipping the partitions which were read by the
data reader but don't belong to the queried range.

Note: as of this patch, only the "normal path" is ever used.
We add tests exercising these code paths later.

Also note that, as of this patch, actually stepping outside
the queried range would cause the reader to end up in a
state where the underlying parser is positioned right after the
partition key immediately following the queried range.
If the reader was forwarded to that key in this state,
it would trip an assert, because the parser can't handle backward
jumps. We will add logic to handle this case in the next patch.
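
The filtering step can be illustrated with a toy model in which partition keys are strings and the queried range is half-open. The names are hypothetical, and the real reader works on ring positions and a streaming parser rather than a vector of keys.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Half-open key range [start, end), an assumed simplification of a partition range.
struct partition_range_sketch {
    std::string start;
    std::string end;
};

// `keys_from_data` are the partitions the data reader encounters, in order,
// starting from wherever the (possibly inexact) index positioned it.
inline std::vector<std::string> filter_to_range_sketch(
        const std::vector<std::string>& keys_from_data,
        const partition_range_sketch& pr) {
    std::vector<std::string> emitted;
    for (const auto& key : keys_from_data) {
        if (key < pr.start) {
            continue;   // the index pointed too early: skip, don't emit
        }
        if (key >= pr.end) {
            break;      // stepped past the queried range: stop
        }
        emitted.push_back(key);
    }
    return emitted;
}
```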
2025-07-25 10:49:57 +02:00
Michał Chojnowski
2b81fdf09b sstables/mx/reader: update _pr after fast_forward_to
In later patches, we will prepare the reader for inexact index
implementations (ones which can return a Data file range that
includes some partitions before or after the queried range).

For that, we will need to filter out the partitions outside of the
range, and for that we need to remember the range. This is the
goal of this patch.

Note that we are storing a reference to an argument of
`fast_forward_to`. This is okay, because the contract
of `mutation_reader` specifies that the caller must
keep `pr` alive until the next `fast_forward_to`
or until the reader is destroyed.
2025-07-25 10:49:57 +02:00
Botond Dénes
20693edb27 Merge 'sstables: put index_reader behind a virtual interface' from Michał Chojnowski
This is a refactoring patch in preparation for BTI indexes. It contains no functional changes (or at least it's not intended to).

In this patch, we modify the sstable readers to use index readers through a new virtual `abstract_index_reader` interface.
Later, we will add BTI indexes which will also implement this interface.

This interface contains the methods of `index_reader` which are needed by sstable readers, and leaves out all other methods, such as `current_clustered_cursor`.

Not all methods of this interface will be implementable by a trie-based index later. For example, a trie-based index can't provide a reliable `get_partition_key()`, because — unlike the current index — it only stores partition keys for partitions which have a row index. So the interface will have to be further restricted later. We don't do that in this patch because that will require changes to sstable reader logic, and this patch is supposed to only include cosmetic changes.

No backports needed, this is a preparation for new functionality.

Closes scylladb/scylladb#25000

* github.com:scylladb/scylladb:
  sstables: add sstable::make_index_reader() and use where appropriate
  sstables/mx: in readers, use abstract_index_reader instead of index_reader
  sstables: in validate(), use abstract_index_reader instead of index_reader where possible
  test/lib/index_reader_assertions: accept abstract_index_reader instead of index_reader
  sstables/index_reader: introduce abstract_index_reader
  sstables/index_reader: extract a prefetch_lower_bound() method
2025-07-17 14:32:08 +03:00
Michał Chojnowski
4e4a4b6622 sstables: add sstable::make_index_reader() and use where appropriate
If we add multiple index implementations, users of index readers won't
easily know which concrete index reader type is the right one to construct.

We also don't want pieces of code to depend on functionality specific to
certain concrete types, if that's not necessary.

So instead of constructing the readers by themselves, they can use a helper
function, which will return an abstract (virtual) index reader.
This patch adds such a function, as a method of `sstable`.
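
A sketch of the factory pattern described here. All types are hypothetical stand-ins, and the selection rule (prefer BTI when its components exist) is an assumption made for illustration; the commit itself only introduces the helper.

```cpp
#include <cassert>
#include <memory>
#include <string_view>

// Hypothetical abstract interface that readers depend on.
struct abstract_index_reader_sketch {
    virtual ~abstract_index_reader_sketch() = default;
    virtual std::string_view format_name() const = 0;
};

struct big_index_reader_sketch final : abstract_index_reader_sketch {
    std::string_view format_name() const override { return "BIG"; }
};

struct bti_index_reader_sketch final : abstract_index_reader_sketch {
    std::string_view format_name() const override { return "BTI"; }
};

struct sstable_sketch {
    bool has_bti_components = false;

    // Callers never name a concrete reader type; the sstable picks one.
    std::unique_ptr<abstract_index_reader_sketch> make_index_reader() const {
        if (has_bti_components) {
            return std::make_unique<bti_index_reader_sketch>();
        }
        return std::make_unique<big_index_reader_sketch>();
    }
};
```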
2025-07-17 10:32:57 +02:00
Michał Chojnowski
1c4065e7dd sstables/mx: in readers, use abstract_index_reader instead of index_reader
This makes clear which methods of index_reader are available for use
by sstable readers, and which aren't.
2025-07-17 10:32:57 +02:00
Michał Chojnowski
efcf3f5d66 sstables: in validate(), use abstract_index_reader instead of index_reader where possible
After we add a second index implementation, we will probably want to
adjust validate() to work with either implementation.

Some validations will be format-specific, but some will be common.
For now, let's use abstract_index_reader for the validations which
can be done through that interface, and let's have downcast-specific
codepaths for the others.

Note: we change a `get_data_file_position()` call to `data_file_positions().start`.
The call happens at the beginning of a partition, and at this point
these two expressions are supposed to be equivalent.
2025-07-17 10:32:57 +02:00
Michał Chojnowski
1e7a292ef4 sstables/index_reader: extract a prefetch_lower_bound() method
The sstable reader reaches directly for a `clustered_index_cursor`.
But a BTI index reader won't be able to implement
`clustered_index_cursor`, because a BTI index doesn't store
full clustering keys, only some trie-encoded prefixes.

So we want to weaken the dependency. Instead of reaching
for `clustered_index_cursor`, we add a method which expresses
our intent, and we let `index_reader` touch the cursor internally.
2025-07-16 00:13:20 +02:00
Ernest Zaslavsky
dff9a229a7 sstables: refactor readers and sources to use coroutines
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.
2025-07-15 10:10:23 +03:00
Botond Dénes
bce89c0f5e sstables: replace SCYLLA_ASSERT() with parse_assert() on the read path
So parse errors on corrupt SSTables don't result in crashes, instead
just aborting the read in process.
There are a lot of SCYLLA_ASSERT() usages remaining in sstables/. This
patch tried to focus on those usages which are in the read path. Some
places not only used on the read path may have been converted too, where
the usage of said method is not clear.
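
The difference between an aborting assert and a throwing one can be sketched as follows. `parse_assert_sketch` and `malformed_sstable_error_sketch` are hypothetical names; the signature of the real `parse_assert()` is not shown in this log and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical exception type for corrupt input.
struct malformed_sstable_error_sketch : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Unlike an aborting assert, a failed check only fails the current read.
inline void parse_assert_sketch(bool condition, const char* what) {
    if (!condition) {
        throw malformed_sstable_error_sketch(what);
    }
}

// Example check on the read path: a length field must fit in the input.
inline std::uint32_t read_length_sketch(std::uint32_t declared_length,
                                        std::size_t bytes_available) {
    parse_assert_sketch(declared_length <= bytes_available,
                        "declared length exceeds available data");
    return declared_length;
}
```

The caller can catch the exception and abort just that read, while the process keeps serving other requests.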
2025-06-24 09:16:28 +03:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kefu Chai
714d12014e sstable/mx: use subrange.advance() when appropriate
Replace manual subrange advancement with the more concise and readable
`subrange.advance()` method. This change:

- Eliminates unnecessary subrange instance creation
- Improves code readability
- Reduces potential for unnecessary object allocation
- Leverages the built-in `advance()` method for cleaner iterator handling

The modification simplifies the iteration logic while maintaining the
same functional behavior.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21865
2024-12-12 10:04:12 +02:00
Kefu Chai
ce2f80c227 treewide: migrate from boost::make_iterator_range to ranges::subrange
Replace boost::make_iterator_range() with std::ranges::subrange.

This change improves code modernization and reduces external dependencies:

- Replace boost::make_iterator_range() with std::ranges::subrange
- Remove boost/range/iterator_range.hpp include
- Improve iterator type detection in interval.hh using std::ranges::const_iterator_t<Range>

This is part of ongoing efforts to modernize our codebase and minimize
external dependencies.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21787
2024-12-09 21:31:53 +02:00
Kefu Chai
bab12e3a98 treewide: migrate from boost::adaptors::transformed to std::views::transform
now that we are allowed to use C++23, we have the luxury of using
`std::views::transform`.

in this change, we:

- replace `boost::adaptors::transformed` with `std::views::transform`
- use `fmt::join()` when appropriate where `boost::algorithm::join()`
  is not applicable to a range view returned by `std::views::transform`.
- use `std::ranges::fold_left()` to accumulate the range returned by
  `std::views::transform`
- use `std::ranges::fold_left()` to get the maximum element in the
  range returned by `std::views::transform`
- use `std::ranges::min()` to get the minimal element in the range
  returned by `std::views::transform`
- use `std::ranges::equal()` to compare the range views returned
  by `std::views::transform`
- remove unused `#include <boost/range/adaptor/transformed.hpp>`
- use `std::ranges::subrange()` instead of `boost::make_iterator_range()`,
  to feed `std::views::transform()` a view range.

to reduce the dependency to boost for better maintainability, and
leverage standard library features for better long-term support.

this change is part of our ongoing effort to modernize our codebase
and reduce external dependencies where possible.

limitations:

there are still a couple of places where we use
`boost::adaptors::transformed` due to the lack of a C++23 alternative
for `boost::join()` and `boost::adaptors::uniqued`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21700
2024-12-03 09:41:32 +02:00
Nadav Har'El
da99dc3a7f cross-tree: change to_sstring_view() to to_string_view()
For historic reasons, we have (in bytes.hh) a type sstring_view which
is an alias for std::string_view - since the same standard type can hold
a pointer into both a seastar::sstring and std::string.

This alias is unnecessary and misleading to new developers (who might
assume it is somehow different from std::string_view). This patch doesn't
yet remove all occurrences of sstring_view (the request in #4062), but
begins to do it by renaming one commonly-used function, to_sstring_view(bytes),
to to_string_view() and of course changes all its uses to the new name.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2024-11-18 14:57:49 +02:00
Avi Kivity
3a6c0a9b36 Merge 'compaction: Perform integrity checks on compacting SSTables' from Nikos Dragazis
This PR enables compaction tasks to verify the integrity of the input data through checksum and digest checks. The mechanism for integrity checking was introduced in previous PRs (#20207, #20720) as a built-in functionality of the input streams. This PR integrates this mechanism with compaction. The change applies to all compaction types and covers both compressed and uncompressed SSTables adhering to the 3.x format. If a compaction task reads only part of an SSTable, then only the per-chunk checksums are verified, not the digest.

The PR consists of:
* Changes to mx readers to support integrity checking. The kl readers, considered as compatibility-only, were left unchanged. Also, integrity checking on single-partition reversed reads (`data_consume_reversed_partition()`) remains unsupported by mx readers as this is not used in compaction.
* Changes to `sstable` and `sstable_set` APIs to allow toggling integrity checks for mx readers.
* Activation of integrity checking for all compaction types.
* Tests for all compaction types with corrupted SSTables.

Integrity checks come at a cost. For uncompressed SSTables, the cost is the loading of the CRC and Digest components from disk, and the calculation of checksums and digest from the actual data. For compressed SSTables, checksums are stored in-place and they are being checked already on all reads, so the only extra cost is the loading and calculation of the digest. The measurements show a ~5% regression in compaction performance for uncompressed SSTables, and a negligible regression for compressed SSTables.

Command: `perf-sstable --smp=1 --cpuset=1 --poll-mode --mode=compaction --iterations=1000 --partitions 10000 --sstables=1 --key_size=4096 --num_columns=15 --column_size={32, 1024, 3500, 7000, 14500}`

Uncompressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec)  | Integrity (p/sec)    | Regression |
+--------------+-----------------------+----------------------+------------+
| 50  MiB      | 65175.59 +- 80.82     | 61814.63 +- 72.88    | 5.16%      |
| 200 MiB      | 41795.10 +- 60.39     | 39686.28 +- 45.05    | 5.05%      |
| 500 MiB      | 21087.41 +- 30.72     | 20092.93 +- 25.05    | 4.72%      |
| 1   GiB      | 12781.64 +- 21.77     | 12233.94 +- 21.71    | 4.29%      |
| 2   GiB      |  6629.99 +-  9.40     |  6377.13 +-  8.28    | 3.81%      |
+--------------+-----------------------+----------------------+------------+
```
Compressed SSTables:
```
+--------------+-----------------------+----------------------+------------+
| SSTable Size | No Integrity (p/sec)  | Integrity (p/sec)    | Regression |
+--------------+-----------------------+----------------------+------------+
| 50  MiB      | 53975.05 +- 63.18     | 53825.93 +- 62.28    |  0.28%     |
| 200 MiB      | 28687.94 +- 26.58     | 28689.41 +- 26.91    |  0%        |
| 500 MiB      | 13865.35 +- 15.50     | 13790.41 +- 14.88    |  0.54%     |
| 1   GiB      |  7858.10 +-  7.71     |  7829.75 +-  9.66    |  0.36%     |
| 2   GiB      |  4023.11 +-  2.43     |  4010.54 +-  2.55    |  0.31%     |
+--------------+-----------------------+----------------------+------------+
(p/sec = partitions/sec)
```

Refs #19071.

New feature, no backport is needed.

Closes scylladb/scylladb#21153

* github.com:scylladb/scylladb:
  test: Add test for compaction with corrupted SSTables
  compaction: Enable integrity checks for all compaction types
  sstables: Add integrity option to factories for sstable_set readers
  sstables: Add integrity option to sstable::make_reader()
  sstables: Add integrity option to mx::make_reader()
  sstables: Load checksums and digests in mx full-scan reader
  sstables: Add integrity option to data_consume_single_partition()
  sstables: Disengage integrity_check from sstable class
  sstables: Allow data sources to disable digest check
2024-11-17 20:59:31 +02:00
Botond Dénes
fed2c6ba83 sstables/mx/reader: release column value buffer after consumed
data_consume_rows_context_m has a _column_value buffer it uses to read
key and column values into, preparing for parsing and consuming them.
This buffer is reset (released) in a few different cases:
* When using it for key - after consuming its content
* When using it for column value - when a column has no value

However, the buffer is not released when used for a column value and the
column is consumed. This means that if a large column is read from the
sstable, this buffer can potentially linger and keep consuming memory
until either one of the other release scenarios is hit, or the reader is
destroyed.
Add a third release scenario, releasing the buffer after the row end was
consumed. This allows the buffer to be re-used between columns of the
same row, at the same time ensuring that a large buffer will not linger.

This patch can almost halve the memory consumption of reads in certain
circumstances. Case in point: the test
test_reader_concurrency_semaphore_memory_limit_engages starts to fail
after this fix, because the read doesn't trigger the OOM limit anymore
and needs doubling of the concurrency to keep passing.

This issue was found in a dtest
(`test_ics_refresh_with_big_sstable_files`), which writes some large
cells of up to 7MiB. After reading the row containing this large cell,
the reader holds on to the 7MiB buffer causing the semaphore's OOM
protection to kick in down the line.
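
The third release scenario can be sketched with a toy reader. `std::vector<char>` stands in for the real buffer type, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct value_reader_sketch {
    std::vector<char> column_value;

    // Reading a column reuses the existing capacity when it is large enough.
    void read_column(std::size_t size) {
        column_value.assign(size, 'x');
    }

    // Release at row end: columns of the same row share the buffer, but a
    // large allocation does not outlive the row that needed it.
    void on_row_end() {
        std::vector<char>().swap(column_value);
    }

    std::size_t retained_bytes() const {
        return column_value.capacity();
    }
};
```

Without the call in `on_row_end()`, a single 7 MiB cell would keep 7 MiB pinned for as long as the reader lives, even if every later cell is tiny.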

Fixes: https://github.com/scylladb/scylladb/issues/21160

Closes scylladb/scylladb#21132
2024-11-14 17:24:53 +01:00
Nikos Dragazis
64688fdad6 sstables: Add integrity option to mx::make_reader()
In the previous patch we added support for integrity checking in the mx
full-scan reader.

Do the same for the mx reader, which is the one used by all compaction
types except for scrub compaction. The mx reader should now support
integrity checking for single-partition and multi-partition reads.
Single-partition reversed reads were excluded from this patch because
they are not used in compaction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:40:30 +02:00
Nikos Dragazis
1993aa5261 sstables: Load checksums and digests in mx full-scan reader
In 716fc487fd we introduced integrity checking in the mx crawling reader
(later renamed to full-scan reader in 6250ff18eb).

When integrity checking is enabled, the full-scan reader expects that
the checksum and digest components have been loaded from disk by the
caller. This is true for the validation path, in which
`sstable::validate()` loads the components before creating the full-scan
reader, but it doesn't hold if a full-scan reader is created directly by
a higher-level function through `sstable::make_full_scan_reader()`.

As part of the effort to enable integrity checking for compaction, this
becomes a blocker for scrub compaction, which relies solely on full-scan
readers.

Solve this by allowing the mx full-scan reader to load the checksum and
digest components internally. The loading is an asynchronous operation,
so it has to be deferred until the first buffer fill.
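
A synchronous sketch of the deferral, with hypothetical names. The real loading is asynchronous, which is the reason it cannot happen in the constructor; here a plain callback stands in for it.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

class full_scan_reader_sketch {
    std::function<std::vector<std::uint32_t>()> _load_checksums;
    std::optional<std::vector<std::uint32_t>> _checksums;
public:
    // The constructor only remembers how to load; it performs no I/O.
    explicit full_scan_reader_sketch(std::function<std::vector<std::uint32_t>()> loader)
        : _load_checksums(std::move(loader)) {}

    bool checksums_loaded() const { return _checksums.has_value(); }

    void fill_buffer() {
        if (!_checksums) {
            // First fill: load the components exactly once.
            _checksums = _load_checksums();
        }
        // ... reading and verifying data would follow here ...
    }
};
```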

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
609b16307e sstables: Add integrity option to data_consume_single_partition()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Nikos Dragazis
5b896cdbb7 sstables: Disengage integrity_check from sstable class
The `integrity_check` flag was first introduced as a parameter in
`sstable::data_stream()` to support creating input streams with
integrity checking. As such, it was defined in the sstable class.

However, we also use this flag in the kl/mx full-scan readers, and, in
a later patch, we will use it in `class sstable_set` as well.

Move the definition into `types_fwd.hh` since it is no longer bound to
the sstable class.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-11-11 20:26:27 +02:00
Pavel Emelyanov
f3f956841f sstables: Remove unused mp_row_consumer_m::range_tombstone_start
It's only used by its operator<< so remove it as well

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#21380
2024-11-03 16:40:02 +02:00
Tomasz Grabiec
95b864497a sstables: reader: Log data file range
2024-10-03 14:16:05 +02:00
Tomasz Grabiec
a29501ed67 sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
Single-row reads from a large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users want to reduce selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses the promoted index
to look up the start position in the data file of the read, but the end position
will in practice extend to the next partition, and the amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs of 32 KiB each.

There is already infrastructure to look up the end position based on the upper
bound of the read, but it's not effective because it's a
non-populating lookup and the upper bound cursor has its own private
cached_promoted_index, which is cold when positions are computed. It's
non-populating on purpose, to avoid extra index file IO to read upper
bound. In case upper bound is far-enough from the lower bound, this
will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test reads two rows from the middle of a large partition (1M
rows), of subsequent keys. The first read will miss in the index file
page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%

After:

running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%

(cherry picked from commit dfb339376aff1ed961b26c4759b1604f7df35e54)
2024-10-01 18:40:34 +02:00
Kefu Chai
df7f332a58 sstable: s/crawling_sstable_mutation_reader/sstable_full_scan_reader
"crawling" is a little bit obscure in this context. so let's rename this
class to reflect the fact that this reader only reads the entire content
of the sstable.

both crawling readers, for the kl and mx formats, are renamed. also, in
order to be consistent, all "crawling reader" occurrences in variable
names are updated as well.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-17 10:39:37 +08:00
Kefu Chai
c1ed2f0ea4 sstable/mx/reader: add comment for mx_crawling_sstable_mutation_reader
to explain its typical usage.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-17 10:39:25 +08:00
Nikos Dragazis
719757fba9 sstables: Enable checksum validation for uncompressed SSTables
Extend the `sstable::validate()` to validate the checksums of
uncompressed SSTables. Given that this is already supported for
compressed SSTables, this allows us to provide consistent behavior
across any type of SSTable, be it either compressed or uncompressed.

The most prominent use case for this is scrub/validate, which is now
able to detect file-level corruption in uncompressed SSTables as
well.

Note that this change will not affect normal user reads which skip
checksum validation altogether.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
716fc487fd sstables: Expose integrity option via crawling mutation readers
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Nikos Dragazis
1d2dc9f2e1 sstables: Expose integrity option via data_consume_rows()
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2024-09-11 12:28:59 +03:00
Łukasz Paszkowski
da95f44adc readers: Use reversed schema and native reversed slices
The reconcilable_result is built as it would be constructed for
forward read queries for tables with reversed order.

Mutations constructed for reversed queries are consumed forward.

Drop overloaded reversed functions that reverse read_command and
reconcilable_result directly and keep only those requiring smart
pointers. They are not used any more.
2024-08-13 10:03:46 +02:00
Łukasz Paszkowski
7b201e9165 kl::reader::make_reader: Unify interface with mx::reader::make_reader
Ensure both readers have the same interfaces to avoid mistakes as
both readers are used in sstable::make_reader. Less error prone.
2024-08-13 10:02:43 +02:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.
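
As an illustration of the idea, an assertion macro that does not depend on NDEBUG might look like the sketch below. The names `ALWAYS_ASSERT`, `always_assert_fail`, and `side_effect_survives` are hypothetical; the real macro is defined in utils/assert.hh and may differ in detail.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical failure handler: report the failed expression and abort.
[[noreturn]] inline void always_assert_fail(const char* expr, const char* file, int line) {
    std::fprintf(stderr, "%s:%d: assertion failed: %s\n", file, line, expr);
    std::abort();
}

// Hypothetical always-on assertion. Unlike assert() from <cassert>, its
// definition does not change when NDEBUG is defined.
#define ALWAYS_ASSERT(expr) \
    do { \
        if (!(expr)) { \
            always_assert_fail(#expr, __FILE__, __LINE__); \
        } \
    } while (false)

// The condition is always evaluated, so side effects inside it are kept.
// With plain assert() and NDEBUG defined, the increment would be compiled out.
inline int side_effect_survives() {
    int calls = 0;
    ALWAYS_ASSERT(++calls == 1);
    return calls;
}
```

This is why call sites written with the expectation of being enabled in release mode can be converted first, and NDEBUG enabled afterwards.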

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Avi Kivity
fdc1449392 treewide: rename flat_mutation_reader_v2 to mutation_reader
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:

  e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
  08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"

as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent range tombstones. See
those commits for more information.

The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit

  026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"

In turn, flat_mutation_reader was introduced in 2017 in commit

  748205ca75 "Introduce flat_mutation_reader"

To transition from a mutation_reader that nested rows within
a partition in a separate stream, to a flat reader that streamed
partitions and rows in the same stream.

Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.

Note that mutation_fragment_v2 remains since we still use the original
for compatibility, sometimes.

Some notes about the transition:

 - files were also renamed. In one case (flat_mutation_reader_test.cc), the
   rename target already existed, so we rename to
    mutation_reader_another_test.cc.

 - a namespace 'mutation_reader' with two definitions existed (in
   mutation_reader_fwd.hh). Its contents were folded into the mutation_reader
   class. As a result, a few #includes had to be adjusted.

Closes scylladb/scylladb#19356
2024-06-21 07:12:06 +03:00
Kefu Chai
372a4d1b79 treewide: do not define FMT_DEPRECATED_OSTREAM
since we do not rely on FMT_DEPRECATED_OSTREAM to define the
fmt::formatter for us anymore, let's stop defining `FMT_DEPRECATED_OSTREAM`.

in this change,

* utils: drop the range formatters in to_string.hh and to_string.c, as
  we don't use them anymore. and the tests for them in
  test/boost/string_format_test.cc are removed accordingly.
* utils: use fmt to print chunk_vector and small_vector. as
  we are not able to print the elements using operator<< anymore
  after switching to {fmt} formatters.
* test/boost: specialize fmt::details::is_std_string_like<bytes>.
  due to a bug in {fmt} v9, {fmt} fails to format a range whose
  element type is `basic_sstring<uint8_t>`, as it considers it
  as a string-like type, but `basic_sstring<uint8_t>`'s char type
  is signed char, not char. this issue does not exist in {fmt} v10,
  so, in this change, we add a workaround to explicitly specialize
  the type trait to ensure that {fmt} formats this type using its
  `fmt::formatter` specialization instead of trying to format it
  as a string. also, {fmt}'s generic ranges formatter calls the
  pair formatter's `set_brackets()` and `set_separator()` methods
  when printing the range, but the operator<< based formatter does not
  provide these methods, so we have to include this change in the change
  switching to {fmt}, otherwise the change specializing
  `fmt::details::is_std_string_like<bytes>` won't compile.
* test/boost: in tests, we use `BOOST_REQUIRE_EQUAL()` and its friends
  for comparing values. but without the operator<< based formatters,
  Boost.Test would not be able to print them. after removing
  the homebrew formatters, we need to use the generic
  `boost_test_print_type()` helper to do this job. so we are
  including `test_utils.hh` in tests so that we can print
  the formattable types.
* treewide: add `#include "utils/to_string.hh"` where
  `fmt::formatter<optional<>>` is used.
* configure.py: do not define FMT_DEPRECATED_OSTREAM
* cmake: do not define FMT_DEPRECATED_OSTREAM

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-04-19 22:57:36 +08:00
Botond Dénes
a19a2d76c9 sstablex/mx/reader: validate(): print trace message when finishing the PI block
2024-03-12 11:05:18 -04:00
Botond Dénes
677be168c4 sstablex/mx/reader: validate(): make index-data PI position check message consistent
The message says "index-data" but when printing the position, the data
position is printed first, causing confusion. Fix this and while at it,
also print the position of the partition start.
2024-03-12 11:05:18 -04:00
Botond Dénes
5bff7c40d3 sstablex/mx/reader: validate(): only load the next PI block if current is exhausted
The validate() consumes the content of partitions in a consume-loop.
Every time the consumer asks for a "break", the next PI block is loaded
and set on the validator, so it can validate that further clustering
elements are indeed from this block.
This loop assumed the consumer would only request interruption when the
current clustering block is finished. This is wrong, the consumer can
also request interruption when yielding is needed. When this is the
case, the next PI block doesn't have to be loaded yet, the current one
is not exhausted yet. Check this condition, before loading the next PI
block, to prevent false positive errors, due to mismatched PI block
and clustering elements from the sstable.
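
A toy model of the loop may make the failure mode easier to see. Everything in this sketch is hypothetical: rows are plain integers, each PI block is identified only by its last key, and `count_mismatches` and `yield_every` are illustrative names, not part of the reader.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts rows that fall outside the PI block the validator believes is
// current. `yield_every` models the consumer asking for a break after that
// many rows; `check_exhausted` selects the fixed behaviour (true) or the old
// one (false), which loaded the next block on every break.
inline std::size_t count_mismatches(const std::vector<int>& rows,
                                    const std::vector<int>& block_last_keys,
                                    std::size_t yield_every,
                                    bool check_exhausted) {
    assert(yield_every > 0 && !block_last_keys.empty());
    std::size_t block = 0, mismatches = 0, i = 0;
    while (i < rows.size()) {
        // Consume step: runs until the block is finished or a yield is due.
        std::size_t consumed = 0;
        bool block_finished = false;
        while (i < rows.size() && consumed < yield_every && !block_finished) {
            const int key = rows[i++];
            ++consumed;
            const bool have_block = block < block_last_keys.size();
            const bool in_block = have_block
                && key <= block_last_keys[block]
                && (block == 0 || key > block_last_keys[block - 1]);
            if (!in_block) {
                ++mismatches;
            }
            block_finished = have_block && key == block_last_keys[block];
        }
        // The consumer asked for a break. Load the next block only if the
        // current one is exhausted, unless modelling the old behaviour.
        if (block_finished || !check_exhausted) {
            ++block;
        }
    }
    return mismatches;
}
```

In this model, rows 1 to 6 with blocks ending at keys 3 and 6 validate cleanly when the exhaustion check is on, but produce spurious mismatches when every break advances the block.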
2024-03-12 11:05:18 -04:00
Botond Dénes
e073df1dbb sstablex/mx/reader: validate(): reset the current PI block on partition-start
It is possible that the next partition has no PI and thus there won't be
a new PI block to overwrite the old one. This will result in
false-positive messages about rows being outside of the finished PI
block.
2024-03-12 11:05:18 -04:00
Botond Dénes
2737899c21 sstablex/mx/reader: validate(): consume_range_tombstone(): check for finished clustering blocked
Promoted index entries can be written on any clustering elements,
including range tombstones. So the validating consumer also has to check
whether the current expected clustering block is finished, when
consuming a range tombstone. If it is, consumption has to be
interrupted, so that the outer-loop can load up the next promoted index
block, before moving on to the next clustering element.
2024-03-12 11:05:18 -04:00
Botond Dénes
f46b458f0d sstablex/mx/reader: validate(): fix validator for range tombstone end bounds
For range tombstone end-bounds, the validate_fragment_order() should be
passed a null tombstone, not a disengaged optional. The latter means no
change in the current tombstone. This caused the end bound of range
tombstones to not make it to the validator and the latter complained
later on partition-end that the partition has unclosed range tombstone.
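
The distinction between the two values can be sketched as below. The `tombstone` and `order_validator` types here are simplified, hypothetical stand-ins written for illustration; they are not the validator's real interface.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical minimal tombstone: timestamp 0 means "null", i.e. nothing
// is deleted.
struct tombstone {
    int64_t timestamp = 0;
    explicit operator bool() const { return timestamp != 0; }
};

// Hypothetical validator that tracks the active range tombstone.
class order_validator {
    tombstone _current;
public:
    // Disengaged optional: the fragment does not change the active tombstone.
    // Engaged optional: the active tombstone becomes *change; a null
    // tombstone therefore closes the open range.
    void on_fragment(std::optional<tombstone> change) {
        if (change) {
            _current = *change;
        }
    }
    // At partition end, no range tombstone may still be open.
    bool on_partition_end() const { return !_current; }
};

// Feeds a start bound followed by an end bound and reports whether the
// partition ends cleanly.
inline bool end_bound_closes(bool pass_null_tombstone) {
    order_validator v;
    v.on_fragment(tombstone{42});      // start bound opens the range
    if (pass_null_tombstone) {
        v.on_fragment(tombstone{});    // engaged but null: closes the range
    } else {
        v.on_fragment(std::nullopt);   // disengaged: read as "no change"
    }
    return v.on_partition_end();
}
```

Passing the disengaged optional for the end bound leaves the tombstone open in this model, which corresponds to the false "unclosed range tombstone" complaint.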
2024-03-12 11:05:18 -04:00
Michał Chojnowski
f9e97fa632 sstables: fix a use-after-free in key_view::explode()
key_view::explode() contains a blatant use-after-free:
unless the input is already linearized, it returns a view to a local temporary buffer.

This is rare, because partition keys are usually not large enough to be fragmented.
But for a sufficiently large key, this bug causes a corrupted partition_key down
the line.
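
The bug class can be illustrated with a small sketch. It shows the general pattern only, under the assumption that a fragmented key is a list of byte chunks; `fragmented_view` and `linearize` are hypothetical names, and this is not the code of key_view::explode() or of its fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical fragmented buffer: the key bytes are split across chunks.
using fragmented_view = std::vector<std::string_view>;

// Buggy pattern, shown only as a comment: for multi-fragment input the
// returned view points into `tmp`, which is destroyed on return.
//
//   std::string_view linearize_bad(const fragmented_view& in) {
//       if (in.size() == 1) { return in.front(); }
//       std::string tmp;
//       for (auto f : in) { tmp += f; }
//       return tmp;   // dangling view
//   }

// Safer pattern: return an owning buffer, so the bytes stay valid for as
// long as the caller keeps the result.
inline std::string linearize(const fragmented_view& in) {
    std::size_t total = 0;
    for (auto f : in) {
        total += f.size();
    }
    std::string out;
    out.reserve(total);
    for (auto f : in) {
        out.append(f);
    }
    return out;
}
```

The single-fragment case hides the problem in the buggy variant, which matches the observation that only sufficiently large keys were affected.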

Fixes #17625

Closes scylladb/scylladb#17626
2024-03-07 09:07:07 +02:00