Commit Graph

35 Commits

Michał Chojnowski
d8e299dbb2 sstables/trie/trie_writer: free nodes after they are flushed
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.

This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.

Fix that, and add a test that catches this.

Fixes scylladb/scylladb#27082

Closes scylladb/scylladb#27083
2025-11-19 14:54:16 +02:00
Michał Chojnowski
b82c2aec96 sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key
_last_key is a multi-fragment buffer.

Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.

The job of write_last_key() is to find the middle fragments
(those containing the range `[_last_key_mismatch, needed_prefix)`),
trim the first and last of them appropriately,
and feed them to the trie writer.

But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
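A standalone toy model of the intended trimming (hypothetical names and a flat representation, not Scylla's actual code) shows how skipping zero-length contributions avoids the empty leading fragment:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the middle-fragment selection done by
// write_last_key(): given a fragmented key and the byte range
// [mismatch, needed_prefix), return the trimmed fragments to feed
// to the trie writer, skipping fragments that contribute no bytes.
std::vector<std::string> middle_fragments(const std::vector<std::string>& frags,
                                          std::size_t mismatch, std::size_t needed_prefix) {
    std::vector<std::string> out;
    std::size_t pos = 0; // absolute offset of the current fragment's first byte
    for (const auto& f : frags) {
        std::size_t begin = std::max(pos, mismatch);
        std::size_t end = std::min(pos + f.size(), needed_prefix);
        if (begin < end) {
            // A fragment whose end coincides with `mismatch` would contribute
            // an empty prefix; the buggy logic emitted that empty fragment anyway.
            out.push_back(f.substr(begin - pos, end - begin));
        }
        pos += f.size();
    }
    return out;
}
```

With the key `aaa|bbb|ccc`, `mismatch == 3` and `needed_prefix == 7`, this yields `bbb|c`, with no empty first fragment.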

Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are questionable.

Fix that.

We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.

Fixes scylladb/scylladb#26819

Closes scylladb/scylladb#26839
2025-11-07 11:25:07 +02:00
Michał Chojnowski
3cf51cb9e8 sstables: fix some typos in comments
I added those typos recently, and spellcheckers complain.

Closes scylladb/scylladb#26376
2025-10-07 13:20:06 +03:00
Avi Kivity
15fa1c1c7e Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.

In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they perform the key translation eagerly,
writing all components to a single bytes_ostream in one synchronous call.

perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch        9233   109.930us     0.000ns   109.930us   109.930us    4356.000       0.000   2615394.3    614709.6

After:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch       50952    19.487us     0.000ns    19.487us    19.487us     198.000       0.000    603120.1    109042.9
```

Enhancement, backport not required.

Closes scylladb/scylladb#26302

* github.com:scylladb/scylladb:
  sstables/trie: BTI-translate the entire partition key at once
  sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
  sstables/trie: perform the BTI-encoding of position_in_partition eagerly
  types/comparable_bytes: add comparable_bytes_from_compound
  test/perf: add perf_bti_key_translation
2025-10-01 14:59:06 +03:00
Michał Chojnowski
621bfbe6d9 sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
test_column_family.py::test_sstables_by_key_reader_closed
injects a failure into `index_reader::advance_lower_and_check_if_present`.
To preserve this test when BTI indexes are made the default,
we have to add a corresponding error injection to
`bti_index_reader::advance_lower_and_check_if_present`.
2025-09-29 22:15:25 +02:00
Michał Chojnowski
420e215873 sstables/trie/bti_index_reader: allow the caller to pass a precalculated murmur hash
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.

That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.

So in this patch we let the caller pass that hash instead.

The old index interface, without the hash, is kept for convenience.

In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
2025-09-29 13:01:21 +02:00
Michał Chojnowski
cee4011e7a sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.

That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
2025-09-29 13:01:21 +02:00
Michał Chojnowski
ff5add4287 sstables/trie: BTI-translate the entire partition key at once
Delaying the BTI encoding of partition keys is a good idea,
because most of the time they don't have to be encoded.
Usually the token alone is enough for indexing purposes.

But for the translation of the `partition_key` part itself,
there's no good reason to make it lazy,
especially after we made the translation of clustering keys
eager in a previous commit. Let's get rid of the `std::generator`
and convert all cells of the partition key in one go.
2025-09-29 04:10:40 +02:00
Michał Chojnowski
4e35220734 sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
Using `std::generator` can incur unnecessary allocations
or confuse the optimizer. Let's replace it with something simpler.
2025-09-29 04:10:40 +02:00
Michał Chojnowski
88c9af3c80 sstables/trie: perform the BTI-encoding of position_in_partition eagerly
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.

In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
so that it performs the translation eagerly,
all components to a single bytes_ostream.

Note: the name *lazy*_comparable_bytes_from_clustering_position
stays, because the interface is still lazy.

perf_bti_key_translation:

Before:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch        9233   109.930us     0.000ns   109.930us   109.930us    4356.000       0.000   2615394.3    614709.6

After:
test                            iterations      median         mad         min         max      allocs       tasks        inst      cycles
lcb_mismatch_test.lcb_mismatch       50952    19.487us     0.000ns    19.487us    19.487us     198.000       0.000    603120.1    109042.9
2025-09-29 04:10:40 +02:00
Michał Chojnowski
191405fc51 sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file
Before this patch, `bti_index_reader::last_block_offset()` returns the
offset of the last block within the file.
But the old `index_reader::last_block_offset()` returns the offset
within the partition, and that's what the callers (i.e. reversed
sstable reader) expect.
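The convention can be sketched as simple offset arithmetic (names are illustrative, not the actual signatures):

```cpp
#include <cstdint>

// The reversed sstable reader expects the last block's offset relative to
// the start of the partition in Data.db, not relative to the start of the
// file, so the fix amounts to subtracting the partition's starting position.
uint64_t last_block_offset_in_partition(uint64_t last_block_file_offset,
                                        uint64_t partition_start_file_offset) {
    return last_block_file_offset - partition_start_file_offset;
}
```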

Fix `bti_index_reader::last_block_offset()` (and the corresponding
comment and test) to match `index_reader::last_block_offset()`.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
1f85069389 sstables/trie: support reader_permit and trace_state properly
Before this patch, the `reader_permit` taken by `bti_index_reader`
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.

Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.

(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
2025-09-17 12:22:40 +02:00
Michał Chojnowski
c9b0dbc580 sstables/trie/bti_node_reader: avoid calling into cached_file if the target position is already cached
Small optimization. In some places we call `load` on a position
that is in the currently-held page. In those cases we are needlessly
calling `cached_file::get_shared_page` for the same page again,
adding some work and some noise in CQL tracing. This patch adds
an `if` against that.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
caecc02d75 sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header
Before this patch, the "stack" of wrappers in `read_row_index_header`
is:
- row_index_header_parser
- continuous_data_consumer
- input_stream
- file_data_source_impl
- cached_file_impl
- cached_file::stream
- cached_file

The `cached_file_impl` and `file_data_source_impl` are superfluous.
We don't need to pretend the `cached_file` is a `seastar::file`,
in particular we don't need random reads.

We can use `cached_file::stream` to provide buffers for `input_stream`
directly.

Note: we use the `cached_file::stream` without any size hints (readahead).
This means that parsing a large partition key -- which spans many pages -- might
require multiple serialized disk fetches. This could be improved
(e.g. if the first two bytes of the entry, which contain the partition key's
length, are in the cached page, we could make a size hint out of them) but
we ignore this for now, under the assumption that multi-page partition
keys are a fringe use case.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
98b7655d2b sstables/trie/bti_index_reader: support BYPASS CACHE
Before this patch, `bti_index_reader` doesn't have a good
way to implement BYPASS CACHE.

In this patch we add a way, similar to what `index_reader` does:
we allow the caller to pass in the `cached_file` via a shared pointer.

If the caller wants the loads done by the index reader to remain cached,
he can pass in the `cached_file` owned by the `sstable`, shared by all
caching index readers.

If the caller doesn't want the loads to remain cached, he can pass
in a fresh `cached_file` which will be privately owned by the index
reader, and will be evicted when the index reader dies.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
5934639c4b sstables/trie: change the signature of bti_partition_index_writer::finish
Let's return `bti_partitions_db_footer` so that it can be directly
saved to `sstables::shareable_components` after the index write is
finished, without re-reading the footer from the file.

Let's take `const sstables::key&` arguments instead of
`disk_string_view<uint16_t>`, that's more natural.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
0b0f6a1cc4 sstables/bti_index: improve signatures of special member functions in index writers
Just a bit of constructor boilerplate: noexcept, move constructors,
`bool` operators for `optimized_optional`.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
e8f86315c7 sstables/trie: fix some comment typos in bti_index_reader.cc
Spell checkers are complaining.
2025-09-17 12:22:40 +02:00
Radosław Cybulski
436150eb52 treewide: fix spelling errors
Fix spelling errors reported by Copilot on GitHub.
Remove single use namespace alias.

Closes scylladb/scylladb#25960
2025-09-12 15:58:19 +03:00
Michał Chojnowski
11a5eec272 sstables/trie: introduce bti_index_reader
Implements an index reader (implementing abstract_index_reader,
which is the interface between Data.db readers and index readers)
for the BTI index written by bti_partition_index_writer and bti_row_index_writer.

The underlying trie walking logic is a thin wrapper around the logic
added in `trie_traversal.hh` in an earlier patch series.
2025-09-07 00:32:02 +02:00
Michał Chojnowski
0cbf6d8d24 sstables/trie: add bti_partition_index_writer.cc
Implements the Partitions.db writer, which maps the (BTI-encoded)
decorated keys to the position of the partition header
in either Rows.db (iff the partition possesses an intra-partition
index at all) or Data.db.

The format of Partitions.db is supposed to be as described in:
f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (partition-index)

I didn't actually test it for compatibility with Cassandra (yet?).
(I followed the docs linked above and Cassandra's source code,
but it could be that I have made a mistake somewhere).
2025-09-07 00:32:02 +02:00
Michał Chojnowski
7e8fd13208 sstables/trie: add bti_row_index_writer.cc
Implements the Rows.db writer, which for each partition that
possesses an intra-partition index writes a trie of separators
between clustering key blocks, and a header (footer?)
with enough metadata (partition key, partition tombstone)
to allow for a direct jump to a clustering row in Data.db.

The format of Rows.db is supposed to be as described in:
f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (row-index)
and the arbitrary details about the choice of clustering block separators
follow Cassandra's choices.

I didn't actually test it for compatibility with Cassandra (yet?).
(I followed the docs linked above and Cassandra's source code,
but it could be that I have made a mistake somewhere).
2025-09-07 00:32:02 +02:00
Michał Chojnowski
9a4b739b8d sstables/trie: add trie_writer::add_partial()
`trie_writer` has an `add()` method which can
add a new branch (key) with a payload at the end.
In a later patch, we will want a way to split this addition
into several steps, as a way to deal more naturally with fragmented
keys.

So this commit adds a method which adds new nodes but doesn't
attach a payload at the end.
This allows for a situation where leaf nodes are added without a
payload, which is not supposed to happen.
It's the user's responsibility to avoid that.

Note: this might be overengineered. Maybe we should force the
user to linearize the key. Maybe caring so much about fragmented
keys as first-class citizens is the wrong thing to do.
2025-09-07 00:30:15 +02:00
Michał Chojnowski
c6ad5511db sstables/trie: make bti_node_reader::page_ptr copy-constructible
In later commits, a trie cursor will be holding a `page_ptr`.
Sometimes we will want to copy a cursor, in particular
to do reset the upper bound of the index reader with
`_lower = _upper`. But currently `page_ptr` is non-copyable
-- it's a shared pointer, but with an explicit `share()` method
-- so a default operator= won't work for this.

Let's make `page_ptr` copyable for this purpose.
2025-09-07 00:30:15 +02:00
Michał Chojnowski
475cb18e90 sstables/trie: add an accessor to the file_writer under bti_node_sink
The row index writer will need both a trie_writer and
direct access to its underlying file_writer
(to write partition headers, which are "outside" of the trie).
And it would be weird to keep a separate reference to the file_writer
if the trie_writer already does that.

So let's add accessors needed to get to the file_writer&.
2025-09-07 00:30:15 +02:00
Michał Chojnowski
0684dbb5bd sstables/trie: fix a special case in max_offset_from_child
A `writer_node` contains a chain of multiple BTI nodes.
`writer_node::_node_size` is the size (in bytes) of the
entire chain.

But the parent of the `writer_node` wants to know the offset
to the rootmost node in the chain. This can be deduced from the
`writer_node::_transition_length` and the `writer_node::_node_size`.

But there's an error in the logic for that, for the special case
when there are two nodes in the chain.
The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if
*the leafmost node* is smaller than 16 bytes, which is
true if and only if `_node_size` is smaller than 19 bytes.

But the current logic compares `_node_size` against 16.
That's incorrect. This patch fixes that.
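As a standalone illustration of the corrected check (this is not the actual code; the constant 3 is the SINGLE_NOPAYLOAD_12 node size implied by the 16-vs-19 relationship stated above, not taken from the implementation):

```cpp
#include <cstdint>

// Assumed size of a SINGLE_NOPAYLOAD_12 node, deduced from the commit text:
// "leafmost smaller than 16 bytes" iff "_node_size smaller than 19 bytes".
constexpr uint32_t single_nopayload_12_size = 3;

// For a two-node chain of total size node_size, the rootmost node is
// SINGLE_NOPAYLOAD_12 iff the leafmost node is smaller than 16 bytes,
// i.e. iff node_size < 19. The buggy logic compared node_size against 16,
// misclassifying chains of total size 16-18.
bool rootmost_is_single_nopayload_12(uint32_t node_size) {
    return node_size - single_nopayload_12_size < 16; // equivalently: node_size < 19
}
```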

There was a test for this branch, but it was not good enough.
It only tested payloads of size 1 and 20, but the problem
is only revealed by payloads of size 13-14.
The test has been extended to cover all sizes between 1 and 20.
2025-09-07 00:30:15 +02:00
Michał Chojnowski
b2d793552f sstables/trie: handle partition_regions other than clustered in BTI position encoding
As far as I know, positions outside the clustering region are never
passed to sstable indexes. But since they are representable within
the argument type of lazy_comparable_bytes_from_clustering_position,
let's make it handle them.
2025-09-07 00:30:08 +02:00
Michał Chojnowski
b3c20e8cc3 sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
When cleaning up `lcb_mismatch` for review, I managed to forget
that in the follow-up series I want to use it with iterators
for which `*it` points to data which is invalidated by `++it`.

(The data in `lazy_comparable_bytes_*` generators
is kept a vector of `bytes`, so `*it` can point to
the internal storage of `bytes`. Generating a new fragment
via `++it` can resize the vector, move the `bytes`,
and invalidate the `*it`.)

So during the pre-review cleanup, `lcb_mismatch` ended up
in a shape which isn't prepared for that.

This commit shuffles the control flow around so that
`++it` is delayed after the span obtained with `*it` is exhausted.
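The fixed ordering can be modeled with plain indices standing in for the real iterators (hypothetical names, simplified types): we only move to the next fragment after the current one is fully consumed, which is exactly the point where the real `++it` is allowed to invalidate the previous span:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the invalidation-safe control flow: find the length of the
// common prefix of two fragmented byte strings. Advancing to the next
// fragment (which, in the real code, may invalidate the span previously
// obtained via *it) happens only once the current fragment is exhausted.
std::size_t fragmented_mismatch(const std::vector<std::string>& a,
                                const std::vector<std::string>& b) {
    std::size_t ai = 0, bi = 0; // fragment indices ("iterators")
    std::size_t ao = 0, bo = 0; // offsets within the current fragments
    std::size_t common = 0;
    while (ai < a.size() && bi < b.size()) {
        if (ao == a[ai].size()) { ++ai; ao = 0; continue; } // advance only when exhausted
        if (bo == b[bi].size()) { ++bi; bo = 0; continue; }
        if (a[ai][ao] != b[bi][bo]) {
            return common;
        }
        ++ao; ++bo; ++common;
    }
    return common;
}
```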
2025-09-04 15:02:29 +02:00
Michał Chojnowski
413dcf8891 sstables/trie: add BTI key translation routines
This file provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.

This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.

For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)

The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.
2025-08-15 11:13:00 +02:00
Michał Chojnowski
0ffe336887 sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file
In a later commit, this concept will be used in a place that's not
dependent on trie traversal routines. So extract it to its own header.
2025-08-14 02:06:34 +02:00
Michał Chojnowski
347e5c534a sstables/trie: allow comparable_bytes_iterator to return a mutable span
`comparable_bytes_iterator` is a concept for iterating over the
fragments of a key translated to BTI encoding.
In `trie_traversal.hh`, those fragments are
`std::span<const std::byte>`, because the traversal routines
have no use for modifying the fragments.

But in a later commit we will also have to deal with encoded
keys during row index writes, and the row index writer will want
to modify the bytes, to nudge the mismatch byte by one in order
to obtain a key separator.

Let's extend this concept to allow both span<const byte>
and span<byte>, so that it can be used in both situations.
2025-08-14 01:54:57 +02:00
Michał Chojnowski
3017dbb204 sstables/trie: add trie traversal routines
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.

This commit adds encoding-agnostic functions which drive
the `trie::node_reader` in a loop to traverse the whole branch.

Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.

Closes scylladb/scylladb#25396
2025-08-11 19:15:09 +03:00
Michał Chojnowski
85964094f6 sstables/trie: implement BTI node traversal
This commit implements routines for traversal of BTI nodes in their
on-disk format.
The `node_reader` concept is currently unused (i.e. not asserted by any
template).

It will only be used in the next PR, which will implement trie cursor
routines parametrized by `node_reader`.
But I'm including it in this PR to make it clear which functions
will be needed by the higher layer.
2025-08-05 00:56:50 +02:00
Michał Chojnowski
302adfb50d sstables/trie: implement BTI serialization
This commit introduces code responsible for serializing
trie nodes (`writer_node`) into the on-disk BTI format,
as described in:
f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md
2025-08-05 00:56:50 +02:00
Michał Chojnowski
c8682af418 sstables: introduce trie_writer
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).

As of this patch, the new code isn't used for anything yet,
but we introduce it separately from its users to keep PRs small enough
for reviewability.

This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:

1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).

It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
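The easy core of the construction can be sketched as a toy (not Scylla's code; it ignores paging, node sizing and payloads entirely): with sorted input, only the common prefix with the previous key stays "open", and every node below it is complete and could be flushed immediately.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model of the incremental "sorted keys -> trie" build: each character
// beyond the common prefix with the previous key creates a new trie node.
// This counts how many (single-character) nodes the whole stream creates.
std::size_t count_trie_nodes(const std::vector<std::string>& sorted_keys) {
    std::size_t nodes = 0;
    std::string prev;
    for (const auto& key : sorted_keys) {
        std::size_t lcp = 0; // longest common prefix with the previous key
        while (lcp < prev.size() && lcp < key.size() && prev[lcp] == key[lcp]) {
            ++lcp;
        }
        nodes += key.size() - lcp; // new characters -> new nodes
        prev = key;
    }
    return nodes;
}
```

For `{"abc", "abd", "b"}` this counts 5 nodes (a, b, c, d under the shared prefix, plus the top-level b), matching the trie's shape.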

This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:

1. The writer operates on chains of characters, rather than single characters.

   In Cassandra's implementation, the writer creates one node per character.
   A single long key can be translated to thousands of nodes.
   We create only one node per key. (Actually we split very long keys into
   a few nodes, but that's arbitrary and beside the point).

   For BTI's partition key index this doesn't matter.
   Since it only stores a minimal unique prefix of each key,
   and the trie is very balanced (due to token randomness),
   the average number of new characters added per key is very close to 1 anyway.
   (And the string-based logic might actually be a small pessimization, since
   manipulating a 1-byte string might be costlier than manipulating a single byte).

   But the row index might store arbitrarily long entries, and in that case the
   character-based logic might result in catastrophically bad performance.
   For reference: when writing a partition index, the total processing cost
   of a single node in the trie_writer is on the order of 800 instructions.
   Total processing cost of a single tiny partition during a `upgradesstables`
   operation is on the order of 10000 instructions. A small INSERT is on the
   order of 40000 instructions.

   So processing a single 1000-character clustering key in the trie_writer
   could cost as much as 20 INSERTs, which is scary. Even 100-character keys
   can be very expensive. With extremely long keys like that, the string-based
   logic is more than ~100x cheaper than character-based logic.
   (Note that only *new* characters matter here. If two index entries share a
   prefix, that prefix is only processed once. And the index is only populated
   with the minimal prefix needed to distinguish neighbours. So in practice,
   long chains might not happen often. But still, they are possible).

   I don't know if it makes sense to care about this case, but I figured the
   potential for problems is too big to ignore, so I switched to chain-based logic.

2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
   than a full page after revising the estimate, Cassandra splits it in a
   different way than us.

For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.

The serialization logic is passed to trie_writer via a template parameter.

There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage.) But I also tested it with larger parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
2025-07-31 12:51:37 +02:00