Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.
This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.
Fix that, and add a test that catches this.
Fixes scylladb/scylladb#27082
Closes scylladb/scylladb#27083
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments
(those covering the range `[_last_key_mismatch, needed_prefix)`),
trim the first and last of them appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
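For illustration, a minimal sketch of the corrected trimming (hypothetical names; `feed` stands in for the trie writer's input). Fragments that lie entirely before `_last_key_mismatch` are skipped instead of being emitted as empty spans:

```
#include <algorithm>
#include <cstddef>
#include <span>
#include <vector>

// Clamp each fragment to [mismatch, needed_prefix) and feed the
// non-empty results to the trie writer. With `aaa|bbb|ccc`,
// mismatch == 3 and needed_prefix == 7, this feeds `bbb` and `c`;
// the first fragment is skipped rather than fed as an empty span.
void feed_middle_fragments(const std::vector<std::span<const std::byte>>& fragments,
                           std::size_t mismatch, std::size_t needed_prefix,
                           auto&& feed) {
    std::size_t begin = 0; // absolute offset of the current fragment's first byte
    for (auto frag : fragments) {
        std::size_t end = begin + frag.size();
        std::size_t lo = std::max(begin, mismatch);
        std::size_t hi = std::min(end, needed_prefix);
        if (lo < hi) {
            feed(frag.subspan(lo - begin, hi - lo));
        }
        begin = end;
    }
}
```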
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixes scylladb/scylladb#26819
Closes scylladb/scylladb#26839
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they perform the key translation eagerly,
translating all components into a single bytes_ostream in one synchronous call.
perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
```
Enhancement, backport not required.
Closes scylladb/scylladb#26302
* github.com:scylladb/scylladb:
sstables/trie: BTI-translate the entire partition key at once
sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
sstables/trie: perform the BTI-encoding of position_in_partition eagerly
types/comparable_bytes: add comparable_bytes_from_compound
test/perf: add perf_bti_key_translation
test_column_family.py::test_sstables_by_key_reader_closed
injects a failure into `index_reader::advance_lower_and_check_if_present`.
To preserve this test when BTI indexes are made the default,
we have to add a corresponding error injection to
`bti_index_reader::advance_lower_and_check_if_present`.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.
So in this patch we let the caller pass that hash instead.
The old index interface, without the hash, is kept for convenience.
In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
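A hedged sketch of the shape of the change on the writer side (illustrative declarations, not the actual ScyllaDB signatures; the reader side of the previous commit is analogous):

```
#include <cstdint>
#include <string_view>

// Illustrative only: `bytes_view` stands in for the real key type.
using bytes_view = std::string_view;

struct partitions_db_writer_sketch {
    // Old interface, kept for convenience: the murmur hash is
    // recomputed internally from the serialized partition key.
    void add(bytes_view partition_key, uint64_t data_file_pos);

    // New interface: the caller passes the hash it already computed
    // for the bloom filter, so it isn't computed twice.
    void add(bytes_view partition_key, uint64_t murmur_hash, uint64_t data_file_pos);
};
```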
Delaying the BTI encoding of partition keys is a good idea,
because most of the time they don't have to be encoded.
Usually the token alone is enough for indexing purposes.
But for the translation of the `partition_key` part itself,
there's no good reason to make it lazy,
especially after we made the translation of clustering keys
eager in a previous commit. Let's get rid of the `std::generator`
and convert all cells of the partition key in one go.
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
so that it performs the translation eagerly,
translating all components into a single bytes_ostream.
Note: the name *lazy*_comparable_bytes_from_clustering_position
stays, because the interface is still lazy.
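A minimal sketch of the new shape, with a toy encoding (the real code writes the BTI byte-comparable form of each component into a `bytes_ostream`; names here are illustrative):

```
#include <cstddef>
#include <span>
#include <string_view>
#include <vector>

// All components are translated eagerly, in one synchronous pass, into
// a single contiguous buffer; the fragment-based ("lazy") interface is
// kept, but it only ever yields this one finished fragment.
struct eager_comparable_bytes_sketch {
    std::vector<std::byte> _buf; // stands in for bytes_ostream

    explicit eager_comparable_bytes_sketch(std::span<const std::string_view> components) {
        for (auto c : components) {
            for (char ch : c) { // toy stand-in for the real BTI encoding
                _buf.push_back(static_cast<std::byte>(ch));
            }
            _buf.push_back(std::byte{0}); // toy component terminator
        }
    }

    std::span<const std::byte> single_fragment() const { return _buf; }
};
```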
perf_bti_key_translation:
```
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
```
Before this patch, `bti_index_reader::last_block_offset()` returns the
offset of the last block within the file.
But the old `index_reader::last_block_offset()` returns the offset
within the partition, and that's what the callers (i.e. reversed
sstable reader) expect.
Fix `bti_index_reader::last_block_offset()` (and the corresponding
comment and test) to match `index_reader::last_block_offset()`.
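In other words (illustrative arithmetic only, not the actual reader code):

```
#include <cstdint>

// The fix makes bti_index_reader return the same partition-relative
// value as index_reader: the absolute file offset of the last block,
// minus the offset at which the current partition starts.
uint64_t last_block_offset_sketch(uint64_t block_file_offset,
                                  uint64_t partition_start_offset) {
    return block_file_offset - partition_start_offset;
}
```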
Before this patch, the `reader_permit` taken by `bti_index_reader`
wasn't actually being passed down to disk reads. In this patch,
we fix this FIXME by propagating the permit down to the I/O
operations on the `cached_file`.
Also, it didn't take `trace_state_ptr` at all.
In this patch, we add a `trace_state_ptr` argument and propagate
it down to disk reads.
(We combine the two changes because the permit and the trace state
are passed together everywhere anyway).
Small optimization. In some places we call `load` on a position
that is in the currently-held page. In those cases we are needlessly
calling `cached_file::get_shared_page` for the same page again,
adding some work and some noise in CQL tracing. This patch adds
an `if` against that.
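A sketch of the added guard, with hypothetical member names:

```
#include <cstdint>

struct cursor_sketch {
    static constexpr uint64_t page_size = 4096;
    uint64_t _cached_page = uint64_t(-1); // index of the page we currently hold

    void load(uint64_t pos) {
        if (pos / page_size == _cached_page) {
            return; // page already held: skip get_shared_page entirely
        }
        _cached_page = pos / page_size;
        // ... the real code would call cached_file::get_shared_page here ...
    }
};
```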
Before this patch, the "stack" of wrappers in `read_row_index_header`
is:
- row_index_header_parser
- continuous_data_consumer
- input_stream
- file_data_source_impl
- cached_file_impl
- cached_file::stream
- cached_file
The `cached_file_impl` and `file_data_source_impl` are superfluous.
We don't need to pretend the `cached_file` is a `seastar::file`,
in particular we don't need random reads.
We can use `cached_file::stream` to provide buffers for `input_stream`
directly.
Note: we use the `cached_file::stream` without any size hints (readahead).
This means that parsing a large partition key -- which spans many pages -- might
require multiple serialized disk fetches. This could be improved
(e.g. if the first two bytes of the entry, which contain the length of the
partition key, are in the cached page, we could make a size hint out of them) but
we ignore this for now, under the assumption that multi-page partition
keys are a fringe use case.
Before this patch, `bti_index_reader` doesn't have a good
way to implement BYPASS CACHE.
In this patch we add a way, similar to what `index_reader` does:
we allow the caller to pass in the `cached_file` via a shared pointer.
If the caller wants the loads done by the index reader to remain cached,
it can pass in the `cached_file` owned by the `sstable`, shared by all
caching index readers.
If the caller doesn't want the loads to remain cached, it can pass
in a fresh `cached_file` which will be privately owned by the index
reader, and will be evicted when the index reader dies.
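A self-contained sketch of the idea (toy types; `std::shared_ptr` stands in for whatever shared pointer the real code uses):

```
#include <memory>
#include <utility>

struct cached_file_toy {}; // stands in for the real cached_file

// The reader keeps its cached_file alive via a shared pointer;
// the caller picks which one to hand over.
struct bti_index_reader_sketch {
    std::shared_ptr<cached_file_toy> _file;
    explicit bti_index_reader_sketch(std::shared_ptr<cached_file_toy> f)
        : _file(std::move(f)) {}
};

bti_index_reader_sketch make_reader(std::shared_ptr<cached_file_toy> sstable_owned,
                                    bool bypass_cache) {
    // Cached reads: share the sstable-owned file, pages outlive the reader.
    // BYPASS CACHE: a private file whose pages are evicted with the reader.
    return bti_index_reader_sketch(
        bypass_cache ? std::make_shared<cached_file_toy>() : std::move(sstable_owned));
}
```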
Let's return `bti_partitions_db_footer` so that it can be directly
saved to `sstables::shareable_components` after the index write is
finished, without re-reading the footer from the file.
Let's take `const sstables::key&` arguments instead of
`disk_string_view<uint16_t>`, that's more natural.
Implements an index reader (implementing abstract_index_reader,
which is the interface between Data.db readers and index readers)
for the BTI index written by bti_partition_index_writer and bti_row_index_writer.
The underlying trie walking logic is a thin wrapper around the logic
added in `trie_traversal.hh` in an earlier patch series.
Implements the Partitions.db writer, which maps the (BTI-encoded)
decorated keys to the position of the partition header
in either Rows.db (iff the partition possesses an intra-partition
index at all) or Data.db.
The format of Partitions.db is supposed to be as described in:
https://github.com/apache/cassandra/blob/f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#partition-index
I didn't actually test it for compatibility with Cassandra (yet?).
(I followed the docs linked above and Cassandra's source code,
but it could be that I have made a mistake somewhere).
Implements the Rows.db writer, which for each partition that
possesses an intra-partition index writes a trie of separators
between clustering key blocks, and a header (footer?)
with enough metadata (partition key, partition tombstone)
to allow for a direct jump to a clustering row in Data.db.
The format of Rows.db is supposed to be as described in:
https://github.com/apache/cassandra/blob/f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#row-index
and the arbitrary details about the choice of clustering block separators
follow Cassandra's choices.
I didn't actually test it for compatibility with Cassandra (yet?).
(I followed the docs linked above and Cassandra's source code,
but it could be that I have made a mistake somewhere).
`trie_writer` has an `add()` method which can
add a new branch (key) with a payload at the end.
In a later patch, we will want a way to split this addition
into several steps, as a way to deal more naturally with fragmented
keys.
So this commit adds a method which adds new nodes but doesn't
attach a payload at the end.
This allows for a situation where leaf nodes are added without a
payload, which is not supposed to happen.
It's the user's responsibility to avoid that.
Note: this might be overengineered. Maybe we should force the
user to linearize the key. Maybe caring so much about fragmented
keys as first-class citizens is the wrong thing to do.
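A hypothetical usage sketch (the method names are illustrative, not the actual `trie_writer` API):

```
#include <cstdint>
#include <string_view>

struct trie_writer_sketch {
    // Adds nodes for a key fragment without attaching a payload.
    void add_nodes(std::string_view fragment);
    // Adds the final fragment and attaches the payload to the leaf.
    void add_leaf(std::string_view fragment, uint64_t payload);
};

void write_fragmented_key(trie_writer_sketch& w) {
    w.add_nodes("aaa");  // leaf is payload-less for now
    w.add_nodes("bbb");
    w.add_leaf("c", 42); // the caller must remember to finish with a payload
}
```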
In later commits, a trie cursor will be holding a `page_ptr`.
Sometimes we will want to copy a cursor, in particular
to reset the upper bound of the index reader with
`_lower = _upper`. But currently `page_ptr` is non-copyable
-- it's a shared pointer, but with an explicit `share()` method
-- so a default operator= won't work for this.
Let's add a copy-assignable wrapper for `page_ptr` for this purpose.
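A minimal self-contained sketch of the idea (toy types; the real `page_ptr` lives in ScyllaDB and exposes an explicit `share()`):

```
// Toy stand-in for page_ptr: movable, with an explicit share()
// instead of a copy constructor.
struct page_ptr_toy {
    page_ptr_toy() = default;
    page_ptr_toy(page_ptr_toy&&) = default;
    page_ptr_toy& operator=(page_ptr_toy&&) = default;
    page_ptr_toy share() const { return page_ptr_toy{}; }
};

// Hypothetical copy-assignable wrapper: copies go through share(), so
// a default member-wise `_lower = _upper` cursor assignment just works.
struct copyable_page_ptr_sketch {
    page_ptr_toy _ptr;

    copyable_page_ptr_sketch() = default;
    copyable_page_ptr_sketch(copyable_page_ptr_sketch&&) = default;
    copyable_page_ptr_sketch& operator=(copyable_page_ptr_sketch&&) = default;
    copyable_page_ptr_sketch(const copyable_page_ptr_sketch& o) : _ptr(o._ptr.share()) {}
    copyable_page_ptr_sketch& operator=(const copyable_page_ptr_sketch& o) {
        _ptr = o._ptr.share();
        return *this;
    }
};
```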
The row index writer will need both a trie_writer and
direct access to its underlying file_writer
(to write partition headers, which are "outside" of the trie).
And it would be weird to keep a separate reference to the file_writer
if the trie_writer already does that.
So let's add accessors needed to get to the file_writer&.
A `writer_node` contains a chain of multiple BTI nodes.
`writer_node::_node_size` is the size (in bytes) of the
entire chain.
But the parent of the `writer_node` wants to know the offset
to the rootmost node in the chain. This can be deduced from the
`writer_node::_transition_length` and the `writer_node::_node_size`.
But there's an error in the logic for that, for the special case
when there are two nodes in the chain.
The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if
*the leafmost node* is smaller than 16 bytes, which is
true if and only if `_node_size` is smaller than 19 bytes.
But the current logic compares `_node_size` against 16.
That's incorrect. This patch fixes that.
There was a test for this branch, but it was not good enough.
It only tested payloads of size 1 and 20, but the problem
is only revealed by payloads of size 13-14.
The test has been extended to cover all sizes between 1 and 20.
As far as I know, positions outside the clustering region are never
passed to sstable indexes. But since they are representable within
the argument type of lazy_comparable_bytes_from_clustering_position,
let's make it handle them.
When cleaning up `lcb_mismatch` for review, I managed to forget
that in the follow-up series I want to use it with iterators
for which `*it` points to data which is invalidated by `++it`.
(The data in `lazy_comparable_bytes_*` generators
is kept in a vector of `bytes`, so `*it` can point to
the internal storage of `bytes`. Generating a new fragment
via `++it` can resize the vector, move the `bytes`,
and invalidate `*it`.)
So during the pre-review cleanup, `lcb_mismatch` ended up
in a shape which isn't prepared for that.
This commit shuffles the control flow around so that
`++it` is delayed until the span obtained with `*it` is exhausted.
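The reshaped loop, in sketch form (hypothetical `consume`; the real consumers differ):

```
#include <cstddef>
#include <span>

// The span obtained from *it is fully drained before ++it runs;
// advancing the iterator may reallocate the generator's storage and
// invalidate any outstanding span.
template <typename Iter, typename Sentinel, typename Consume>
void walk_fragments(Iter it, Sentinel end, Consume consume) {
    while (it != end) {
        std::span<const std::byte> frag = *it;
        while (!frag.empty()) {
            std::size_t n = consume(frag); // returns how many bytes it took
            frag = frag.subspan(n);
        }
        ++it; // safe: nothing points into the old fragment anymore
    }
}
```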
This file provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.
This translation is performed whenever a new decorated key or clustering block
is added to a BTI index, and whenever a BTI index is queried for a range of positions.
For a description of the encoding, see
https://github.com/apache/cassandra/blob/fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md#multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls
The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.
`comparable_bytes_iterator` is a concept for iterating over the
fragments of a key translated to BTI encoding.
In `trie_traversal.hh`, those fragments are
`std::span<const std::byte>`, because the traversal routines
have no use for modifying the fragments.
But in a later commit we will also have to deal with encoded
keys during row index writes, and the row index writer will want
to modify the bytes, to nudge the mismatch byte by one in order
to obtain a key separator.
Let's extend this concept to allow both span<const byte>
and span<byte>, so that it can be used in both situations.
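A hedged sketch of the widened concept (not the actual definition in `trie_traversal.hh`):

```
#include <concepts>
#include <cstddef>
#include <iterator>
#include <span>

// Dereferencing must yield a span of bytes: const is enough for
// traversal, mutable is needed by the row index writer to nudge the
// mismatch byte when forming a separator.
template <typename T>
concept comparable_bytes_iterator_sketch =
    std::input_iterator<T>
    && (std::same_as<std::iter_value_t<T>, std::span<const std::byte>>
        || std::same_as<std::iter_value_t<T>, std::span<std::byte>>);
```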
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.
This commit adds encoding-agnostic functions which drive
the `trie::node_reader` in a loop to traverse the whole branch.
Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.
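The intended usage shape, sketched (only the names `traverse`, `step`, `step_back`, and `ancestor_trail` come from this series; the signatures are made up):

```
// ancestor_trail trail;          // records the path from the root
// traverse(reader, key, trail);  // descend along the encoded key
// step(reader, trail);           // move to the next key in order...
// step_back(reader, trail);      // ...or back to the previous one
```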
Closes scylladb/scylladb#25396
This commit implements routines for traversal of BTI nodes in their
on-disk format.
The `node_reader` concept is currently unused (i.e. not asserted by any
template).
It will only be used in the next PR, which will implement trie cursor
routines parametrized by `node_reader`.
But I'm including it in this PR to make it clear which functions
will be needed by the higher layer.
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but we introduce it separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
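For a feel of the interface, a hedged usage sketch (call names are illustrative; as noted further down, the serialization logic is supplied via a template parameter):

```
// trie_writer<my_serializer> w(...);
// w.add("apple",  payload_1);
// w.add("apply",  payload_2);  // only the new suffix "y" creates nodes
// w.add("banana", payload_3);  // keys must arrive in sorted order
// w.finish();                  // flushes the remaining nodes, page-packed
```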
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during an `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than we do.
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage.) But I also tested it with larger parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).