This is the first part of a larger project meant to implement a trie-based index format (the same, or almost the same, as Cassandra's BTI). As of this patch, the new code isn't used for anything yet, but we introduce it separately from its users to keep PRs small enough for review.

This commit introduces trie_writer, a class responsible for turning a stream of (key, value) pairs (already sorted by key) into a stream of serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).

It does mostly what you would expect a "sorted keys -> trie" builder to do. The hard part is calculating the sizes of nodes (which, in a well-packed on-disk format, depend on the exact offsets of the node from its children) and grouping them into pages.

This implementation mostly follows Cassandra's design of the same thing. There are some notable differences, though:

1. The writer operates on chains of characters, rather than single characters. In Cassandra's implementation, the writer creates one node per character, so a single long key can be translated into thousands of nodes. We create only one node per key. (Actually, we split very long keys into a few nodes, but that's arbitrary and beside the point.)

For BTI's partition key index this doesn't matter: since it only stores a minimal unique prefix of each key, and the trie is very well balanced (due to token randomness), the average number of new characters added per key is very close to 1 anyway. (The string-based logic might actually be a small pessimization there, since manipulating a 1-byte string might be costlier than manipulating a single byte.) But the row index might store arbitrarily long entries, and in that case the character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost of a single node in the trie_writer is on the order of 800 instructions. The total processing cost of a single tiny partition during an `upgradesstables` operation is on the order of 10000 instructions, and a small INSERT is on the order of 40000 instructions. So processing a single 1000-character clustering key in the trie_writer could cost as much as 20 INSERTs, which is scary. Even 100-character keys can be very expensive. With extremely long keys like that, the string-based logic is more than ~100x cheaper than the character-based logic.

(Note that only *new* characters matter here. If two index entries share a prefix, that prefix is only processed once, and the index is only populated with the minimal prefix needed to distinguish neighbours. So in practice, long chains might not happen often. But they are possible.) I don't know if it makes sense to care about this case, but I figured the potential for problems is too big to ignore, so I switched to chain-based logic.

2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger than a full page after revising the estimate, Cassandra splits it in a different way than we do.

For testability, there is some separation between the logic responsible for turning a stream of keys into a stream of nodes, and the logic responsible for turning a stream of nodes into a stream of bytes. This commit only includes the first part; it doesn't implement the target on-disk format yet. The serialization logic is passed to trie_writer via a template parameter.

There is only one test added in this commit, which attempts to be exhaustive by testing all possible datasets up to some size. The run time of the test grows exponentially with the size parameters. I picked a set of parameters which runs fast enough while still being expressive enough to cover all the logic (I checked the code coverage).
But I also tested it with greater parameters on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
/*
 * Copyright (C) 2024-present ScyllaDB
 */

/*
 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
 */

#pragma once

#include "writer_node.hh"
#include "utils/small_vector.hh"

namespace sstables::trie {
template <trie_writer_sink Output>
sink_offset writer_node::recalc_sizes(ptr<writer_node> self, const Output& out, sink_pos global_pos) {
    // Note: this routine is very similar to `write` in its overall structure.

    // A trie can be arbitrarily deep, so we can't afford to use recursion.
    // We have to maintain a walk stack manually.
    struct local_state {
        ptr<writer_node> _node;
        // The index of the next child to be visited.
        // When equal to the number of children, it's time to walk out of the node.
        int _stage;
        // The start position of the node's entire subtree.
        sink_pos _startpos;
    };
    utils::small_vector<local_state, 8> stack;
    // Depth-first walk.
    // To calculate the sizes, we essentially simulate a write of the tree.
    stack.push_back({self, 0, global_pos});
    while (!stack.empty()) {
        // Caution: `node`, `stage`, `startpos` are references to contents of `stack`.
        // Any pushing or popping will invalidate them, so be sure not to use them afterwards!
        auto& [node, stage, startpos] = stack.back();
        // If the node has out-of-page descendants, the sizes of children might have changed
        // from the estimate, so we have to recurse into the children and update them.
        if (stage < static_cast<int>(node->get_children().size()) && node->_has_out_of_page_descendants) {
            stage += 1;
            stack.push_back({node->get_children()[stage - 1], 0, global_pos});
            continue;
        }
        // If we got here, then either we have recursed into children
        // (and then global_pos was updated accordingly),
        // or we skipped that because there was no need. In the latter case,
        // we have to update global_pos manually here.
        if (!node->_has_out_of_page_descendants) {
            global_pos = global_pos + node->_branch_size;
        }
        node->_branch_size = global_pos - startpos;
        if (node->_has_out_of_page_children || node->_has_out_of_page_descendants) {
            // The size of children might have changed, which might have in turn changed
            // our offsets to children, which might have changed our size.
            // We have to recalculate.
            node->_node_size = out.serialized_size(*node, sink_pos(global_pos));
            expensive_assert(uint64_t(node->_node_size.value) <= out.page_size());
        }
        global_pos = global_pos + node->_node_size;
        stack.pop_back();
    }
    return self->_branch_size + self->_node_size;
}

template <trie_writer_sink Output>
void writer_node::write(ptr<writer_node> self, Output& out, bool guaranteed_fit) {
    expensive_log(
        "writer_node::write: subtree={} pos={}, ps={} sz={} ",
        fmt::ptr(self.get()), out.pos().value, out.page_size(), (self->_branch_size + self->_node_size).value);
    expensive_assert(!self->_pos.valid());
    expensive_assert(self->_node_size.valid());
    if (guaranteed_fit) {
        auto page_of_first_byte = round_down(out.pos().value, out.page_size());
        auto page_of_last_byte = round_down((out.pos() + self->_branch_size + self->_node_size).value - 1, out.page_size());
        expensive_assert(page_of_first_byte == page_of_last_byte);
    }
    auto starting_pos = out.pos();
    // A trie can be arbitrarily deep, so we can't afford to use recursion.
    // We have to maintain a walk stack manually.
    struct local_state {
        ptr<writer_node> _node;
        // The index of the next child to be visited.
        // When equal to the number of children, it's time to walk out of the node.
        int _stage;
        // The start position of the node's entire subtree.
        sink_pos _startpos;
    };
    // Note: the partition index should almost always have depth smaller than 6.
    // The row index can have a depth of up to several thousand.
    utils::small_vector<local_state, 8> stack;
    // Depth-first walk.
    // We write out the subtree in postorder.
    stack.push_back({self, 0, starting_pos});
    while (!stack.empty()) {
        // Caution: `node`, `stage`, `startpos` are references to contents of `stack`.
        // Any pushing or popping will invalidate them, so be sure not to use them afterwards!
        auto& [node, stage, startpos] = stack.back();
        if (stage < static_cast<int>(node->get_children().size())) {
            stage += 1;
            if (!node->get_children()[stage - 1]->_pos.valid()) {
                stack.push_back({node->get_children()[stage - 1], 0, out.pos()});
            }
            continue;
        }

        // Leaf nodes must have a payload.
        expensive_assert(node->get_children().size() || node->_payload._payload_bits);

        expensive_log("writer_node::write: node={} pos={} n_children={} size={} transition_length={}",
            fmt::ptr(node.get()), out.pos().value, node->get_children().size(), node->_node_size.value, node->_transition_length);

        if (guaranteed_fit) {
            SCYLLA_ASSERT(out.pos() - startpos == node->_branch_size);
            node->_pos = sink_pos(out.write(*node, sink_pos(out.pos())));
            SCYLLA_ASSERT(out.pos() - startpos == node->_branch_size + node->_node_size);
        } else {
            if (uint64_t(out.serialized_size(*node, sink_pos(out.pos())).value) > out.bytes_left_in_page()) {
                out.pad_to_page_boundary();
            }
            node->_branch_size = out.pos() - startpos;
            auto before = out.pos();
            node->_pos = sink_pos(out.write(*node, sink_pos(out.pos())));
            node->_node_size = node_size((out.pos() - before).value);
            expensive_assert(uint64_t(node->_node_size.value) <= out.page_size());
        }

        node->_first_transition_byte = std::byte(node->_transition[0]);
        stack.pop_back();
    }
    expensive_assert(out.pos() - starting_pos == self->_branch_size + self->_node_size);
    self->_branch_size = sink_offset(0);
    self->_node_size = node_size(0);
    self->_has_out_of_page_children = false;
    self->_has_out_of_page_descendants = false;
}

} // namespace sstables::trie