Introduce `ms` -- a new sstable format version which
is a hybrid of Cassandra's `me` and `da`.
It is based on `me`, but with the index components
(Summary.db and Index.db) replaced with the index
components of `da` (Partitions.db and Rows.db).
As of this patch, the version is not yet chosen
anywhere for writing sstables; it is only introduced.
We will add it to unit tests in a later commit,
and expose it to users in a yet later commit.
Later in this patch series we will introduce `ms` as the new highest
format, but we won't be able to make it the default within the same
series due to some dtest incompatibilities.
Until `ms` is the default, we don't want `scylla sstable` to default to
it, even though it's the highest. Let's choose the default
version in `scylla sstable` using the same method which is
used by Scylla in general: by letting the `sstable_manager` choose.
Partitions.db uses a piece of the murmur hash of the partition key
internally. The same hash is used to query the bloom filter.
So to avoid computing the hash twice (which involves converting the
key into a hashable linearized form) it would make sense to use
the same `hashed_key` for both purposes.
This is what we do in this patch. We extract the computation
of the `hashed_key` from `make_pk_filter` up to its parent
`sstable_set_impl::create_single_key_sstable_reader`,
and we pass this hash down both to `make_pk_filter` and
to the sstable reader. (And we add a pointer to the `hashed_key`
as a parameter to all functions along the way, to propagate it).
The number of parameters to `mx::make_reader` is getting uncomfortable.
Maybe they should be packed into some structs.
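The idea can be sketched with toy stand-ins (all names here are hypothetical, not the real Scylla types): the caller computes the hash once and hands it to both consumers.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>

// Hypothetical stand-ins for the partition key and its murmur hash.
using hashed_key = uint64_t;

hashed_key compute_hash(const std::string& pk) {
    // Placeholder for hashing the linearized partition key.
    return std::hash<std::string>{}(pk);
}

// Stubs standing in for the bloom filter and the Partitions.db reader.
bool bloom_filter_maybe_present(hashed_key) { return true; }
bool partitions_db_lookup(const std::string&, hashed_key) { return true; }

// After the patch: the hash is computed once in the caller and passed
// down to both the bloom filter check and the index reader.
bool single_key_read(const std::string& pk) {
    hashed_key h = compute_hash(pk); // computed exactly once
    if (!bloom_filter_maybe_present(h)) {
        return false; // definitely absent, skip the index
    }
    return partitions_db_lookup(pk, h);
}
```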
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.
So in this patch we let the caller pass that hash instead.
The old index interface, without the hash, is kept for convenience.
In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
In previous patches we (hopefully) modified all users of
Index and Summary components so that they no longer
need those components to exist. (And can use Partitions and
Rows components instead).
If there's no metadata file with sharding metadata,
the owning shards of an sstable are computed based on the partition key
range within the sstable.
This range is set in `set_first_and_last_keys()`, which
(since another commit in this commit series) reads it
either from the Summary component or from the footer of the Partitions
component, whichever is available.
But in some code paths `set_first_and_last_keys()` is called
before the footer of Partitions is loaded. If the sstable
doesn't have a Summary, only Partitions, then
`set_first_and_last_keys()` will fail. To prevent that,
in those cases we have to open the file and read its footer
early, before the `set_first_and_last_keys()` calls.
Note: the changes in this commit shouldn't matter during
normal operation, in which a Scylla component with sharding
metadata is available. But they might matter when
old and/or incomplete sstables are read.
`sstable::set_first_and_last_keys` currently takes the first and last
key from the Summary component. But if only BTI indexes are used,
this component will be nonexistent. In this case, we can use the first
and last keys written in the footer of Partitions.db.
For efficiency, the cardinality of the bloom filter
(i.e. the number of partition keys which will be written into the sstable)
has to be known before elements are inserted into the filter.
In some cases (e.g. memtables flush) this number is known exactly.
But in others (e.g. repair) it can only be estimated,
and the estimation might be very wrong, leading to an oversized filter.
Because of that, some time ago we added a piece of logic
(ran after the sstable is written, but before it's sealed)
which looks at the actual number of written partitions,
compares it to the initial estimate (on which the size of the bloom
filter was based), and if the difference is unacceptably large,
it rewrites the bloom filter from partition keys contained in Index.db.
But the idea to rebuild the bloom filters from index files
isn't going to work with BTI indexes, because they don't store
whole partition keys. If we want sstables which don't have Index.db
files, we need some other way to deal with oversized filters.
Partition keys can be recovered from Data.db,
but that would often be way too expensive.
This patch adds another way. We introduce a new component file,
TemporaryHashes. This component, if written at all,
contains the 16-byte murmur hash for every partition key, in order,
and can be used in place of Index to reconstruct the bloom filter.
(Our bloom filters are actually built from the set of murmur hashes of
partition keys. The first step of inserting a partition key into a
filter is hashing the key. Remembering the hashes is sufficient
to build the filter later, without looking at partition keys again.)
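As a toy sketch (hypothetical filter layout, much smaller than the real 16-byte murmur hashes): the filter can be rebuilt from the saved hashes alone, without ever revisiting the keys.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Toy bloom filter keyed purely by hashes; the real one uses the
// 16-byte murmur hash of each partition key.
struct toy_filter {
    std::bitset<1024> bits;
    void add_hash(uint64_t h) {
        bits.set(h % 1024);
        bits.set((h >> 13) % 1024);
    }
    bool maybe_present(uint64_t h) const {
        return bits.test(h % 1024) && bits.test((h >> 13) % 1024);
    }
};

// Second pass: once the true partition count is known, allocate the
// optimally-sized filter and replay the hashes read back from
// TemporaryHashes.
toy_filter rebuild_filter(const std::vector<uint64_t>& saved_hashes) {
    toy_filter f; // in reality, sized from saved_hashes.size()
    for (auto h : saved_hashes) {
        f.add_hash(h);
    }
    return f;
}
```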
As of this patch, if the Index component is not being written,
we don't allocate and populate a bloom filter during the Data.db write.
Instead, we write the murmur hashes to TemporaryHashes, and only
later, after the Data write finishes, we allocate the
optimal-size bloom filter, read the hashes back from TemporaryHashes,
and populate the filter with them.
That is suboptimal.
Writing the hashes to disk (or worse, to S3) and reading
them back is more expensive than building the bloom filter
during the main Data pass.
So ideally it should be avoided in cases where we know
in advance that the partition key count estimate is good enough.
(Which should be the case in flushes and compactions).
But we defer that to a future patch.
(Such a change would involve passing a flag to the sstable writer
indicating whether the cardinality estimate is trustworthy,
and not creating TemporaryHashes when it is).
In one of the next patches, we will want to use (in BTI partition
index writer) the same hash as used by the bloom filter,
and we'll also want to allow rebuilding the filter in a second
pass (after the whole sstable is written) from hashes (as opposed
to rebuilding from partition keys saved in Index.db, which is
something we sometimes do today) saved to a temporary file.
For those, we need an interface that allows us to compute the hash
externally, and only pass the hash to `add()`.
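A minimal sketch of such an interface (toy filter, hypothetical names): the old key-taking `add()` stays and simply delegates to the new hash-taking entry point.

```cpp
#include <cstdint>
#include <functional>
#include <string>

struct filter {
    uint64_t bits = 0; // toy 64-bit filter

    // Stand-in for the murmur hash used by the real filter.
    static uint64_t hash(const std::string& key) {
        return std::hash<std::string>{}(key);
    }

    // New interface: the caller computed the hash externally
    // (e.g. the BTI partition index writer already has it).
    void add_hash(uint64_t h) { bits |= uint64_t(1) << (h % 64); }

    // Old interface, kept for convenience: hashes internally.
    void add(const std::string& key) { add_hash(hash(key)); }

    bool maybe_present(const std::string& key) const {
        return bits & (uint64_t(1) << (hash(key) % 64));
    }
};
```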
Before this patch, `estimated_keys_for_range` assumes the presence
of the Summary component. But we want to make this component optional
in this series.
This patch adds a second branch to this function, for sstables
which don't have a BIG index (in particular, no Summary component),
but do have a BTI index (Partitions component).
In this case, instead of calculating the estimate as
"fraction of summary overlapping with given range,
multiplied by the total key estimate", we calculate
it as "fraction of Data file overlapping with given range,
multiplied by the total key estimate".
(With an extra conditional for the special case when the given range
doesn't overlap with the sstable's range at all. In this case, if the
ranges are adjacent, the main path could easily return "1 partition"
instead of "0 partitions", due to the inexactness of BTI indexes for
range queries. Returning something non-zero in this case would
be unfortunate, so the extra conditional makes sure that
we return 0).
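The BTI branch can be sketched as follows (hypothetical signature; offset positions stand in for Data.db offsets):

```cpp
#include <algorithm>
#include <cstdint>

// Estimate the number of keys in [range_start, range_end) as the
// fraction of the Data file the range covers, scaled by the total
// key estimate. Hypothetical free-function form of the new branch.
uint64_t estimated_keys_for_range_bti(uint64_t data_start, uint64_t data_end,
                                      uint64_t range_start, uint64_t range_end,
                                      uint64_t total_key_estimate) {
    // Special case: a range that doesn't overlap the sstable at all
    // must return exactly 0, not a rounded-up "1 partition".
    if (range_end <= data_start || range_start >= data_end) {
        return 0;
    }
    uint64_t lo = std::max(data_start, range_start);
    uint64_t hi = std::min(data_end, range_end);
    double fraction = double(hi - lo) / double(data_end - data_start);
    return uint64_t(fraction * double(total_key_estimate));
}
```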
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying this fraction by the total number of keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.
In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.
(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
`sstable::get_estimated_key_count()` estimates the partition count from the
size of Summary, and the interval between Summary entries.
But we want to allow writing sstables without a Summary
(i.e. sstables that use BTI indexes instead of BIG indexes),
so we want a way to get the key count without involving Summary.
For that, we can use the `estimated_partition_size` histogram in
Statistics. By counting the histogram entries, we get the exact
number of partitions in the sstable.
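A sketch of the computation (hypothetical bucket layout): summing the per-bucket counts of the size histogram yields the partition count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical shape of an estimated_partition_size histogram entry:
// each bucket counts the partitions whose on-disk size fell into it.
struct histogram_bucket {
    uint64_t size_limit; // upper bound of the size bucket
    uint64_t count;      // number of partitions in this bucket
};

// Every partition lands in exactly one bucket, so the sum of counts
// is the number of partitions in the sstable -- no Summary needed.
uint64_t partition_count(const std::vector<histogram_bucket>& h) {
    uint64_t total = 0;
    for (const auto& b : h) {
        total += b.count;
    }
    return total;
}
```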
Add a function which computes an estimated number of partitions
in the given token range. We will use this helper in a later patch
to replace a few places in the code which de facto do the same
thing "manually".
A BTI index isn't able to determine if a given key is present in
the sstable, because it doesn't store full keys.
(It only stores prefixes of decorated keys, so it might give false positives).
If the sstable only has BTI index, and no BIG index, then
`sstable::has_partition_key()` will have to be implemented with
something other than just the index reader.
We might as well ignore the index in all cases and just check
whether a regular data read for the given partition returns a non-empty result.
`sstable::has_partition_key` is only used in the
`column_family/sstables/by_key` REST API call that nobody
uses anyway, so there is no point in making special optimizations for it.
This patch teaches `sstable::make_index_reader` how to create
a BTI index reader, from the `Partitions.db` and `Rows.db`
components, if they exist (in which case they are opened by this point).
In the previous patch we added code responsible
for creating and opening Partitions.db and Rows.db,
but we left those files empty.
In this patch, we populate the files using
`trie::bti_row_index_writer` and `trie::bti_partition_index_writer`.
Note: for the row index, we insert the same clustering blocks to
both indexes. The logic for choosing the size of the blocks
hasn't been changed in any way.
Much of this patch has to do with propagating the current range
tombstone down to all places which can start a new clustering block.
The reason we need that is that, for each clustering block,
BIG indexes store the range tombstone succeeding the block
(i.e. the range tombstone in between the given block and its successor),
while BTI indexes store the range tombstone preceding the block
(i.e. the range tombstone in between the given block and its predecessor).
So before the patch there's no code which looks at the current tombstone
when *starting* the block, only when *ending* the block.
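The difference can be sketched like this (toy types, hypothetical names): both writers record the tombstone active at a block boundary, but at opposite ends of the block, so the current tombstone must now be known when a block starts.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct tombstone_t { int64_t timestamp; };

struct index_writer {
    // The currently open range tombstone, propagated down by this patch.
    std::optional<tombstone_t> current_rt;
    std::vector<std::optional<tombstone_t>> bti_preceding_rt;
    std::vector<std::optional<tombstone_t>> big_succeeding_rt;

    void set_range_tombstone(std::optional<tombstone_t> rt) { current_rt = rt; }

    void start_block() {
        // BTI: capture the tombstone between this block and its
        // predecessor -- needs current_rt at block *start* (new).
        bti_preceding_rt.push_back(current_rt);
    }
    void end_block() {
        // BIG: capture the tombstone between this block and its
        // successor -- current_rt at block *end* (pre-existing logic).
        big_succeeding_rt.push_back(current_rt);
    }
};
```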
This patch adds an extra copy for each `decorated_key`.
This is mostly unavoidable -- the BTI partition writer just
has to remember the key until its successor appears, to find the
common prefix. (We could avoid the key copy if the BTI isn't used, though.
We don't do that in this patch, we just let the copy happen).
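The reason for the copy, in one line of toy code (hypothetical helper; real keys are byte strings, not std::string): finding the common prefix requires both neighbors at once.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// The BTI partition writer must hold a copy of each key until its
// successor arrives, because the trie structure depends on the
// longest common prefix between consecutive keys.
size_t common_prefix_len(const std::string& prev, const std::string& next) {
    size_t n = std::min(prev.size(), next.size());
    size_t i = 0;
    while (i < n && prev[i] == next[i]) {
        ++i;
    }
    return i;
}
```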
This patch adds code responsible for creation and opening
of BTI index components (Rows.db, Partitions.db) when
BTI index writing is enabled.
(It is enabled if the cluster feature is enabled and the relevant
config entry permits it).
The files are empty for now, and are never read.
We will populate and use them in following patches.
BTI indexes are made up of Partitions.db and Rows.db files.
In this patch we introduce the corresponding component types.
In Cassandra, BTI is a separate "sstable format", with a new set
of versions. (I.e. `bti-da`, as opposed to `big-me`).
In this patch series, we are doing something different:
we are introducing version `ms`, which is like `me`, except with
`Index.db` and `Summary.db` replaced with `Partitions.db` and `Rows.db`.
With a setup like that, Scylla won't yet be able to read Cassandra's
BTI (`da`) files, because this patch doesn't teach Scylla
about `da`.
(But the path to that is open; it would just require first
implementing several other things which changed between `me` and `da`).
(And, naturally, Cassandra will reject `ms` sstables.
But this isn't the first time we are breaking file
compatibility with Cassandra to some degree.
Other examples include encryption and dictionary compression).
Note: Partitions.db and Rows.db contain prefixes of keys,
which is sensitive information, so they have to be encrypted.
There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test)
which tests the value of `_stats.capped_tombstone_deletion_time`.
Before this patch, for "ms" sstables, `to_deletion_time` would
be called twice for each written partition tombstone, which would fail
the test.
Since `_pi_write_m.partition_tombstone` always ends up being
converted from `tombstone` to `sstables::deletion_time` anyway,
let's just make it a `sstables::deletion_time` to begin with.
This ensures that `to_deletion_time` is called only once
per partition tombstone.
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.
Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.
Closes scylladb/scylladb#26297
The code in `multishard_mutation_query.cc` implements the replica side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`; the code has implemented both data and mutation queries for a long time now.
Code cleanup, no backport required.
Closes scylladb/scylladb#26279
* github.com:scylladb/scylladb:
test/boost: rename multishard_mutation_query_test to multishard_query_test
replica/multishard_query: move code into namespace replica
replica/multishard_query.cc: update logger name
docs/paged-queries.md: update references to readers
root,replica: move multishard_mutation_query to replica/
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes #25195.
Closes scylladb/scylladb#26003
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
Apparently the group0 server object dies (and is freed) during drain/shutdown, and I didn't take that into account in my https://github.com/scylladb/scylladb/pull/23025, which still attempts to use it afterwards.
The patch fixes two problems.
The problem with `is_raft_leader` has been observed in tests.
The problem with `publish_new_sstable_dict` has not been observed, but AFAIU (based on code inspection) it exists. I didn't attempt to prove its existence with a test.
Should be backported to 2025.3.
Closes scylladb/scylladb#25115
* github.com:scylladb/scylladb:
storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source
storage_service: hold group0 gate in `publish_new_sstable_dict`
An offline, scylla-sstable variant of the nodetool upgradesstables command.
It applies the latest (or selected) sstable version and the latest schema.
Closes scylladb/scylladb#26109
This method was once implemented by calling table::for_all_partitions(), which was supposed to be the non-slow version. Then callers of the "non-slow" method were updated and the method itself was renamed into the "_slow()" one. Nowadays only one test still uses it.
At the same time the method itself mostly consists of boilerplate code that moves bits around to call a lambda on the partitions read from a reader. Open-coding the method into the calling test results in much shorter and simpler-to-follow code.
Code cleanup, no backport needed
Closes scylladb/scylladb#26283
* github.com:scylladb/scylladb:
test: Fix indentation after previous patch
test: Opencode for_all_partitions_slow()
test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda
table: Move for_all_partitions_slow() to test
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
References: VECTOR-187
No backport is needed as this is a new feature.
Closes scylladb/scylladb#26205
* github.com:scylladb/scylladb:
vector_store_client: Add support for load balancing
vector_store_client_test: Introduce vs_mock_server
vector_store_client_test: Relocate to a dedicated directory
The method is a large piece of boilerplate that moves stuff around to do
a simple thing -- read mutations from a reader in a row and "check" them
with a lambda, optionally breaking the loop if the lambda wants it.
The whole thing is much shorter if the caller drives the reader on its own.
One thing to note -- the reader is not closed if something throws in
between, but this is test code anyway: if something throws, the test
fails, and an unclosed reader is not a big deal.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
Introduce the `vs_mock_server` test class, which is capable of counting
incoming requests. This will be used in subsequent tests to verify
load balancing logic.
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`compression_parameters` provides two levels of validation:
* syntactic checks - implemented in the constructor
* semantic checks - implemented by `compression_parameters::validate()`
The former are applied implicitly when parsing the options from the
command line or from scylla.yaml. The latter are currently not applied,
but they should.
In lack of a better place, apply them in main, right after joining the
cluster, to make sure that the cluster features have been negotiated.
The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation
will fail if the feature is disabled and a dictionary compression
algorithm has been selected.
Also, mark `validate()` as const so that it can be called from a config
object.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.
Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.
Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
It belongs there, it is a completely replica-side thing. Also take the
opportunity to rename it to multishard_query.{hh,cc}, it is not just
mutation anymore (data query is also implemented).
Most of its then-chains are quite hairy and look much nicer as coroutines.
Last patch restores indentation.
Code cleanup, no backport required.
Closes scylladb/scylladb#26271
* github.com:scylladb/scylladb:
snitch: Reindent after previous changes
snitch: Make periodic_reader_callback() a coroutine
snitch: Coroutinize pause_io()
snitch: Coroutinize stop()
snitch: Coroutinize reload_configuration()
snitch: Coroutinize read_property_file()
snitch: Coroutinize start()
snitch: Coroutinize property_file_was_modified()
Sstables store a basic schema in the statistics component. The scylla-sstable tool uses this to be able to read and dump sstables in a self-contained manner, without requiring an external schema source.
The problem is that the schema stored in the statistics component is incomplete: it doesn't store column names for key columns, so these have placeholder names in dump outputs where column names are visible.
This is not a disaster but it is confusing and it can cause errors in scripts which want to check the content of sstables, while also knowing the schema and expecting the proper names for key columns.
To make sstables truly self-contained w.r.t. the schema, add a complete schema to the scylla component. This schema contains the names and types of all columns, as well as some basic information about the schema: keyspace name, table name, id and version.
When available, scylla-sstable's schema loader will use this new, more complete schema, and fall back to the old method of loading the (incomplete) schema from the statistics component otherwise.
New feature, no backport required.
Closes scylladb/scylladb#24187
* github.com:scylladb/scylladb:
test/boost/schema_loader_test: add specific test with interesting types
test/lib/random_schema: add random_schema(schema_ptr) constructor
test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test
tools/schema_loader: add support for loading from scylla-metadata
tools/schema_loader: extract code which loads schema from statistics
sstables: scylla_metadata: add schema member