Extend the `sstable_format` config enum with a "ms" value,
and, if it's enabled (in the config and in cluster features),
use it for new sstables on the node.
(Before this commit, writing `ms` sstables should only be possible
in unit tests, via internal APIs. After this commit, the format
can be enabled in the config and the database will write it during
normal operation).
As of this commit, the new format is not the default yet.
(But it will become the default in a later commit in the same series).
Add `ms` to tests which already test many format versions.
The tests check that sstable files in newer verisons are
the same as in `mc`.
Arbitrarily, for `ms`, we only check the files common
between `mc` and `ms`.
If we want to extend this test more, so that it checks
that `Partitions.db` and `Rows.db` don't change over time,
we have to add `ms` versions of all the sstables under
`test/resources` which are used in this test. We won't do that
in this patch series. And I'm not sure if we want to do that at all.
Block monotonicity checks can't be implemented for BTI row indexes
because they don't store full clustering positions, only some encoded
prefixes.
The emptiness check could be implemented with some effort,
but we currently don't bother.
The two tests which use this `is_empty()` method aren't very
useful anyway. (They check that the promoted index is empty when
there are no clustering keys. That doesn't really need a dedicated
test).
The test currently implicitly uses the default sstable format.
But it assumes that the index reader type is `sstables::index_reader`,
and it wants some methods specific to that type (and absent from
the base `abstract_index_reader`).
If we switch the default format from `me` to `ms`,
without doing something about this,
this test will start failing on the downcast to `sstables::index_reader`.
We deal with this by explicitly specifying `me`.
`me` and `ms` data readers are identical.
And this is a test of the data reader, not the index reader.
So it's perfectly fine to just use `me`.
This is an old test for some workaround for incorrectly-generated
promoted indexes. It doesn't make sense to port this test to newer
sstable formats. So just skip it for the new sstable versions.
There are some tests which want sstables of all format versions
in `test/resource`. This tests adds `ms` files for those tests.
I didn't think much about this change, I just mechanically
generated the `ms` from the existing `me` sstables in the same directories
(using `scylla sstable upgrade`) for the tests which were complaining
about the lack of `ms` files.
Introduce `ms` -- a new sstable format version which
is a hybrid of Cassandra's `me` and `da`.
It is based on `me`, but with the index components
(Summary.db and Index.db) replaced with the index
components of `da` (Partitions.db and Rows.db).
As of this patch, the version is never chosen
anywhere for writing sstables yet. It is only introduced.
We will add it to unit tests in a later commit,
and expose it to users in yet later commit.
Later in this patch series we will introduce `ms` as the new highest
format, but we won't be able to make it the default within the same
series due to some dtest incompatibilities.
Until `ms` is the default, we don't `scylla sstable` to default to
it, even though it's the highest. Let's choose the default
version in `scylla sstable` using the same method which is
used by Scylla in general: by letting the `sstable_manager` choose.
Partitions.db uses a piece of the murmur hash of the partition key
internally. The same hash is used to query the bloom filter.
So to avoid computing the hash twice (which involves converting the
key into a hashable linearized form) it would make sense to use
the same `hashed_key` for both purposes.
This is what we do in this patch. We extract the computation
of the `hashed_key` from `make_pk_filter` up to its parent
`sstable_set_impl::create_single_key_sstable_reader`,
and we pass this hash down both to `make_pk_filter` and
to the sstable reader. (And we add a pointer to the `hashed_key`
as a parameter to all functions along the way, to propagate it).
The number of parameters to `mx::make_reader` is getting uncomfortable.
Maybe they should be packed into some structs.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.
So in this patch we let the caller pass that hash instead.
The old index interface, without the hash, is kept for convenience.
In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
In previous patches we (hopefully) modified all users of
Index and Summary components so that they don't longer
need those components to exist. (And can use Partitions and
Rows components instead).
If there's no metadata file with sharding metadata,
the owning shards of an sstable are computed based on the partition key
range within the sstable.
This range is set in `set_first_and_last_keys()`, which
(since another commit in this commit series) reads it
either from the Summary component or from the footer of the Partitions
component, whichever is available.
But in some code paths `set_first_and_last_keys()` is called
before the footer of Partitions is loaded. If the sstable
doesn't have Summary, only Partitions, then the
`set_first_and_last_keys()` will fail. To prevent that,
in those cases we have to open the file and read its footer
early, before the `set_first_and_last_keys()` calls.
Note: the changes in this commit shouldn't matter during
normal operation, in which a Scylla component with sharding
metadata is available. But it might be used when
old and/or incomplete sstables are read.
`sstable::set_first_and_last_keys` currently takes the first and last
key from the Summary component. But if only BTI indexes are used,
this component will be nonexistent. In this case, we can use the first
and last keys written in the footer of Partitions.db.
For efficiency, the cardinality of the bloom filter
(i.e. the number of partition keys which will be written into the sstable)
has to be known before elements are inserted into the filter.
In some cases (e.g. memtables flush) this number is known exactly.
But in others (e.g. repair) it can only be estimated,
and the estimation might be very wrong, leading to an oversized filter.
Because of that, some time ago we added a piece of logic
(ran after the sstable is written, but before it's sealed)
which looks at the actual number of written partitions,
compares it to the initial estimate (on which the size of the bloom
filter was based on), and if the difference is unacceptably large,
it rewrites the bloom filter from partition keys contained in Index.db.
But the idea to rebuild the bloom filters from index files
isn't going to work with BTI indexes, because they don't store
whole partition keys. If we want sstables which don't have Index.db
files, we need some other way to deal with oversized filters.
Partition keys can be recovered from Data.db,
but that would often be way too expensive.
This patch adds another way. We introduce a new component file,
TemporaryHashes. This component, if written at all,
contains the 16-byte murmur hash for every partition key, in order,
and can be used in place of Index to reconstruct the bloom filter.
(Our bloom filters are actually built from the set of murmur hashes of
partition keys. The first step of inserting a partition key into a
filter is hashing the key. Remembering the hashes is sufficient
to build the filter later, without looking at partition keys again.)
As of this patch, if the Index component is not being written,
we don't allocate and populate a bloom filter during the Data.db write.
Instead, we write the murmur hashes to TemporaryHashes, and only
later, after the Data write finishes, we allocate the optimal-size,
bloom filter, we read the hashes back from TemporaryHashes,
and we populate the filter with them.
That is suboptimal.
Writing the hashes to disk (or worse, to S3) and reading
them back is more expensive than building the bloom filter
during the main Data pass.
So ideally it should be avoided in cases where we know
in advance that the partition key count estimate is good enough.
(Which should be the case in flushes and compactions).
But we defer that to a future patch.
(Such a change would involve passing some flag to the sstable writer
if the cardinality estimate is trustworthy, and not creating
TemporaryHashes if the estimate is trustworthy).
In one of the next patches, we will want to use (in BTI partition
index writer) the same hash as used by the bloom filter,
and we'll also want to allow rebuilding the filter in a second
pass (after the whole sstable is written) from hashes (as opposed
to rebuilding from partition keys saved in Index.db, which is
something we sometimes do today) saved to a temporary file.
For those, we need an interface that allows us to compute the hash
externally, and only pass the hash to `add()`.
Before this patch, `estimated_keys_for_range` assumes the presence
of the Summary component. But we want to make this component optional
in this series.
This patch adds a second branch to this function, for sstables
which don't have a BIG index (in particular, Summary component),
but have a BTI index (Partitions component).
In this case, instead of calculating the estimate as
"fraction of summary overlapping with given range,
multiplied by the total key estimate", we calculate
it as "fraction of Data file overlapping with given range,
multiplied by the total key estimate".
(With an extra conditional for the special case when the given range
doesn't overlap with the sstable's range at all. In this case, if the
ranges are adjacent, the main path could easily return "1 partition"
instead of "0 partitions", due to the inexactness of BTI indexes for
range queries. Returning something non-zero in this case would
be unfortunate, so the extra conditional makes sure that
we return 0).
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying this fraction to the number of all keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.
In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.
(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
`sstable::get_estimated_key_count()` estimates the partition count from the
size of Summary, and the interval between Summary entries.
But we want to allow writing sstables without a Summary
(i.e. sstables that use BTI indexes instead of BIG indexes),
so we want a way to get the key count without involving Summary.
For that, we can use the `estimated_partition_size` histogram in
Statistics. By counting the histogram entries, we get the exact
number of partitions in the sstable.
Add a function which computes an estimated number of partitions
in the given token range. We will use this helper in a later patch
to replace a few places in the code which de facto do the same
thing "manually".
A BTI index isn't able to determine if a given key is present in
the sstable, because it doesn't store full keys.
(It only stores prefixes of decorated keys, so it might give false positives).
If the sstable only has BTI index, and no BIG index, then
`sstable::has_partition_key()` will have to be implemented with
with something else than just the index reader.
We might as well ignore the index in any cases and just check
that a regular data read for the given partition returns a non-empty result.
`sstable::has_partition_key` is only used in the
`column_family/sstables/by_key` REST API call that nobody
uses anyway, no point in trying to make special optimizations for it.
This patch teaches `sstable::make_index_reader` how to create
a BTI index reader, from the the `Partitions.db` and `Rows.db`
components, if they exist (in which case they are opened by this point).
In the previous patch we added code responsible
for creating and opening Partitions.db and Rows.db,
but we left those files empty.
In this patch, we populate the files using
`trie::bti_row_index_writer` and `trie::bti_partition_index_writer`.
Note: for the row index, we insert the same clustering blocks to
both indexes. The logic for choosing the size of the blocks
hasn't been changed in any way.
Much of this patch has to do with propagating the current range
tombstone down to all places which can start a new clustering block.
The reason we need that is that, for each clustering block,
BIG indexes store the range tombstone succeeding the block
(i.e. the range tombstone in between the given block and its successor)
BTI indexes store the range tombstone preceding the block,
(i.e. the range tombstone in between the given block and its predecessor).
So before the patch there's no code which looks at the current tombstone
when *starting* the block, only when *ending* the block.
This patch adds an extra copy for each `decorated_key`.
This is mostly unavoidable -- the BTI partition writer just
has to remember the key until its successor appears, to find the
common prefix. (We could avoid the key copy if the BTI isn't used, though.
We don't do that in this patch, we just let the copy happen).
This patch adds code responsible for creation and opening
of BTI index components (Rows.db, Partitions.db) when
BTI index writing is enabled.
(It is enabled if the cluster feature is enabled and the relevant
config entry permits it).
The files are empty for now, and are never read.
We will populate and use them in following patches.
BTI indexes are made up of Partition.db and Rows.db files.
In this patch we introduce the corresponding component types.
In Cassandra, BTI is a separate "sstable format", with a new set
of versions. (I.e. `bti-da`, as opposed to `big-me`).
In this patch series, we are doing something different:
we are introducing version `ms`, which is like `me`, except with
`Index.db` and `Summary.db` replaced with `Partitions.db` and `Rows.db`.
With a setup like that, Scylla won't yet be able to read Cassandra's
BTI (`da`) files, because this patch doesn't teach Scylla
about `da`.
(But the way to that is open. It would just require first implementing
several other things which changed between `me` and `da`).
(And, naturally Cassandra will reject `ms` sstables.
But this isn't the first time we are breaking file
compatibility with Cassandra to some degree.
Other examples include encryption and dictionary compression).
Note: Partitions.db and Rows.db contain prefixes of keys,
which is sensitive information, so they have to be encrypted.
There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test)
which tests the value of `_stats.capped_tombstone_deletion_time`.
Before this patch, for "ms" sstables, `to_deletion_time` would have
be called twice for each written partition tombstone, which would fail
the test.
Since `_pi_write_m.partition_tombstone` always ends up being
converted from `tombstone` to `sstables::deletion_time` anyway,
let's just make it a `sstables::deletion_time` to begin with.
This will ensure that `to_deletion_time` will be able to be
only called once per partition tombstone.
When draining the view builder, we abort ongoing operations using the
view builder's abort source, which may cause them to fail with
abort_requested_exception or raft::request_aborted exceptions.
Since these failures are expected during shutdown, reduce the log level
in add_new_view from 'error' to 'debug' for these specific exceptions
while keeping 'error' level for unexpected failures.
Closesscylladb/scylladb#26297
The code in `multishard_mutation_query.cc` implements the replica-side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`, the code implements both data and mutation queries for a long time now.
Code cleanup, no backport required.
Closesscylladb/scylladb#26279
* github.com:scylladb/scylladb:
test/boost: rename multishard_mutation_query_test to multishard_query_test
replica/multishard_query: move code into namespace replica
replica/multishard_query.cc: update logger name
docs/paged-queries.md: update references to readers
root,replica: move multishard_mutation_query to replica/
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes#25195.
Closesscylladb/scylladb#26003
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
Apparently the group0 server object dies (and is freed) during drain/shutdown, and I didn't take that into account in my https://github.com/scylladb/scylladb/pull/23025, which still attempts to use it afterwards.
The patch fixes two problems.
The problem with `is_raft_leader` has been observed in tests.
The problems with `publish_new_sstable_dict` has not been observed, but AFAIU (based on code inspection) it exists. I didn't attempt to prove its existence with a test.
Should be backported to 2025.3.
Closesscylladb/scylladb#25115
* github.com:scylladb/scylladb:
storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source
storage_service: hold group0 gate in `publish_new_sstable_dict`
An offline, scylla-sstable variant of nodetool upgradesstables command.
Applies latest (or selected) sstable version and latest schema.
Closesscylladb/scylladb#26109
This method was once implemented by calling table::for_all_partitions(), which was supposed to be non-slow version. Then callers of "non-slow" method were updated and the method itself was renamed into "_slow()" one. Nowadays only one test still uses it.
At the same time the method itself mostly consists of a boilerplate code that moves bits around to call lambda on the partitions read from reader. Open-coding the method into the calling test results in much shorter and simpler to follow code.
Code cleanup, no backport needed
Closesscylladb/scylladb#26283
* github.com:scylladb/scylladb:
test: Fix indentation after previous patch
test: Opencode for_all_partitions_slow()
test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda
table: Move for_all_partitions_slow() to test
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
References: VECTOR-187
No backport is needed as this is a new feature.
Closesscylladb/scylladb#26205
* github.com:scylladb/scylladb:
vector_store_client: Add support for load balancing
vector_store_client_test: Introduce vs_mock_server
vector_store_client_test: Relocate to a dedicated directory
The method is a large boilerplate that moves stuff around to do simple
thing -- read mutations from reader in a row and "check" them with a
lambda, optionally breaking the loop if lambda wants it.
The whole thing is much shorter if the caller kicks reader itsown.
One thing to note -- reader is not closed if something throws in
between, but that's test anyway, if something throws, test fails and not
closed reader is not a big deal.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
Introduce the `vs_mock_server` test class, which is capable of counting
incoming requests. This will be used in subsequent tests to verify
load balancing logic.
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`compression_parameters` provides two levels of validation:
* syntactic checks - implemented in the constructor
* semantic checks - implemented by `compression_parameters::validate()`
The former are applied implicitly when parsing the options from the
command line or from scylla.yaml. The latter are currently not applied,
but they should.
In lack of a better place, apply them in main, right after joining the
cluster, to make sure that the cluster features have been negotiated.
The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation
will fail if the feature is disabled and a dictionary compression
algorithm has been selected.
Also, mark `validate()` as const so that it can be called from a config
object.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>