Partitions.db uses a piece of the murmur hash of the partition key
internally. The same hash is used to query the bloom filter.
So to avoid computing the hash twice (which involves converting the
key into a hashable linearized form) it would make sense to use
the same `hashed_key` for both purposes.
This is what we do in this patch. We extract the computation
of the `hashed_key` from `make_pk_filter` up to its parent
`sstable_set_impl::create_single_key_sstable_reader`,
and we pass this hash down both to `make_pk_filter` and
to the sstable reader. (And we add a pointer to the `hashed_key`
as a parameter to all functions along the way, to propagate it).
The number of parameters to `mx::make_reader` is getting uncomfortable.
Maybe they should be packed into some structs.
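The hash-reuse idea can be sketched outside Scylla's codebase (a minimal Python illustration; `compute_hash`, `bloom_filter_maybe_contains`, and `index_lookup` are hypothetical stand-ins, not the actual API):

```python
# Hypothetical sketch: compute the (expensive) hash of the partition key
# once, then thread it through both consumers instead of recomputing it.

def compute_hash(key):
    # Stand-in for the murmur hash of the linearized partition key.
    return hash(key)

def bloom_filter_maybe_contains(hashed_key, bits):
    # The bloom filter derives its bit positions from the same hash.
    return (hashed_key % 1024) in bits

def index_lookup(hashed_key, index):
    # Partitions.db keys its lookup on a piece of the same hash too.
    return index.get(hashed_key % 1024)

def read_partition(key, bits, index):
    hashed_key = compute_hash(key)          # computed once here...
    if not bloom_filter_maybe_contains(hashed_key, bits):
        return None                         # fast negative path
    return index_lookup(hashed_key, index)  # ...and reused here
```

This mirrors the shape of the patch: the hash computation moves up to the caller (`create_single_key_sstable_reader` in the real code) and the result is passed down to both the filter and the reader.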
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db reader computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is usually also computed
for bloom filter purposes just before that.
So in this patch we let the caller pass that hash instead.
The old index interface, without the hash, is kept for convenience.
In this patch we only add a new interface, we don't switch the callers
to it yet. That will happen in the next commit.
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.
That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
`sstable::set_first_and_last_keys` currently takes the first and last
key from the Summary component. But if only BTI indexes are used,
this component will be nonexistent. In this case, we can use the first
and last keys written in the footer of Partitions.db.
Currently, `sstable::estimated_keys_for_range` works by
checking what fraction of Summary is covered by the given
range, and multiplying that fraction by the total number of keys.
Since computing things on Summary doesn't involve I/O (because Summary
is always kept in RAM), this is synchronous.
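The synchronous estimate boils down to simple proportion arithmetic; a rough sketch (hypothetical names, with the Summary modeled as a sorted in-RAM sample of keys):

```python
import bisect

def estimated_keys_for_range(summary_keys, total_keys, start, end):
    """Estimate how many keys fall in [start, end].

    summary_keys: sorted sample of keys kept in RAM (models the Summary).
    total_keys:   total number of keys in the sstable.
    """
    if not summary_keys:
        return 0
    # Fraction of summary entries covered by the range...
    lo = bisect.bisect_left(summary_keys, start)
    hi = bisect.bisect_right(summary_keys, end)
    fraction = (hi - lo) / len(summary_keys)
    # ...scaled to the total key count.
    return int(fraction * total_keys)
```

With a 10-entry summary for an sstable of 1000 keys, a range covering half of the summary entries yields an estimate of 500.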
In a later patch, we will modify `sstable::estimated_keys_for_range`
so that it can deal with sstables that don't have a Summary
(because they use BTI indexes instead of BIG indexes).
In that case, the function is going to compute the relevant fraction
by using the index instead of Summary. This will require making
the function asynchronous. This is what we do in this patch.
(The actual change to the logic of `sstable::estimated_keys_for_range`
will come in the next patch. In this one, we only make it asynchronous).
The code in `multishard_mutation_query.cc` implements the replica side of range scans, and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`; the code has implemented both data and mutation queries for a long time now.
Code cleanup, no backport required.
Closes scylladb/scylladb#26279
* github.com:scylladb/scylladb:
test/boost: rename multishard_mutation_query_test to multishard_query_test
replica/multishard_query: move code into namespace replica
replica/multishard_query.cc: update logger name
docs/paged-queries.md: update references to readers
root,replica: move multishard_mutation_query to replica/
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes #25195.
Closes scylladb/scylladb#26003
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
An offline, scylla-sstable variant of the nodetool upgradesstables command.
Applies the latest (or selected) sstable version and the latest schema.
Closes scylladb/scylladb#26109
This method was once implemented by calling table::for_all_partitions(), which was supposed to be the non-slow version. Then the callers of the "non-slow" method were updated and the method itself was renamed to the "_slow()" one. Nowadays only one test still uses it.
At the same time, the method itself mostly consists of boilerplate code that moves bits around to call a lambda on the partitions read from the reader. Open-coding the method into the calling test results in much shorter and simpler-to-follow code.
Code cleanup, no backport needed
Closes scylladb/scylladb#26283
* github.com:scylladb/scylladb:
test: Fix indentation after previous patch
test: Opencode for_all_partitions_slow()
test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda
table: Move for_all_partitions_slow() to test
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
References: VECTOR-187
No backport is needed as this is a new feature.
Closes scylladb/scylladb#26205
* github.com:scylladb/scylladb:
vector_store_client: Add support for load balancing
vector_store_client_test: Introduce vs_mock_server
vector_store_client_test: Relocate to a dedicated directory
The method is a large piece of boilerplate that moves stuff around to
do a simple thing -- read mutations from the reader one by one and
"check" them with a lambda, optionally breaking the loop if the lambda
wants it.
The whole thing is much shorter if the caller drives the reader itself.
One thing to note -- the reader is not closed if something throws in
between, but this is test code anyway: if something throws, the test
fails, and a not-closed reader is not a big deal.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change introduces a load balancing mechanism for the vector store client.
The client can now distribute requests across multiple vector store nodes.
The distribution mechanism performs random selection of nodes for each request.
Introduce the `vs_mock_server` test class, which is capable of counting
incoming requests. This will be used in subsequent tests to verify
load balancing logic.
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
It belongs there; it is a completely replica-side thing. Also take the
opportunity to rename it to multishard_query.{hh,cc}; it is not just
mutation queries anymore (the data query is also implemented).
Sstables store a basic schema in the statistics component. The scylla-sstable tool uses this to be able to read and dump sstables in a self-contained manner, without requiring an external schema source.
The problem is that the schema stored in the statistics component is incomplete: it doesn't store column names for key columns, so these have placeholder names in dump outputs where column names are visible.
This is not a disaster but it is confusing and it can cause errors in scripts which want to check the content of sstables, while also knowing the schema and expecting the proper names for key columns.
To make sstables truly self-contained w.r.t. the schema, add a complete schema to the scylla component. This schema contains the names and types of all columns, as well as some basic information about the schema: keyspace name, table name, id and version.
When available, scylla-sstable's schema loader will use this new, more complete schema, and fall back to the old method of loading the (incomplete) schema from the statistics component otherwise.
New feature, no backport required.
Closes scylladb/scylladb#24187
* github.com:scylladb/scylladb:
test/boost/schema_loader_test: add specific test with interesting types
test/lib/random_schema: add random_schema(schema_ptr) constructor
test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test
tools/schema_loader: add support for loading from scylla-metadata
tools/schema_loader: extract code which load schema from statistics
sstables: scylla_metadata: add schema member
1. Remove dumping cluster logs and print only the link to the log.
2. Fail the test (to fail CI and not ignore the problem) and mark the cluster as dirty (to avoid affecting subsequent tests) in case setup/teardown fails.
3. Add 2 cqlpy tests that fail after applying step 2 to the dirties_cluster list so the cluster is discarded afterward.
Closes scylladb/scylladb#26183
There's code that tries to accumulate a counter across a sharded
service by hand. Using map_reduce0() looks nicer and avoids the
SMP-safe atomic counter.
Also -- coroutinize the thing while at it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26259
The namespace usage in this directory is very inconsistent, with files and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
In some cases, all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too.
This patch, although large, is mechanical, and only the following kinds of changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace context
This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.
Code cleanup, no backport.
Closes scylladb/scylladb#26214
* github.com:scylladb/scylladb:
compaction: remove using namespace {compaction,sstables}
compaction: move code to namespace compaction
The `vector_store_client_test` is moved from `test/boost` to a new
`test/vector_search` directory.
This change prepares a dedicated location for all upcoming tests
related to the vector search feature.
Some files in compaction/ have using namespace {compaction,sstables}
clauses, some even in headers. This is considered bad practice and
muddies the namespace use. Remove them.
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables
In some cases, all three are used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.
This patch, although large, is mechanical, and only the following kinds of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
context
This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
The test starts a 3-node cluster and immediately creates a big file
on the first node in order to trigger the out-of-space prevention to
disable compaction, including SPLIT compaction.
In order to trigger a SPLIT compaction, a keyspace with 1 initial tablet
is created followed by alter statement with `tablets = {'min_tablet_count': 2}`.
This triggers a resize decision that should not finalize due to
disabled compaction on the first node.
The test is flaky because the keyspace is created with RF=1 and there
is no guarantee that the tablet replica will be located on the first
node with critical disk utilization. If it isn't, the split is
finalized and the test fails, because it expects the split to be
blocked.
Change to RF=3. This ensures there is exactly one tablet replica on
each node, including the one with critical disk utilization. So SPLIT
is blocked until the disk utilization on the first node drops below
the critical level.
Fixes: https://github.com/scylladb/scylladb/issues/25861
Closes scylladb/scylladb#26225
Before this patch, if an ARN that is passed to Alternator requests
like TagResource is well-formatted but points to non-existent table,
Alternator returns the unhelpful error:
(AccessDeniedException) when calling the TagResource operation:
Incorrect resource identifier
This patch modifies this error to be:
(ResourceNotFoundException) when calling the TagResource operation:
ResourceArn 'arn:scylla:alternator:alternator_alternator_Test_
1758532308880:scylla:table/alternator_Test_1758532308880x' not found
This is the same error type (ResourceNotFoundException) that DynamoDB
returns in that case - and a more helpful error message.
This patch also includes a regression test that checks the error
type in this case. The new test fails on Alternator before this
patch, and passes afterwards (and also passes on DynamoDB).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26179
The test now tests loading the schema from the scylla component by
default. Force testing the fall-back (read schema from statistics) by
deleting the Scylla.db component.
Also improve the test by comparing the column names and types, to check
that when loaded from the scylla component, the key names are also
correct.
The commitlog in the tests with big mutations was corrupted by overwriting 10 chunks of 1KB with random data, which might not be enough due to randomness and the big size of the commitlog (~65MB).
- change `corrupt_file` to overwrite chunks based on a percentage of the file's size instead of a fixed number of chunks
- fix typos
- cleanup comments for clarity
Closes: #25627
Closes scylladb/scylladb#25979
* github.com:scylladb/scylladb:
test: cleanup big mutation commitlog tests
test: fix test_one_big_mutation_corrupted_on_startup
Accessing this member directly is deprecated; migrate code to use {get,set}_query_param() and friends instead.
Fixes: https://github.com/scylladb/scylladb/issues/26023
Preparation for seastar update, no backport required.
Closes scylladb/scylladb#26024
* github.com:scylladb/scylladb:
treewide: move away from accessing httpd::request::query_parameters
test/pylib/s3_server_mock.py: better handle empty query params
Due to a missing functionality in PythonTest, `unshare` is never used
to mount volumes. As a consequence:
+ volumes are created with sudo which is undesired
+ they are not cleared automatically
Even with the missing support in place, the approach of mounting
volumes with `unshare` would not work, as the http server, the pool of
clusters, and the scylla cluster manager are started outside of the
new namespace. Thus the cluster would have no access to volumes
created with `unshare`.
The new approach, which works with and without dbuild and does not
require sudo, uses the following three commands to mount a volume:
truncate -s 100M /tmp/mydevice.img
mkfs.ext4 /tmp/mydevice.img
fuse2fs /tmp/mydevice.img test/
Additionally, a proper cleanup is performed, i.e. servers are stopped
gracefully and volumes are unmounted after the tests using them are
completed.
Fixes: https://github.com/scylladb/scylladb/issues/25906
Closes scylladb/scylladb#26065
The test starts a 3-node cluster and immediately creates a big file
on one of the nodes, to trigger the out of space prevention to start
rejecting writes on this node. Then a write is executed, and it is
checked that it did not reach the node with critical disk utilization
but did reach the remaining nodes (it should, since RF=3 is set).
However, when not specified, a default LOCAL_ONE consistency level
is used. This means that only one node is required to acknowledge the
write.
After the write, the test checks that the write
+ did NOT reach the node with critical disk utilization (works)
+ did reach the remaining nodes
This can cause the test to fail sporadically, as the write might not
yet be on the last node.
Use CL=QUORUM instead.
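For reference, QUORUM requires acknowledgements from a majority of replicas; the majority computation is a one-liner (a generic sketch, not Scylla's implementation):

```python
def quorum(replication_factor):
    # Majority of replicas: floor(RF/2) + 1.
    return replication_factor // 2 + 1
```

With RF=3 this gives 2: since the node with critical disk utilization rejects the write, both remaining nodes must acknowledge it before the write returns, so the test's check on those nodes can no longer race with the write (LOCAL_ONE returned after a single ack).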
Fixes: https://github.com/scylladb/scylladb/issues/26004
Closes scylladb/scylladb#26030
vector_store_client: Add support for multiple IPs in DNS responses
The DNS resolution logic now processes all IP addresses returned in a DNS
response, not just the primary one.
The client will iterate through the list of resolved IPs, attempting to
query the next one if a request fails. This improves high availability
by allowing the client to query other available nodes if one is down.
References: VECTOR-187
As this is a new feature no backport is needed.
Closes scylladb/scylladb#26055
* github.com:scylladb/scylladb:
vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES
vector_store_client: Format with clang-format
vector_store_client: Add support for multiple IPs in DNS responses
vector_store_client_test: Extract `make_vs_server` helper function
vector_store_client_test: Ensure cleanup on exception
vector_store_client_test: Fix unreliable unavailable port tests
Consider the following:
The tablet load balancer is working on:
- node1: an empty node (no tablets) with a large disk capacity
- node2: an empty node (no tablets) with a lower disk capacity than node1
- node3: is being decommissioned and contains tablet replicas
In load_balancer::make_internode_plan() the initial destination
node/shard is selected like this:
// Pick best target shard.
auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)};
load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which
adds the host to load_sketch's internal hash maps in case the node was
not yet seen by load_sketch.
Let's assume dst is a shard on node1.
Later in load_balancer::make_internode_plan() we will call
pick_candidate() to try to find a better destination node than the
initial one:
// May choose a different source shard than src.shard or different destination host/shard than dst.
auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst,
drain_skipped);
auto source_tablets = candidate.tablets;
src = candidate.src;
dst = candidate.dst;
If pick_candidate() selects some other empty destination node (due to
larger capacity: node1), and that node has not yet been seen by
load_sketch (because it was empty), a subsequent call to
load_sketch::pick() will search for the node using
std::unordered_map::at(), and because the node is not found it will
throw a std::out_of_range exception, crashing the load balancer.
This problem is fixed by changing load_sketch::populate() to initialize
its internal maps with all the nodes which populate()'s arguments
filter for.
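The crash and the fix can be paraphrased in a short sketch (Python stands in for the C++ here; the class below illustrates the described behavior and is not the real `load_sketch`):

```python
class LoadSketch:
    def __init__(self, all_nodes, prepopulate):
        # After the fix, populate() seeds every candidate node up front,
        # so pick() never sees an unknown host.
        self._shards = {}
        if prepopulate:
            for node in all_nodes:
                self._ensure_node(node)

    def _ensure_node(self, node):
        self._shards.setdefault(node, 0)

    def get_least_loaded_shard(self, node):
        self._ensure_node(node)      # side effect: registers the node
        return self._shards[node]

    def pick(self, node):
        # Mirrors std::unordered_map::at(): raises if node is unknown.
        if node not in self._shards:
            raise KeyError(node)     # the out-of-range crash
        return self._shards[node]
```

Before the fix, a node first chosen by pick_candidate() never passed through ensure_node(), so pick() hit the missing-key path; seeding the maps during populate() removes that window.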
Fixes: #26203
Closes scylladb/scylladb#26207
This PR refactors the can_vote function in the Raft code for improved clarity and maintainability by providing safer strong boolean types to the Raft algorithm.
Fixes: #21937
Backport: No backport required
Closes scylladb/scylladb#25787
This PR improves the handling of malformed SSTables during scrub and adds tests to validate the updated behavior.
When scrub is used, there is an increased chance of encountering malformed SSTables. These should not be retried as in regular compaction. Instead, they must be handled according to the selected scrub mode: in skip mode, on malformed_sstable_exception, the invalid data or the whole SSTable should be removed; in abort and segregate modes, the scrub process should abort.
Previously, all modes treated malformed_sstable_exception the same way, causing scrub to abort even when skip mode was selected. This PR updates the scrub logic to properly handle malformed SSTable exceptions based on the selected mode.
Unit tests are added to verify the intended behavior.
Fixes scylladb/scylladb#19059
Backport is not required; it is an improvement.
Closes scylladb/scylladb#25828
* github.com:scylladb/scylladb:
sstable_compaction_test: add scrub tests for malformed SSTables
scrub: skip sstable on malformed sstable exception in skip mode
As requested in #22104, moved the files and fixed other includes and build system.
Moved files:
- combine.hh
- collection_mutation.hh
- collection_mutation.cc
- converting_mutation_partition_applier.hh
- converting_mutation_partition_applier.cc
- counters.hh
- counters.cc
- timestamp.hh
Fixes: #22104
This is a cleanup, no need to backport
Closes scylladb/scylladb#25085
Instead of re-inventing empty param handling, use the built-in
keep_blank_values=True param of urllib.parse.parse_qs().
It correctly handles the case where the `=` is present but no value
follows it; this is the syntax used by the new query_params in
seastar::http::request.
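For illustration, the difference keep_blank_values makes (standard library behavior, easy to verify):

```python
from urllib.parse import parse_qs

# By default, parameters with blank values are dropped entirely.
print(parse_qs("uploads=&key=abc"))
# -> {'key': ['abc']}

# With keep_blank_values=True the empty parameter survives, matching
# the `uploads=` style produced by the new query_params handling.
print(parse_qs("uploads=&key=abc", keep_blank_values=True))
# -> {'uploads': [''], 'key': ['abc']}
```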
Also add an exception to build_POST_response(). Better than a cryptic
message about encode() not being callable on NoneType.