It seems that tests in test/boost/combined_tests have to define a test
suite name, otherwise they aren't picked up by test.py.
Fixes #24199
Closes scylladb/scylladb#24200
This pull request adds support for creating custom indexes (at a metadata level) as long as a supported custom class is provided (currently only vector search).
The patch contains:
- a change in CREATE INDEX statement that allows for the USING keyword to be present as long as one of the supported classes is used
- support for describing custom indexes in the DESCRIBE statement
- unit tests
Co-authored-by: @Balwancia
Closes scylladb/scylladb#23720
* github.com:scylladb/scylladb:
test/cqlpy: add custom index tests
index: support storing metadata for custom indices
chunked_vector is a replacement for std::vector that avoids large contiguous
allocations.
In this series, we add some missing modifiers and improve quality-of-life for
chunked_vector users (the static_assert patch).
Those modifiers were generally unused since they have O(n) complexity
and are therefore not useful for hot paths, but they are used in some
control plane code on vectors which we'd like to replace with chunked_vectors.
A candidate for such a replacement is token_range_vector (see #3335).
This is a prerequisite for fixing some minor stalls; I don't expect we'll backport
fixes to those stalls.
Closes scylladb/scylladb#24162
* github.com:scylladb/scylladb:
utils: chunked_vector: add swap() method
utils: chunked_vector: add range insert() overloads
utils: chunked_vector: relax static_assert
utils: chunked_vector: implement erase() for single elements and ranges
utils: chunked_vector: implement insert() for single-element inserts
Inserts an iterator range at some position.
Again we insert the range at the end and use std::rotate() to
move the newly inserted elements into place, forgoing possible
optimizations.
Unit tests are added.
Implement using std::rotate() and resize(). The elements to be erased
are rotated to the end, then resized out of existence.
Again we defer optimization for trivially copyable types.
Unit tests are added.
Needed for range_streamer with token_ranges using chunked_vector.
partition_range_compat's unwrap() needs insert if we are to
use it for chunked_vector (which we do).
Implement using push_back() and std::rotate().
emplace(iterator, args) is also implemented, though the benefit
is diluted (it will be moved after construction).
The implementation isn't optimal - if T is trivially copyable
then using std::memmove() will be much faster than std::rotate(),
but this complex optimization is left for later.
Unit tests are added.
Added function returning custom index class name.
Added printing custom index class name when using DESCRIBE.
Changed validation to reflect current support of indices.
Apparently `test_kms_network_error` will succeed under any circumstances, since most of our exceptions derive from `std::exception`: whatever happens in the test, it will throw for some reason, and the test will be marked as passed.
Start catching the exact exception that we expect to be thrown.
Closes scylladb/scylladb#24065
Refs scylladb/scylla-enterprise#5321
Adds two small test cases, for slight variations on KMIP host config
being missing when rebooting a node, and table/sstable resolution
failing due to this.
Mainly to verify that we fail as expected, without crashing.
Closes scylladb/scylladb#23544
Following a number of similar code cleanup PRs, this one aims to be the last, definitively dropping `flat` from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.
Code cleanup, no backport required.
Closes scylladb/scylladb#24087
* github.com:scylladb/scylladb:
test/boost/mutation_reader_another_test: drop v2 from reader and related names
test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
test/boost/mutation_test: s/consumer_v2/consumer/
test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
readers/mutation_readers: s/generating_reader_v2/generating_reader/
readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
readers/mutation_source: s/make_reader_v2/make_mutation_reader/
readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
mutation/mutation_compactor: drop v2 from compactor and related names
replica/table: s/make_reader_v2/make_mutation_reader/
mutation_writer: s/bucket_writer_v2/bucket_writer/
readers/queue: drop v2 from reader and related names
readers/multishard: drop v2 from reader and related names
readers/evictable: drop v2 from reader and related names
readers/multi_range: remove flat from name
The test is failing in CI sometimes due to performance reasons.
There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
reclaim, the `background_reclaim` scheduling group isn't actually
guaranteed to get any CPU, regardless of shares. If the process is
switched out inside the `background_reclaim` group, it might
accumulate so much vruntime that it won't get any more CPU again
for a long time.
We have seen both.
This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.
So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test). After the patch,
we only check that the background reclaim happens *eventually*.
Fixes https://github.com/scylladb/scylladb/issues/15677
Backporting this is optional. The test is flaky even in stable branches, but the failure is rare.
Closes scylladb/scylladb#24030
* github.com:scylladb/scylladb:
logalloc_test: don't test performance in test `background_reclaim`
logalloc: make background_reclaimer::free_memory_threshold publicly visible
compress: distribute compression dictionaries over shards
We don't want each shard to have its own copy of each dictionary.
It would put unnecessary pressure on cache and memory.
Instead, we want to share dictionaries between shards.
Before this commit, all dictionaries live on shard 0.
All other shards borrow foreign shared pointers from shard 0.
There's a problem with this setup: dictionary blobs receive many random
accesses. If shard 0 is on a remote NUMA node, this could pose
a performance problem.
Therefore, for each dictionary, we would like to have one copy per NUMA node,
not one copy per the entire machine. And each shard should use the copy
belonging to its own NUMA node. This is the main goal of this patch.
There is another issue with putting all dicts on shard 0: it eats
an asymmetric amount of memory from shard 0.
This commit spreads the ownership of dicts over all shards within
the NUMA group, to make the situation more symmetric.
(Dict owner is decided based on the hash of dict contents).
It should be noted that the last part isn't necessarily a good thing,
though.
While it makes the situation more symmetric within each node,
it makes it less symmetric across the cluster, if different node
sizes are present.
If dicts occupy 1% of memory on each shard of a 100-shard node,
then the same dicts would occupy 100% of memory on a 1-shard node.
So for the sake of cluster-wide symmetry, we might later want to consider
e.g. making the memory limit for dictionaries inversely proportional
to the number of shards.
New functionality, added to a feature which isn't in any stable branch yet. No backporting.
Closes scylladb/scylladb#23590
* github.com:scylladb/scylladb:
test: add test/boost/sstable_compressor_factory_test
compress: add some test-only APIs
compress: rename sstable_compressor_factory_impl to dictionary_holder
compress: fix indentation
compress: remove sstable_compressor_factory_impl::_owner_shard
compress: distribute compression dictionaries over shards
test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version
test: remove sstables::test_env::do_with()
The test checks that merging the partition versions on-the-fly using the
cursor gives the same results as merging them destructively with apply_monotonically.
In particular, it tests that the continuity of both results is equal.
However, there's a subtlety which makes this not true.
The cursor puts empty dummy rows (i.e. dummies shadowed by the partition
tombstone) in the output.
But the destructive merge is allowed (as an exception to the general
rule, for optimization reasons) to remove those dummies and thus reduce
the continuity.
So after this patch we instead check that the output of the cursor
has continuity equal to the merged continuities of the versions.
(Rather than to the continuity of merged versions, which can be
smaller as described above).
Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did
the same in a different test.
Fixes https://github.com/scylladb/scylladb/issues/13642
Closes scylladb/scylladb#24044
In next patches, make_sstable_compressor_factory() will have to
disappear.
In preparation for that, we switch to a seastar::thread-dependent
replacement.
`sstable_manager` depends on `sstable_compressor_factory&`.
Currently, `test_env` obtains an implementation of this
interface with the synchronous `make_sstable_compressor_factory()`.
But after this patch, the only implementation of that interface
`sstable_compressor_factory&` will use `sharded<...>`,
so its construction will become asynchronous,
and the synchronous `make_sstable_compressor_factory()` must disappear.
There are several possible ways to deal with this, but I think the
easiest one is to write an asynchronous replacement for
`make_sstable_compressor_factory()`
that will keep the same signature but will be only usable
in a `seastar::thread`.
All other uses of `make_sstable_compressor_factory()` outside of
`test_env::do_with()` are already in seastar threads,
so if we just get rid of `test_env::do_with()`, then we will
be able to use that thread-dependent replacement. This is the
purpose of this commit.
We shouldn't be losing much.
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable across tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.
A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionally to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.
Fixes: #23762
Closes scylladb/scylladb#23984
The test is failing in CI sometimes due to performance reasons.
There are at least two problems:
1. The initial 500ms (wall time) sleep might be too short. If the reclaimer
doesn't manage to evict enough memory during this time, the test will fail.
2. During the 100ms (thread CPU time) window given by the test to background
reclaim, the `background_reclaim` scheduling group isn't actually
guaranteed to get any CPU, regardless of shares. If the process is
switched out inside the `background_reclaim` group, it might
accumulate so much vruntime that it won't get any more CPU again
for a long time.
We have seen both.
This kind of timing test can't be run reliably on overcommitted machines
without modifying the Seastar scheduler to support that (by e.g. using
thread clock instead of wall time clock in the scheduler), and that would
require an amount of effort disproportionate to the value of the test.
So for now, to unflake the test, this patch removes the performance test
part. (And the tradeoff is a weakening of the test).
This PR contains changes that do not add new functionality, and have small refactoring of the existing code.
The most significant change is the refactoring of resource gathering, so that it no longer creates another cgroup to put itself in. There will be no nested redundant 'initial' groups, e.g. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial`
This is part two of splitting the original PR.
This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894, as suggested by a reviewer in https://github.com/scylladb/scylladb/pull/22894#pullrequestreview-2778582278.
Closes scylladb/scylladb#23882
* github.com:scylladb/scylladb:
test.py: add awareness of extra_scylla_cmdline_options
test.py: increase timeout for C++ tests in pytest
test.py: switch method of finding the root repo directory
test.py: move get_combined_tests to the correct facade
test.py: add common directory for reports
test.py: add the possibility to provide additional env vars
test.py: move setup cgroups to the generic method
test.py: refactor resource_gather.py
Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack.
Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election.
To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal.
This change ensures a more stable voter distribution and reduces unnecessary voter reassignments.
The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations.
Fixes: scylladb/scylladb#23950
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
No backport: The limited voters feature is currently only present in master.
Closes scylladb/scylladb#23888
* https://github.com/scylladb/scylladb:
raft: ensure topology coordinator retains votership
raft: retain existing voters across data centers and racks
raft: refactor limited voters calculator to prioritize nodes
raft: replace pointer with reference for non-null output parameter
raft: reduce code duplication in group0 voter handler
raft: unify and optimize datacenter and rack info creation
This test has multiple problems:
* it has 3 nested loops to run different scenarios, but ignores the loop variable of 2 of them, running with hardcoded settings instead
* it initializes misses and lookups to 0 at the start of each scenario, which throws off the per-page increment checks when the previous scenario already moved these metrics and they don't start from 0; this causes the test to sometimes fail
* it has a duplicate check of drops == 0 (just cosmetic)
Fix all three problems, the second is especially important because it made the test flaky.
Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them.
Fixes: #16794
Test fix, not normally a backport candidate; we can backport to 2025.1 if the test becomes too unstable there.
Closes scylladb/scylladb#23783
* github.com:scylladb/scylladb:
test/boost/multishard_mutation_query_test: ensure test runs with vnodes
test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits
The limited voters feature did not account for the existing topology
coordinator (Raft leader) when selecting voters to be removed.
As a result, the limited voters calculator could inadvertently remove
the votership of the current topology coordinator, triggering
an unnecessary Raft leader re-election.
This change ensures that the existing topology coordinator's votership
status is preserved unless absolutely necessary. When choosing between
otherwise equivalent voters, the node other than the topology coordinator
is prioritized for removal. This helps maintain stability in the cluster
by avoiding unnecessary leader re-elections.
Additionally, only the alive leader node is considered relevant for this
logic. A dead existing leader (topology coordinator) is excluded from
consideration, as it is already in the process of losing leadership.
Fixes: scylladb/scylladb#23588
Fixes: scylladb/scylladb#23786
Fix an issue in the voter calculator where existing voters were not
retained across data centers and racks in certain scenarios. This
occurred when voters were distributed across more data centers and racks
than the maximum allowed number of voters.
Previously, the prioritization logic for data centers and racks did not
consider the number of existing assigned voters. It only prioritized
nodes within a single data center or rack, which could result in
unnecessary reassignment of voters.
Improved the prioritization logic to account for the number of existing
voters in each data center and rack.
This change ensures a more stable voter distribution and reduces
unnecessary voter reassignments.
Fixes: scylladb/scylladb#23950
All tests in this suite use the default "ks" keyspace from cql_test_env.
This keyspace has tablet support and at any time we might decide to make
it use tablets by default. This would make all these tests use the
tablet path in multishard_mutation_query.cc. These tests were created to
test the vastly more complex vnodes code path in said file. The tablet
path is much simpler and is only used by SELECT * FROM
MUTATION_FRAGMENTS(), which has its own correctness tests.
So explicitly create a vnodes keyspace and use it in all the tests to
restore the test functionality.
This test has multiple problems:
* it has 3 nested loops to run different scenarios, but ignores the
loop variable of 2 of them, running with hardcoded settings instead
* it initializes misses and lookups to 0 at the start of each scenario,
which throws off the per-page increment checks when the previous
scenario already moved these metrics and they don't start from 0;
this causes the test to sometimes fail
* it has a duplicate check of drops == 0 (just cosmetic)
Fix all three problems, the second is especially important because it
made the test flaky.
Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of the) intervals, since each such entry is present on every interval it overlaps with.
A trigger we observed was a memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range.
Since we cannot rely on insertion order, the solution is to store sstables with such wide ranges in a vector (unleveled).
There should be no consequence for single-key reads, since the upper layer applies additional filtering based on the token of the key being queried.
And for range scans, there can be an increase in memory usage, but not a significant one, because these sstables span a wide range and would have been selected by the combined reader anyway if the scan range overlaps with them.
Anyway, this is protection against a storm of memtable flushes, which shouldn't be the common scenario.
It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly.
Fixes #23634.
We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so.
Closes scylladb/scylladb#23806
* github.com:scylladb/scylladb:
test: Verify partitioned set store split and unsplit correctly
sstables: Fix quadratic space complexity in partitioned_sstable_set
compaction: Wire table_state into make_sstable_set()
compaction: Introduce token_range() to table_state
dht: Add overlap_ratio() for token range
It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697); now it can be put on a stricter diet.
Closes scylladb/scylladb#23815
* github.com:scylladb/scylladb:
test: Remove sstable_assertions::get_stats_metadata()
test: Add sstable_assertions::operator->()
Interval map is very susceptible to quadratic space behavior when
it's flooded with many entries overlapping all (or most of the)
intervals, since each such entry is present on every interval it
overlaps with.
A trigger we observed was a memtable flush storm, which creates many
small "L0" sstables that span roughly the entire token range.
Since we cannot rely on insertion order, the solution is to store
sstables with such wide ranges in a vector (unleveled).
There should be no consequence for single-key reads, since the upper
layer applies additional filtering based on the token of the key
being queried.
And for range scans, there can be an increase in memory usage,
but not a significant one, because these sstables span a wide range
and would have been selected by the combined reader anyway if the
scan range overlaps with them.
Anyway, this is protection against a storm of memtable flushes,
which shouldn't be the common scenario.
It works both with tablets and vnodes, by adjusting the token
range spanned by compaction group accordingly.
Fixes #23634.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This provides a way for compaction layer to know compaction group's
token range. It will be important for sstable set impl to know
the token range of underlying group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Due to the changes in creating schemas with base info, the
test_schema_is_recovered_after_dying test seems to be flaky when checking
that the schema is actually lost after 'grace_period'. We don't
actually guarantee that the schema will be lost at that exact
moment so there's no reason to test this. To remove the flakiness,
we remove the check and the related sleep, which should also slightly
improve the speed of this test.