`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.
This commits adds encoding-agnostic functions which drive the
the `trie::node_reader` in a loop to traverse the whole branch.
Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.
Closesscylladb/scylladb#25396
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* expiry treshold. Expiry treshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry treshold value. Meaning that any write originating
from after this point in time, was generated at a time when such
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
raft: enforce odd number of voters in group0
Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and odd numbers of voters is preferred because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least three groups).
Fixes: scylladb/scylladb#23266
No backport: This is a new change that is to be only deployed in the new version, so it will not be backported.
Closesscylladb/scylladb#25332
* https://github.com/scylladb/scylladb:
raft: enforce odd number of voters in group0
test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
Currently, when a container or smart pointer holds a const payload
type, utils::clear_gently does not detect the object's clear_gently
method as the method is non-const and requires a mutable object,
as in the following example in class tablet_metadata:
```
using tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>;
using table_to_tablet_map = std::unordered_map<table_id, tablet_map_ptr>;
```
That said, when a container is cleared gently the elements it holds
are destroyed anyhow, so we'd like to allow to clear them gently before
destruction.
This change still doesn't allow directly calling utils::clear_gently
an const objects.
And respective unit tests.
Fixes#24605
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Unlike clear_gently of SharedPtr, clear_gently of a
`foreign_ptr<shared_ptr<T>>` calls clear_gently on the contained object
even if it's still shared and may still be in use.
This change examines the foreign shared pointer's use_count
and calls clear_gently on the shard object only when
its use_count reaches 1.
Fixes#25026
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
This test currently uses gc_grace_seconds=0. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
This test will soon need to be changed to use tombstone-gc=repair. This
cannot work as of now, as the test uses a single-node cluster.
The options are the following:
* Make it use more than one nodes
* Make repair work with single node clusters
* Rewrite in C++ where repair can be done synthetically
We chose the last option, it is the simplest one both in terms of code
and runtime footprint.
The new test is in test/boost/row_cache_test.cc
Two changes were done during the migration
* Change the name to
test_populating_reader_tombstone_gc_with_data_in_memtable
to better express which cache component this test is targetting;
* Use NullCompactionStrategy on the table instead of disabling
auto-compaction.
These tests currently use tombstone-gc=immediate. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
Implement odd number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and odd numbers of voters is preferred
because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to
commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least
three groups).
Fixes: scylladb/scylladb#23266
It is not needed anymore.
With that database::_sstable_generation_generator can
be a regular member rather than optional and initialized
later.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not needed anymore as we always generate
uuid generations.
Convert sstable_directory_test_table_simple_empty_directory_scan
to use the newly added empty() method instead of
checking the highest generation seen.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It is not needed anymore as we always generate
uuid generations.
Move highest_generation_seen(sharded<sstables::sstable_directory>& directory)
to sstables/sstable_directory module.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Wired the unrepaired, repairing and repaired views into compaction_group.
Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.
Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.
Additionally, we made changes changes to compaction_group_view
to return only sstables that belong to the underlying view.
From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.
The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistest with global metadata, creating
complexity
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.
Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The separatation of sstables into the logical repaired and unrepaired
virtual sets, requires some adjustments for certain tests, in particular
for those that look at number of compaction tasks or number of sstables.
The following tests need adjustment:
* test/cluster/tasks/test_tablet_tasks.py
* test/boost/memtable_test.cc
The adjustments are done in such a way that they accomodate both the
case where there is separate repaired/unrepaired states and when there
isn't.
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closesscylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
This is the next part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25154
Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here.
The new code added here is not used for anything yet, but it's posted as a separate PR
to keep things reviewably small.
This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes.
It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR)
into the on-disk format, and the logic for traversing the on-disk nodes during a read.
New functionality, no backporting needed.
Closesscylladb/scylladb#25317
* github.com:scylladb/scylladb:
sstables/trie: add tests for BTI node serialization and traversal
sstables/trie: implement BTI node traversal
sstables/trie: implement BTI serialization
utils/cached_file: add get_shared_page()
utils/cached_file: replace a std::pair with a named struct
Test that we can load sstables with mixed, numerical and uuid
generation types, and verify the expected data.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The KMIP tests start a local PyKMIP server and configure it to write
logs in the test's temporary directory (`tmpdir`). However, the tmpdir
is a RAII object that deletes the directory once it goes out of scope,
causing PyKMIP server logs to be lost on test failures.
To assist with debugging, preserve the whole directory if the test
failed with an exception. Allow the user to disable this by setting the
SCYLLA_TEST_PRESERVE_TMP_ON_EXCEPTION environment variable.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could, since the plan is to get rid of it
once all system keyspaces areconverted to local or tablets replication
(and propagated everywhere if needed using raft group0)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to calculate_vnode_effective_replication_map since
it is specific to vnode-based range calculations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to static_effective_replication_map_ptr, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
to get_static_effective_replication_map, in preparation
for separating local_effective_replication_map from
vnode_effective_replication_map (both are per-keyspace).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API.
This endpoint allows dropping all quarantined SSTables either globally or
for a specific keyspace and tables.
Optional query parameters `keyspace` and `tables` (comma-separated table names) can be
provided to limit the scope of the operation.
Fixesscylladb/scylladb#19061
Backport is not required, it is new functionality
Closesscylladb/scylladb#25063
* github.com:scylladb/scylladb:
docs: Add documentation for the nodetool dropquarantinedsstables command
nodetool: add command for dropping quarantine sstables
rest_api: add endpoint which drops all quarantined sstables
The vector_store_client_test was observed to be flaky, sometimes hanging while waiting for a response from HTTP server.
Problem:
The default load balancing algorithm (in Seastar's posix_server_socket_impl::accept) could route an incoming connection to a different shard than the one executing the test.
Because the HTTP server is a non-sharded service running only on the test's originating shard, any connection submitted to another shard would never be handled, causing the test client to hang waiting for response.
Solution:
The patch resolves the issue by explicitly setting fixed cpu load balancing algorithm.
This ensures that incoming connections are always handled on the same shard where the HTTP server is running.
Closesscylladb/scylladb#25314
Adds tests which check that nodes serialized by `bti_node_sink`
are readable by `bti_node_reader` with the right result.
(Note: there are no tests which check compatibility of the encoded nodes
with Cassandra or with handwritten hexdumps. There are only tests
for mutual compatibility between Scylla's writers and readers.
This can be considered a gap in testing.)
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.
To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does.
Fixes: https://github.com/scylladb/scylladb/issues/25044
Should be backported to 2025.3 since we need this fix for the restore
Closesscylladb/scylladb#24961
* github.com:scylladb/scylladb:
s3_creds: code cleanup
s3_creds: Make `reload` unconditional
s3_creds: Add test exposing credentials renewal issue
Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field.
Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator *connections*, we will list the currently active *requests*.
The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients.
Fixes#24993
This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request.
Closesscylladb/scylladb#25178
* github.com:scylladb/scylladb:
test/cqlpy: slightly strengthen test for system.clients
generic_server: use utils::scoped_item_list
docs/alternator: document the system.clients system table in Alternator
alternator: add test for Alternator clients in system.clients
alternator: list active Alternator requests in system.clients
utils: unit test for utils::scoped_item_list
utils: add a scoped_item_list utility class
utils: add "fatal" version of utils::on_internal_error()
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but we introduced separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during a `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than us.
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
Refs scylladb/scylladb#19191
New functionality, no backporting needed.
Closesscylladb/scylladb#25154
* github.com:scylladb/scylladb:
sstables: introduce trie_writer
utils/bit_cast: add object_representation()
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.
This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).
Fix this by configuring the server to use an in-memory SQLite.
Fixes#24842.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#24995
The previous test introduced a new utility class, utils::scoped_item_list.
This patch adds a comprehensive unit test for the new class.
We test basic usage of scoped_item_list, its size() and empty() methods,
how items are removed from the list when their handle goes out of scope,
how a handle's move constructor works, how items can be read and written
through their handles, and finally that removing an item during a
for_each_gently() iteration doesn't break the iteration.
One thing I still didn't figure out how to properly test is how removing
an item during *multiple* iterations that run concurrently fixes
multiple iterators. I believe the code is correct there (we just have a
list of ongoing iterations - instead of just one), but haven't found
yet a way to reproduce this situation in a test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but we introduced separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during a `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than us.
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
Fixes#22106
Moves the shared compress components to sstables, and rename to
match class type.
Adjust includes, removing redundant/unneeded ones where possible.
Closesscylladb/scylladb#25103
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.
Fixes: scylladb/scylladb#24963
Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).
Closesscylladb/scylladb#25188
* github.com:scylladb/scylladb:
test: sl: verify that legacy auth is not queried in sl to raft upgrade
qos: don't populate effective service level cache until auth is migrated to raft
Sprinkle constexpr where needed to make the default constructor,
move constructor, and destructor constexpr.
Add a test to verify.
This is needed to make a thread_local variable containing an
empty managed_bytes constinit, reducing thread-local guards.
When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message.
This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression.
This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream.
Tests are added to make sure the message split works.
Fixes#24808Closesscylladb/scylladb#25002
* github.com:scylladb/scylladb:
repair: Avoid too many fragments in a single repair_row_on_wire
repair: Change partition_key_and_mutation_fragments to use chunked_vector
utils: Allow chunked_vector::erase to work with non-default-constructible type
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.
Fixes: scylladb/scylladb#24963
With the change in "repair: Avoid too many fragments in a single
repair_row_on_wire", the
std::list<frozen_mutation_fragment> _mfs;
in partition_key_and_mutation_fragments will not contain large number of
fragments any more. Switch to use chunked_vector.