In commit 44a1daf we added the ability to read system tables through
the DynamoDB API (actually, the Scan and Query requests only).
This ability is useful for tests, and can also be useful to users who
want to read information that is only available through system tables.
This patch adds support also for *writing* into system tables. This will
be useful for Alternator tests, were we want to temporarily change
some live-updatable configuration option - and so far haven't been
able to do that like we did do in some cql-pytest tests.
For reasons explained in issue #23218, only superuser roles are allowed to
write to system tables - it is not enough for the role to be granted
MODIFY permissions on the system table or on ALL KEYSPACES. Moreover,
the ability to modify system tables carries special risks, so this
patch only allows writes to the system tables if a new configuration
option "alternator_allow_system_table_write" turned on. This option is
turned off by default.
This patch also includes a test for this new configuration-writing
capability. The test scripts test/alternator/run and test.py now
run Scylla with alternator_allow_system_table_write turned on, but
the new test can also run without this option, and will be skipped
in that case (to allow running the test suite against some manually-
run instance of Scylla).
Fixes: #12348
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Four tests had almost identical code to read an item from Scylla
configuration (using the system.config system table). It's time
to make this into a new utility function, scylla_config_read().
This is a good time to do it, because in a later patch I want
to also add a similar function to *write* into the configuration.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised.
The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands.
This commit changes the on_internal_error() into a warning log entry. It also outputs to keyspace/table names, and the offending replay positions which caused the check to fail.
This PR also adds a test which validates that TRUNCATE works correctly with concurrent writes. More specifically, it checks that:
- all data written before TRUNCATE starts is deleted
- none of the data after TRUNCATE completes is deleted
Fixes: #25173Fixes: #25013
Backport is needed in versions which check for truncate with concurrent writes using `on_internal_error()`: 2025.3 2025.2 2025.1
Closesscylladb/scylladb#25174
* github.com:scylladb/scylladb:
truncate: add test for truncate with concurrent writes
truncate: change check for write during truncate into a log warning
This PR introduces a refinement in how credential renewal is triggered. Previously, the system attempted to renew credentials one hour before their expiration, but the credentials provider did not recognize them as expired—resulting in a no-op renewal that returned existing credentials. This led the timer fiber to immediately retry renewal, causing a renewal storm.
To resolve this, we remove expiration (or any other checks) in `reload` method, assuming that whoever calls this method knows what he does.
Fixes: https://github.com/scylladb/scylladb/issues/25044
Should be backported to 2025.3 since we need this fix for the restore
Closesscylladb/scylladb#24961
* github.com:scylladb/scylladb:
s3_creds: code cleanup
s3_creds: Make `reload` unconditional
s3_creds: Add test exposing credentials renewal issue
Before this series, the "system.clients" virtual table lists active connections (and their various properties, like client address, logged in username and client version) only for CQL requests. This series adds also Alternator clients to system.clients. One of the interesting use cases of this new feature is understanding exactly which SDK a user is using -without inspecting their application code. Different SDKs pass different "User-Agent" headers in requests, and that User-Agent will be visible in the system.clients entries for Alternator requests as the "driver_name" field.
Unlike CQL where logged in username, driver name, etc. applies to a complete connection, in the Alternator API, different requests can theoretically be signed by different users and carry different headers but still arrive over the same HTTP connection. So instead of listing the currently open Alternator *connections*, we will list the currently active *requests*.
The first three patches introduce utilities that will be useful in the implementation. The fourth patch is the implementation itself (which is quite simple with the utility introduced in the second patch), and the fifth patch a regression test for the new feature. The sixth patch adds documentation, the seventh patch refactors generic_server to use the newly introduced utility class and reduce code duplication, and the eighth patch adds a small check to an existing check of CQL's system.clients.
Fixes#24993
This patch adds a new feature, so doesn't require a backport. Nevertheless, if we want it to get to existing customers more quickly to allow us to better understand their use case by reading the system.clients table, we may want to consider backporting this patch to existing branches. There is some risk involved in this patch, because it adds code that gets run on every Alternator request, so a bug on it can cause problems for every Alternator request.
Closesscylladb/scylladb#25178
* github.com:scylladb/scylladb:
test/cqlpy: slightly strengthen test for system.clients
generic_server: use utils::scoped_item_list
docs/alternator: document the system.clients system table in Alternator
alternator: add test for Alternator clients in system.clients
alternator: list active Alternator requests in system.clients
utils: unit test for utils::scoped_item_list
utils: add a scoped_item_list utility class
utils: add "fatal" version of utils::on_internal_error()
test_validate_truncate_with_concurrent_writes checks if truncate deletes
all the data written before the truncate starts, and does not delete any
data after truncate completes.
introduce tiering marks
1 “unstable” - For unstable tests that will be will continue runing every night and generate up-to-date statistics with failures without failing the “Main” verification path(scylla-ci, Next)
2 “nightly” - for tests that are quite old, stable, and test functionality that rather not be changed or affected by other features, are partially covered in other tests, verify non-critical functionality, have not found any issues or regressions, too long to run on every PR, and can be popped out from the CI run.
set 7 long tests(according to statistic in elastic) as nightly(theses 8 tests took 20% of CI run,
about 4 hours without paralelization)
1 test as unstable(as exaple ot marker usage)
Closesscylladb/scylladb#24974
The following steps are performed in sequence as part of the
Raft-based recovery procedure:
- set `recovery_leader` to the host ID of the recovery leader in
`scylla.yaml` on all live nodes,
- send the `SIGHUP` signal to all Scylla processes to reload the config,
- perform a rolling restart (with the recovery leader being restarted
first).
These steps are not intuitive and more complicated than they could be.
In this PR, we simplify these steps. From now on, we will be able to
simply set `recovery_leader` on each node just before restarting it.
Apart from making necessary changes in the code, we also update all
tests of the Raft-based recovery procedure and the user-facing
documentation.
Fixesscylladb/scylladb#25015
The Raft-based procedure was added in 2025.2. This PR makes the
procedure simpler and less error-prone, so it should be backported
to 2025.2 and 2025.3.
Closesscylladb/scylladb#25032
* github.com:scylladb/scylladb:
docs: document the option to set recovery_leader later
test: delay setting recovery_leader in the recovery procedure tests
gossip: add recovery_leader to gossip_digest_syn
db: system_keyspace: peers_table_read_fixup: remove rows with null host_id
db/config, gms/gossiper: change recovery_leader to UUID
db/config, utils: allow using UUID as a config option
Add a test demonstrating that renewing credentials does not update
their expiration. After requesting credentials again, the expiration
remains unchanged, indicating no actual update occurred.
Currently, `get_cas_shard` uses `sharder.shard_for_reads` to decide which shard to use for LWT execution—both on replicas and the coordinator.
If the coordinator is not a replica, `shard_for_reads` returns a default shard (shard 0). There are at least two problems with this:
* shard 0 can become overloaded, because all LWT coordinators-but-not-replacas are served on it.
* mismatch with replicas: the default shard doesn't match what `shard_for_reads` returns on replicas. This hinders the "same shard for client and server" RPC level optimization.
In this PR we change `get_cas_shard` to use a primary replica shard if the current node is not a replica. This guarantees that all LWT coordinators for the same tablet will be served on the same shard. This is important for LWT coordinator locks (`paxos::paxos_state::get_cas_lock`). Also, if all tablet replicas on different nodes live on the same shard, RPC optimization will make sure that no additional `smp::submit_to` will be needed on server side.
backport: not needed, since this fix applies only to LWT over tablets, and this feature is not released yet
Closesscylladb/scylladb#25224
* github.com:scylladb/scylladb:
test_tablets_lwt.py: make tests rf_rack_valid
test_tablets_lwt: add test_lwt_coordinator_shard
storage_proxy.cc: get_cas_shard: fallback to the primary replica shard
sharder: add try_get_shard_for_reads method
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but we introduced separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during a `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than us.
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
Refs scylladb/scylladb#19191
New functionality, no backporting needed.
Closesscylladb/scylladb#25154
* github.com:scylladb/scylladb:
sstables: introduce trie_writer
utils/bit_cast: add object_representation()
With current implementation if pytest will be killed, it will not be
able to write the stdout from the boost test. With a new way it should
be updated while test executing, instead of writing it the end of the
test.
Closesscylladb/scylladb#25260
The PyKMIP server uses an SQLite database to store artifacts such as
encryption keys. By default, SQLite performs a full journal and data
flush to disk on every CREATE TABLE operation. Each operation triggers
three fdatasync(2) calls. If we multiply this by 16, that is the number
of tables created by the server, we get a significant number of file
syncs, which can last for several seconds on slow machines.
This behavior has led to CI stability issues from KMIP unit tests where
the server failed to complete its schema creation within the 20-second
timeout (observed on spider9 and spider11).
Fix this by configuring the server to use an in-memory SQLite.
Fixes#24842.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#24995
This patch adds a reproducer for issue #16261, where it was reported
that when Alternator read-modify-write (using LWT) operations to the
same partition are sent to different nodes, sometimes the operation
fails immediately, with an InternalServerError claiming to be a "timeout",
although this happens almost immediately (after a few milliseconds),
not after any real timeout.
The test uses 3 nodes, and 3 threads which send RMW operations to different
items in the same partition, and usually (though not with 100% certainty)
it reaches the InternalServerError in around 100 writes by each thread.
This InternalServerError looks like:
Internal server error: exceptions::mutation_write_timeout_exception
(Operation timed out for alternator_alternator_Test_1719157066704.alternator_Test_1719157066704 - received only 1 responses from 2 CL=LOCAL_SERIAL.)
The test also prints how much time it took for the request to fail,
for example:
In incrementing 1,0 on node 1: error after 0.017074108123779297
This is 0.017 seconds - it's not the cas_contention_timeout_in_ms
timeout (1 second) or any other timeout.
If we enable trace logging, adding to topology_experimental_raft/suite.yaml
extra_scylla_cmdline_options: ["--logger-log-level", "paxos=trace"]
we get the following TRACE-level message in the log:
paxos - CAS[0] accept_proposal: proposal is partially rejected
This again shows the problem is "uncertainty" (partial rejection) and not
a timeout.
Refs #16261
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#19445
We already have a rather rudimentary test for system.clients listing CQL
connections. However, as written the test will pass if system.clients is
empty :-) So let's strengthen the test to verify that there must be at
least one CQL connection listed in system.clients. Indeed, the test runs
the "SELECT FROM system.clients" over one CQL connection, so surely that
connection must be present.
This patch doesn't strengthen this test in any other way - it still has
just one connection, not many, it still doesn't validate the values of
most of the columns, and it is still written to assume the Scylla server
is running on localhost and not running any other workload in parallel.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a regression test for the feature added in the previous patch,
i.e that the system.clients virtual table also lists ongoing Alternator request.
The new test reads the system.clients system table using an Alternator Scan
request, so it should see its own request - at least - in the result. It
verifies that it sees Alternator requests (at least one), and that these
requests have the expected fields set, and for a couple of fields, we
even know which value to expect (the "client_type" field is "alternator",
and the "ssl_enabled" field depends on whether the test is checking an
http:// or https:// URL (you can try both in test/alternator/run - by
using or not using the "--https" parameter).
The new test fails before the previous patch (because system.clients
will not list any Alternator connection), and passes after it.
As all tests in test_system_tables.py for Scylla-specific system tables,
this test is marked scylla_only and skipped when running on AWS DynamoDB.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous test introduced a new utility class, utils::scoped_item_list.
This patch adds a comprehensive unit test for the new class.
We test basic usage of scoped_item_list, its size() and empty() methods,
how items are removed from the list when their handle goes out of scope,
how a handle's move constructor works, how items can be read and written
through their handles, and finally that removing an item during a
for_each_gently() iteration doesn't break the iteration.
One thing I still didn't figure out how to properly test is how removing
an item during *multiple* iterations that run concurrently fixes
multiple iterators. I believe the code is correct there (we just have a
list of ongoing iterations - instead of just one), but haven't found
yet a way to reproduce this situation in a test.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This commits introduces an config option 'tablet_load_stats_refresh_interval_in_seconds'
that allows overriding the default value without using error injection.
Fixesscylladb/scylladb#24641Closesscylladb/scylladb#24746
This is the first part of a larger project meant to implement a trie-based
index format. (The same or almost the same as Cassandra's BTI).
As of this patch, the new code isn't used for anything yet,
but we introduced separately from its users to keep PRs small enough
for reviewability.
This commit introduces trie_writer, a class responsible for turning a
stream of (key, value) pairs (already sorted by key) into a stream of
serializable nodes, such that:
1. Each node lies entirely within one page (guaranteed).
2. Parents are located in the same page as their children (best-effort).
3. Padding (unused space) is minimized (best-effort).
It does mostly what you would expect a "sorted keys -> trie" builder to do.
The hard part is calculating the sizes of nodes (which, in a well-packed on-disk
format, depend on the exact offsets of the node from its children) and grouping
them into pages.
This implementation mostly follows Cassandra's design of the same thing.
There are some differences, though. Notable ones:
1. The writer operates on chains of characters, rather than single characters.
In Cassandra's implementation, the writer creates one node per character.
A single long key can be translated to thousands of nodes.
We create only one node per key. (Actually we split very long keys into
a few nodes, but that's arbitrary and beside the point).
For BTI's partition key index this doesn't matter.
Since it only stores a minimal unique prefix of each key,
and the trie is very balanced (due to token randomness),
the average number of new characters added per key is very close to 1 anyway.
(And the string-based logic might actually be a small pessimization, since
manipulating a 1-byte string might be costlier than manipulating a single byte).
But the row index might store arbitrarily long entries, and in that case the
character-based logic might result in catastrophically bad performance.
For reference: when writing a partition index, the total processing cost
of a single node in the trie_writer is on the order of 800 instructions.
Total processing cost of a single tiny partition during a `upgradesstables`
operation is on the order of 10000 instructions. A small INSERT is on the
order of 40000 instructions.
So processing a single 1000-character clustering key in the trie_writer
could cost as much as 20 INSERTs, which is scary. Even 100-character keys
can be very expensive. With extremely long keys like that, the string-based
logic is more than ~100x cheaper than character-based logic.
(Note that only *new* characters matter here. If two index entries share a
prefix, that prefix is only processed once. And the index is only populated
with the minimal prefix needed to distinguish neighbours. So in practice,
long chains might not happen often. But still, they are possible).
I don't know if it makes sense to care about this case, but I figured the
potential for problems is too big to ignore, so I switched to chain-based logic.
2. In the (assumed to be rare) case when a grouped subtree turns out to be bigger
than a full page after revising the estimate, Cassandra splits it in a
different way than us.
For testability, there is some separation between the logic responsible
for turning a stream of keys into a stream of nodes, and the logic
responsible for turning a stream of nodes into a stream of bytes.
This commit only includes the first part. It doesn't implement the target
on-disk format yet.
The serialization logic is passed to trie_writer via a template parameter.
There is only one test added in this commit, which attempts to be exhaustive,
by testing all possible datasets up to some size. The run time of the test
grows exponentially with the parameter size. I picked a set of parameters
which runs fast enough while still being expressive enough to cover all
the logic. (I checked the code coverage). But I also tested it with greater parameters
on my own machine (and with DEVELOPER_BUILD enabled, which adds extra sanitization).
Fixes#22106
Moves the shared compress components to sstables, and rename to
match class type.
Adjust includes, removing redundant/unneeded ones where possible.
Closesscylladb/scylladb#25103
Right now, service levels are migrated in one group0 command and auth is migrated in the next one. This has a bad effect on the group0 state reload logic - modifying service levels in group0 causes the effective service levels cache to be recalculated, and to do so we need to fetch information about all roles. If the reload happens after SL upgrade and before auth upgrade, the query for roles will be directed to the legacy auth tables in system_auth - and the query, being a potentially remote query, has a timeout. If the query times out, it will throw an exception which will break the group0 apply fiber and the node will need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module does not start populating and using the service level cache until both service levels and auth are migrated to raft. This is achieved by adding the check both to the cache population logic and the effective service level getter - they now look at service level's accessor new method, `can_use_effective_service_level_cache` which takes a look at the auth version.
Fixes: scylladb/scylladb#24963
Should be backported to all versions which support upgrade to topology over raft - the issue described here may put the cluster into a state which is difficult to get out of (group0 apply fiber can break on multiple nodes, which necessitates their restart).
Closesscylladb/scylladb#25188
* github.com:scylladb/scylladb:
test: sl: verify that legacy auth is not queried in sl to raft upgrade
qos: don't populate effective service level cache until auth is migrated to raft
Tests sometimes fail in ScyllaCluster.add_server on the
'replaced_srv.host_id' line because host_id is not resolved yet. In
this commit we introduce functions try_get_host_id and get_host_id
that resolve it when needed.
Closesscylladb/scylladb#25177
This PR implements solution proposed in scylladb/scylladb#24481
Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections.
The updated shutdown process is as follows:
1. Initial Shutdown Phase
* Close the accept gate to block new incoming connections.
* Abort all accept() calls.
* For all active connections:
* Close only the input side of the connection to prevent new requests.
* Keep the output side open to allow responses to be sent.
2. Drain Phase
* Wait for all in-progress requests to either complete or fail.
3. Final Shutdown Phase
* Fully close all connections.
Fixesscylladb/scylladb#24481Closesscylladb/scylladb#24499
* https://github.com/scylladb/scylladb:
test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout.
generic_server: Two-step connection shutdown.
transport: consmetic change, remove extra blanks.
transport: Handle sleep aborted exception in sleep_until_timeout_passes
generic_server: replace empty destructor with `= default`
generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output`
generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class.
test: Add test for query execution during CQL server shutdown
This patch adds an xfailing test reproducing a bug where when adding
an IF NOT EXISTS to a INSERT JSON statement, the IF NOT EXISTS is
ignored.
This bug has been known for 4 years (issue #8682) and even has a FIXME
referring to it in cql3/statements/update_statement.cc, but until now
we didn't have a reproducing test.
The tests in this patch also show that this bug is specific to
INSERT JSON - regular INSERT works correctly - and also that
Cassandra works correctly (and passes the test).
Refs #8682
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25244
This small series fixes two small bugs in the "--release" feature of test/cqlpy/run and test/alternator/run, which allows a developer to run signle-node functional tests against any past release of Scylla. The two patches fix:
1. Allow "run --release" to be used when Scylla has not even been built from source.
2. Fix a mistake in choosing the most recent release when only a ".0" and RC releases are available. This is currently the case for the 2025.2 branch, which is why I discovered the bug now.
Fixes#25223
This patch only affects developer's experience if using the test/cqlpy/run script manually (these scripts are not used by CI), so should not be backported.
Closesscylladb/scylladb#25227
* https://github.com/scylladb/scylladb:
test/cqlpy: fix fetch_scylla.py for .0 releases
test/cqlpy: fix "run --release" when Scylla hasn't been built
This is a refactoring commit. Remove the rf_rack_valid_keyspaces: False
flag because rf_rack_validy is going to become mundatory in
scylladb/scylladb#23526
When repairing a partition with many rows, we can store many fragments in a repair_row_on_wire object which is sent as a rpc stream message.
This could cause reactor stalls when the rpc stream compression is turned on, because the compression compresses the whole message without any split and compression.
This patch solves the problem at the higher level by reducing the message size that is sent to the rpc stream.
Tests are added to make sure the message split works.
Fixes#24808Closesscylladb/scylladb#25002
* github.com:scylladb/scylladb:
repair: Avoid too many fragments in a single repair_row_on_wire
repair: Change partition_key_and_mutation_fragments to use chunked_vector
utils: Allow chunked_vector::erase to work with non-default-constructible type
decrease request timeout.
In debug mode, queries may sometimes take longer than the default 30 seconds.
To address this, the timeout value `request_timeout_on_shutdown_in_seconds`
during tests is aligned with other request timeouts.
Change request timeout for tests from 180s to 90s since we must keep the request
timeout during shutdown significantly lower than the graceful shutdown timeout(2m),
or else a request timeout would cause a graceful shutdown timeout and fail a test.
Adjust `test_service_levels_upgrade`: right before upgrade to topology
on raft, enable an error injection which triggers when the standard role
manager is about to query the legacy auth tables in the
system_auth keyspace. The preceding commit which fixes
scylladb/scylladb#24963 makes sure that the legacy tables are not
queried during upgrade to topology on raft, so the error injection does
not trigger and does not cause a problem; without that commit, the test
fails.
Right now, service levels are migrated in one group0 command and auth
is migrated in the next one. This has a bad effect on the group0 state
reload logic - modifying service levels in group0 causes the effective
service levels cache to be recalculated, and to do so we need to fetch
information about all roles. If the reload happens after SL upgrade and
before auth upgrade, the query for roles will be directed to the legacy
auth tables in system_auth - and the query, being a potentially remote
query, has a timeout. If the query times out, it will throw
an exception which will break the group0 apply fiber and the node will
need to be restarted to bring it back to work.
In order to solve this issue, make sure that the service level module
does not start populating and using the service level cache until both
service levels and auth are migrated to raft. This is achieved by adding
the check both to the cache population logic and the effective service
level getter - they now look at service level's accessor new method,
`can_use_effective_service_level_cache` which takes a look at the auth
version.
Fixes: scylladb/scylladb#24963
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).
However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.
This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.
The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.
Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.
Refs #25115Fixes#24625
Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.
Closesscylladb/scylladb#25151
* https://github.com/scylladb/scylladb:
raft_group0: split shutdown into abort_and_drain and destroy
Revert "main.cc: fix group0 shutdown order"
When repairing a partition with many rows, we can store many fragments
in a repair_row_on_wire object which is sent as a rpc stream message.
This could cause reactor stalls when the rpc stream compression is
turned on, because the compression compresses the whole message without
any split and compression.
This patch solves the problem at the higher level by reducing the
message size that is sent to the rpc stream.
Tests are added to make sure the message split works.
Fixes#24808
With the change in "repair: Avoid too many fragments in a single
repair_row_on_wire", the
std::list<frozen_mutation_fragment> _mfs;
in partition_key_and_mutation_fragments will not contain large number of
fragments any more. Switch to use chunked_vector.
Nowadays the way to configure an internal service is
1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value
The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.
For the reference: similar changes with other services: #23705 , #20174 , #19166Closesscylladb/scylladb#25118
* github.com:scylladb/scylladb:
gms,init: Move get_disabled_features_from_db_config() from gms
code: Update callers generating feature service config
gms: Make feature_config a simple struct
gms: Split feature_config_from_db_config() into two
Compaction is routine and the log messages pollute the log files,
hiding important information.
All the data is available via `nodetool compactionhistory`.
Reduce noise by demoting those log messages to debug level.
One test is adjusted to use debug level for compaction, since it
listens for those messages.
Closesscylladb/scylladb#24949
The test/cqlpy/fetch_scylla.py script is used by test/cqlpy/run and
test/alternator/run to implement their "--release" option - which allows
you to run current tests against any official release of Scylla
downloaded from Scylla's S3 bucket.
When you ask to get release "2025.1", the idea is to fetch the latest
release available in the 2025.1 stream - currently it is 2025.1.5.
fetch_scylla.py does this by listing the available 2025.1 releases,
sorting them and fetching the last one.
We had a bug in the sort order - version 0 was sorted before version
0-rc1, which is incorrect (the version 2025.2.0 came after
2025.2.0~rc1).
For most releases this didn't cause any problem - 0~rc1 was sorted after
0, but 5 (for example) came after both, so 2025.1.5 got downloaded.
But when a release has **only** an rc and a .0 release, we incorrectly
used the rc instead of the .0.
This patch fixes the sort order by using the "/" character, which sorts
before "0", in rc version strings when sorting the release numbers.
Before this patch, we had this problem in "--release 2025.2" because
currently 2025.2 only has RC releases (rc0 and rc1) and a .0 release,
and we wrongly downloaded the rc1. After this patch, the .0 is chosen
as expected:
$ test/cqlpy/run --release 2025.2
Chosen download for ScyllaDB 2025.2: 2025.2.0
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "--release" option of test/cqlpy/run can be used to run current
cqlpy tests against any official release of Scylla, which is
automatically downloaded from Scylla's S3 bucket. You should be
able to run tests like that even without having compiled Scylla
from source. But we had a bug, where test/cqlpy/run looked for
the built Scylla executable *before* parsing the "--release"
option, and this bug is fixed in this patch.
The Alternator version of the run script, test/alternator/run,
doesn't need to be fixed because it already did things in the
right order.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We're enabling the configuration option `rf_rack_valid_keyspaces`
in all Python test suites. All relevant tests have been adjusted
to work with it enabled.
That encompasses the following suites:
* alternator,
* broadcast_tables,
* cluster (already enabled in scylladb/scylladb@ee96f8dcfc),
* cql,
* cqlpy (already enabled in scylladb/scylladb@be0877ce69),
* nodetool,
* rest_api.
Two remaining suites that use tests written in Python, redis and scylla_gdb,
are not affected, at least not directly.
The redis suite requires creating an instance of Scylla manually, and the tests
don't do anything that could violate the restriction.
The scylla_gdb suite focuses on testing the capabilities of scylla-gdb.py, but
even then it reuses the `run` file from the cqlpy suite.
Fixesscylladb/scylladb#25126Closesscylladb/scylladb#24617
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.
Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
* When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
* This breaks LWT correctness in certain scenarios.
* Failing test cases demonstrating this:
* test_lwt_state_is_preserved_on_tablet_migration
* test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
* Tablets may be placed on arbitrary shards by the tablet balancer.
* Accessing Paxos state in system tables could require a shard jump, degrading performance.
We move Paxos state into a dedicated Paxos table, colocated with the base table:
* Each base table gets its own Paxos state table.
* This table is lazily created on the first LWT operation.
* Its tablets are colocated with those of the base table, ensuring:
* Co-migration during tablet movement
* Co-rebuilding with the base table
* Shard alignment for local access to Paxos state
Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].
This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]:
* Tablet migration vs LWT correctness
* Paxos table sharding
Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.
References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)
Backport: not needed because this is a new feature.
Closesscylladb/scylladb#24819
* github.com:scylladb/scylladb:
create_keyspace: fix warning for tablets
docs: fix lwt.rst
docs: fix tablets.rst
alternator: enable LWT
random_failures: enable execute_lwt_transaction
test_tablets_lwt: add test_paxos_state_table_permissions
test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
test_tablets_lwt: test timeout creating paxos state table
test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
test_tablets_lwt: migrate test_lwt_support_with_tablets
test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
test_tablets_lwt: add simple test for LWT
check_internal_table_permissions: handle Paxos state tables
client_state: extract check_internal_table_permissions
paxos_store: handle base table removal
database: get_base_table_for_tablet_colocation: handle paxos state table
paxos_state: use node_local_only mode to access paxos state
query_options: add node_local_only mode
storage_proxy: handle node_local_only in query
storage_proxy: handle node_local_only in mutate
storage_proxy: introduce node_local_only flag
abstract_replication_strategy: remove unused using
storage_proxy: add coordinator_mutate_options
storage_proxy: rename create_write_response_handler -> make_write_response_handler
storage_proxy: simplify mutate_prepare
paxos_state: lazily create paxos state table
migration_manager: add timeout to start_group0_operation and announce
paxos_store: use non-internal queries
qp: make make_internal_options public
paxos_store: conditional cf_id filter
paxos_store: coroutinize
feature_service: add LWT_WITH_TABLETS feature
paxos_state: inline system_keyspace functions into paxos_store
paxos_state: extract state access functions into paxos_store
When shutting down in `generic_server`, connections are now closed in two steps.
First, only the RX (receive) side is shut down. Then, after all ongoing requests
are completed, or a timeout happened the connections are fully closed.
Fixesscylladb/scylladb#24481
This test simulates a scenario where a query is being executed while
the query coordinator begins shutting down the CQL server and client
connections. The shutdown process should wait until the query execution
is either completed or timed out.
Test for scylladb/scylladb#24481
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.
However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.
For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)
Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.
Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome.
Preparation for new functionality, no backporting needed.
Closesscylladb/scylladb#25093
* github.com:scylladb/scylladb:
sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
test/boost: add a test for inexact index lookups
sstables/mx/reader: allow passing a custom index reader to the constructor
sstables/index_reader: remove advance_to
sstables/mx/reader: handle inexact lookups in `advance_context()`
sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
sstables/index_reader: make the return value of `get_partition_key` optional
sstables/mx/reader: handle "backward jumps" in forward_to
sstables/mx/reader: filter out partitions outside the queried range
sstables/mx/reader: update _pr after `fast_forward_to`
As they are wasteful in many cases, it is better
to move the tablet_map if possible, or clone
it gently in an async fiber.
Add clone() and clone_gently() methods to
allow explicit copies.
* minor optimization, no backport needed
Closesscylladb/scylladb#24978
* github.com:scylladb/scylladb:
tablets: prevent accidental copy of tablets_map
locator: tablets: get rid of synchronous mutate_tablet_map
A tablet repair started with /storage_service/repair_async/ API
bypasses tablet repair scheduler and repairs only the tablets
that are owned by the requested node. Due to that, to safely repair
the whole keyspace, we need to first disable tablet migrations
and then start repair on all nodes.
With the new API - /storage_service/tablets/repair -
tailored to tablet repair requirements, we do not need additional
preparation before repair. We may request it on one node in
a cluster only and, thanks to tablet repair scheduler,
a whole keyspace will be safely repaired.
Both nodetool and Scylla Manager have already started using
the new API to repair tablets.
Refuse repairing tablet keyspaces with /storage_service/repair_async -
403 Forbidden is returned. repair_async should still be used to repair
vnode keyspaces.
Fixes: https://github.com/scylladb/scylladb/issues/23008.
Breaking change; no backport.
Closesscylladb/scylladb#24678
* github.com:scylladb/scylladb:
repair: remove unused code
api: repair_async: forbid repairing tablet keyspaces
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).
However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.
This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.
The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.
Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().
Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().
The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.