It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes#19959Closesscylladb/scylladb#26638
(cherry picked from commit 5321720853)
Closesscylladb/scylladb#26755
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.
We fix this issue in this PR and add a regression test.
This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.
Fixes#26534
This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).
- (cherry picked from commit 1d09b9c8d0)
- (cherry picked from commit 6b2e003994)
- (cherry picked from commit c57f097630)
Parent PR: #26612Closesscylladb/scylladb#26678
* https://github.com/scylladb/scylladb:
test: test group0 tombstone GC in the Raft-based recovery procedure
group0_state_id_handler: remove unused group0_server_accessor
group0_state_id_handler: consider state IDs of all non-ignored topology members
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.
We fix this issue in this commit by considering topology members
instead.
We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.
We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.
We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
effort to minimize external dependencies in Scylla.
(cherry picked from commit 1d09b9c8d0)
ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well.
This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality.
Fixes#25195.
- (cherry picked from commit 1106157756)
- (cherry picked from commit ea41f652c4)
- (cherry picked from commit a7e46974d4)
- (cherry picked from commit e1d9c83406)
- (cherry picked from commit 8d5bd212ca)
- (cherry picked from commit 6ba0fa20ee)
- (cherry picked from commit 8410532fa0)
Parent PR: #26003Closesscylladb/scylladb#26300
* github.com:scylladb/scylladb:
test/cluster: Add tests for invalid SSTable compression options
test/boost: Add tests for SSTable compression config options
main: Validate SSTable compression options from config
db/config: Add SSTable compression options for user tables
db/config: Prepare compression_parameters for config system
compressor: Validate presence of sstable_compression in parameters
compressor: Add missing space in exception message
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
We fix this issue and add a regression test in this PR.
Fixes#26569
This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).
- (cherry picked from commit ec3a35303d)
- (cherry picked from commit da8748e2b1)
- (cherry picked from commit 71de01cd41)
Parent PR: #26572Closesscylladb/scylladb#26596
* github.com:scylladb/scylladb:
test: test_raft_recovery_entry_loss: fix the typo in the test case name
test: verify that schema pulls are disabled in the Raft-based recovery procedure
raft topology: disable schema pulls in the Raft-based recovery procedure
Complementary to the previous patch. It triggers semantic validation
checks in `compression_parameters::validate()` and expects the server to
exit. The tests examine both command line and YAML options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8410532fa0)
Since patch 03461d6a54, all boost unit tests depending on `cql_test_env`
are compiled into a single executable (`combined_tests`). Add the new
test in there.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 6ba0fa20ee)
`compression_parameters` provides two levels of validation:
* syntactic checks - implemented in the constructor
* semantic checks - implemented by `compression_parameters::validate()`
The former are applied implicitly when parsing the options from the
command line or from scylla.yaml. The latter are currently not applied,
but they should.
In lack of a better place, apply them in main, right after joining the
cluster, to make sure that the cluster features have been negotiated.
The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation
will fail if the feature is disabled and a dictionary compression
algorithm has been selected.
Also, mark `validate()` as const so that it can be called from a config
object.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 8d5bd212ca)
ScyllaDB offers the `compression` DDL property for configuring
compression per user table (compression algorithm and chunk size). If
not specified, the default compression algorithm is the LZ4Compressor
with a 4KiB chunk size (refer to the default constructor for
`compression_parameters`). The same default applies to system tables as
well.
Add a new configuration option to allow customizing the default for user
tables. Use the previously hardcoded default as the new option's default
value.
Note that the option has no effect on ALTER TABLE statements. An altered
table either inherits explicit compression options from the CQL
statement, or maintains its existing options.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit e1d9c83406)
SSTable compression is currently configurable only per table, via the
`compression` property in CREATE/ALTER TABLE statements. This is
represented internally via the `compression_parameters` class. We plan
to offer the same options via the configuration as well, to make the
default compression method for user tables configurable.
This patch prepares the ground by making the `compression_parameters`
usable as a `config_file::named_value`, namely:
* Define an extraction operator (required by `boost::program_options`
for parsing the options from command line).
* Define a formatter (required by `named_value::operator()`).
* Define a template specialization for `config_type_for` (required by
`named_value` constructor).
* Define a yaml converter (required for parsing the options from
scylla.yaml).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit a7e46974d4)
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
to add a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
(cherry picked from commit da8748e2b1)
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.
Fixes#26569
(cherry picked from commit ec3a35303d)
It turns out that Boost assertions are thread-unsafe,
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.
Fixesscylladb/scylladb#24982Closesscylladb/scylladb#26472
(cherry picked from commit 7c6e84e2ec)
Closesscylladb/scylladb#26550
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.
Fix that by making the test wait for CQL availability via `get_ready_cql()`.
Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.
Fixesscylladb/scylladb#25362Closesscylladb/scylladb#25366
(cherry picked from commit 85fd4d23fa)
Closesscylladb/scylladb#26513
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixes https://github.com/scylladb/scylladb/issues/25290
backport possible to improve stability
- (cherry picked from commit a1161c156f)
- (cherry picked from commit 3c3dd4cf9d)
- (cherry picked from commit ad1a5b7e42)
Parent PR: #26180Closesscylladb/scylladb#26477
* github.com:scylladb/scylladb:
service/qos: set long timeout for auth queries on SL cache update
auth: add query_state parameter to query functions
auth: refactor query_all_directly_granted
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.
The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
example, all 3 nodes that first joined the new group 0 are in the same
dc and rack, but only one of them becomes a voter because the voter
handler tries to make non-members in other dcs/racks voters.
Fixes#26321Closesscylladb/scylladb#26327
(cherry picked from commit 67d48a459f)
Closesscylladb/scylladb#26426
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixesscylladb/scylladb#25290
(cherry picked from commit ad1a5b7e42)
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.
in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.
(cherry picked from commit 3c3dd4cf9d)
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.
This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.
(cherry picked from commit a1161c156f)
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.
Fixes#25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#25563
(cherry picked from commit 149f9d8448)
Closesscylladb/scylladb#25631
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.
Fixes: scylladb/scylladb#26320
Broken documentation link fix for the tool help output, needs backport to all live source-available versions.
- (cherry picked from commit 5a69838d06)
- (cherry picked from commit 15a4a9936b)
- (cherry picked from commit fe73c90df9)
Parent PR: #26322Closesscylladb/scylladb#26387
* github.com:scylladb/scylladb:
tools/scylla-sstable: fix doc links
release: adjust doc_link() for the post source-available world
tools/scylla-nodetool: remove trailing " from doc urls
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.
(cherry picked from commit fe73c90df9)
Always set the node ops progress to 100% when the operation finishes,
regardless of success or failure. This ensures the progress never
remains below 100%, which would otherwise indicates a pending node
operation in case of an error.
Fixes#26193Closesscylladb/scylladb#26194
(cherry picked from commit b31e651657)
Closesscylladb/scylladb#26265
Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer.
Fixes#26184
The issue is present in 2025.x as well and looks cheap to backport
- (cherry picked from commit 8438c59ad3)
Parent PR: #26185
Also includes backport of #24835 which also applies to 2025.2 and is now crucial.
The scylla_io_queues.ticket() method is renamed by this backport, but without 24835 it will be problematic to fix all callers of it
Closesscylladb/scylladb#26263
* github.com:scylladb/scylladb:
scylla-gdb: Fix fair-queue entry printing
scylla-gdb: Don't show io_queue executing and queued resources
Currently, while stopping the compaction_manager, we stop task_manager
compaction module and concurrently run compaction_manager::really_do_stop.
really_do_stop stops and waits for all task_executors that are kept
in compaction_manager::_tasks, but nothing ensures that no more tasks will
be added there. Due to leftover tasks, we trigger on_fatal_internal_error.
Modify the order of compaction_manager::stop. After the change, we stop
compaction tasks in the following order:
- abort module abort source;
- close module gate in the background;
- stop_ongoing_compactions (kept in compaction_manager::_tasks);
- wait until module gate is closed.
Check module abort source before creating compaction executor and
adding it to _tasks.
Thanks to the above, we can be sure that:
- after module::stop there will be no tasks in _tasks;
- compaction_manager::stop aborts all tasks; we don't wait for any whole
compaction to finish.
Fixes: https://github.com/scylladb/scylladb/issues/25806.
Fixes shutdown bug; Needs backports to all version
- (cherry picked from commit 17707d0e6b)
- (cherry picked from commit 97c77d7cd5)
Parent PR: #25885Closesscylladb/scylladb#26223
* github.com:scylladb/scylladb:
compaction: move _tasks check
compaction: stop compaction module in really_do_stop
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.
To fix this, let's make split compaction always work with the state when it started, not a global state.
Fixes#24153.
All 2025.* versions are vulnerable, so fix must be backported to them.
- (cherry picked from commit 0c1587473c)
- (cherry picked from commit 68f23d54d8)
Parent PR: #25690Closesscylladb/scylladb#25934
* github.com:scylladb/scylladb:
replica: Fix split compaction when tablet boundaries change
replica: Futurize split_compaction_options()
Catching a live entry in IO queue is very rare event, so we haven't seen
it so far, but the `_ticket` member had been removed ~2 years ago and
had been replaced with `_capacity` which is plain 64bit integer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26185
(cherry picked from commit 8438c59ad3)
These counters are no longer accounted by io-queue code and are always
zero. Even more -- accounting removal happened years ago and we don't
have Scylla versions built with seastar older than that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#24835
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes
After last step, split compaction initiated in step 2 can fail
because it works with the global tablet map, rather than the
map when the compaction started. With the global state changing
under its feet, on merge, the mutation splitting writer will
think it's going backwards since sibling tablets are merged.
This problem was also seen when running load-and-stream, where
split initiated by the sstable writer failed, split completed,
and the unsplit sstable is left in the table dir, causing
problems in the restart.
To fix this, let's make split compaction always work with
the state when it started, not a global state.
Fixes#24153.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 68f23d54d8)
SSTable compression parameters should always define an algorithm via the
`sstable_compression` sub-option. Add a check in the constructor to
ensure this is always provided (unless no options are given, which is
interpreted as "no compression").
This change has no user-visible effect, since the same check is already
performed at a higher-level, while validating the CQL properties of
CREATE TABLE and ALTER TABLE statements (see `cf_prop_defs::validate()`).
However, it will become useful in later patches, when compression config
options will be introduced.
Although now redundant, keep the sanity check in
`cf_prop_defs::validate()` to maintain consistency of error messages
with Cassandra.
Note also that Cassandra uses 'class' instead of 'sstable_compression'
since version 3.11.10, but Scylla still doesn't support this, see:
https://github.com/scylladb/scylladb/issues/4200
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit ea41f652c4)
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
- new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closesscylladb/scylladb#26209
(cherry picked from commit 48bbe09c8b)
Closesscylladb/scylladb#26219
In compaction_manager::really_do_stop we check whether _tasks list
is empty after the compactions are stopped. However, a new task may
still sneak in, causing the assertion failure. Such a task won't
be there for long - module::make_task will fail as the module is
already stopped.
Move the assertion, that checks if _tasks is empty, after the
compaction_states' gates are closed.
Fixes: #25806.
(cherry picked from commit 97c77d7cd5)
Currently, compaction::task_manager_module is stopped in compaction_manager::stop,
concurrently to really_do_stop. We can't predict the order of the two.
Do not set _task_manager_module to nullptr at stop, because
compaction_manager::really_do_stop() may be called before the actual
shutdown, while other components still try to use it.
compaction::task_manager_module does not keep a pointer to compaction_manager,
so we won't end up with memory leak.
Stop compaction module in really_do_stop, after ongoing compactions
are stopped.
It's a preparation for further patches.
(cherry picked from commit 17707d0e6b)