During decommission, we first mark a topology request as done, then shut
down a node and in the following steps we remove node from the topology.
Thus, finished request does not imply that a node is removed from
the topology.
Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit the connection exception - because
a node is still in topology, even though it is down.
Modify the get_children method to ignore the exception and warn
about the failure instead.
Keep token_metadata_ptr in get_children to prevent topology from changing.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867
Needs backports to all versions
Closesscylladb/scylladb#29035
* github.com:scylladb/scylladb:
tasks: fix indentation
tasks: do not fail the wait request if rpc fails
tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
(cherry picked from commit 2e47fd9f56)
Closesscylladb/scylladb#29531
The test was flaky. The scenario looked like this:
1. Stop server 1.
2. Set its rf_rack_valid_keyspaces configuration option to true.
3. Create an RF-rack-invalid keyspace.
4. Start server 1 and expect a failure during start-up.
It was wrong. We cannot predict when the Raft mutation corresponding to
the newly created keyspace will arrive at the node or when it will be
processed. If the check of the RF-rack-valid keyspaces we perform at
start-up was done before that, it won't include the keyspace. This will
lead to a test failure.
Unfortunately, it's not feasible to perform a read barrier during
start-up. What's more, although it would help the test, it wouldn't be
useful otherwise. Because of that, we simply fix the test, at least for
now.
The new scenario looks like this:
1. Disable the rf_rack_valid_keyspaces configuration option on server 1.
2. Start the server.
3. Create an RF-rack-invalid keyspace.
4. Perform a read barrier on server 1. This will ensure that it has
observed all Raft mutations, and we won't run into the same problem.
5. Stop the node.
6. Set its rf_rack_valid_keyspaces configuration option to true.
7. Try to start the node and observe a failure.
This will make the test perform consistently.
---
I ran the test (in dev mode, on my local machine) three times before
these changes, and three times with them. I include the time results
below.
Before:
```
real 0m47.570s
user 0m41.631s
sys 0m8.634s
real 0m50.495s
user 0m42.499s
sys 0m8.607s
real 0m50.375s
user 0m41.832s
sys 0m8.789s
```
After:
```
real 0m50.509s
user 0m43.535s
sys 0m9.715s
real 0m50.857s
user 0m44.185s
sys 0m9.811s
real 0m50.873s
user 0m44.289s
sys 0m9.737s
```
Fixes SCYLLADB-1137
Backport: The test is present on all supported branches, and so we
should backport these changes to them.
Closesscylladb/scylladb#29218
* github.com:scylladb/scylladb:
test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces
test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py
(cherry picked from commit d52fbf7ada)
Closesscylladb/scylladb#29654
In the unregistered-ID branch, ldap_msgfree() was called on a result
already owned by an RAII ldap_msg_ptr, causing a double-free on scope
exit. Remove the redundant manual free.
Fixes: SCYLLADB-1446
Backport: 2026.1, 2025.4, 2025.1 - it's a memory corruption, with a one-line fix, so better backport it everywhere.
Closesscylladb/scylladb#29302
* github.com:scylladb/scylladb:
test: ldap: add regression test for double-free on unregistered message ID
ldap: fix double-free of LDAPMessage in poll_results()
(cherry picked from commit 895fdb6d29)
Closesscylladb/scylladb#29393Closesscylladb/scylladb#29454Closesscylladb/scylladb#29646
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.
During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.
Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.
Fixes: CUSTOMER-317
Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.
Closesscylladb/scylladb#29242
(cherry picked from commit 91df3795fc)
Closesscylladb/scylladb#29526Closesscylladb/scylladb#29582Closesscylladb/scylladb#29615
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.
Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.
We also provide a regression test. It takes 6s in dev mode.
Fixes SCYLLADB-959
Closesscylladb/scylladb#29266
(cherry picked from commit b9f82f6f23)
Closesscylladb/scylladb#29291Closesscylladb/scylladb#29308Closesscylladb/scylladb#29419
`data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead.
Added tests after the fix. Manually checked that tests fail without the fix.
Fixes SCYLLADB-1350
This is a fix that prevents format crash. No known occurrence in production, but backport is desirable.
Closesscylladb/scylladb#29262
* github.com:scylladb/scylladb:
test: boost: test null data value to_parsable_string
cql3: fix null handling in data_value formatting
(cherry picked from commit 816f2bf163)
Closesscylladb/scylladb#29468
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
(cherry picked from commit 2e68f48068)
Closesscylladb/scylladb#29280
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.
Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.
This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103
Backport to all supported branches (other than 2026.1), as the test can fail
there.
Closesscylladb/scylladb#29108
(cherry picked from commit 1398a55d16)
Closesscylladb/scylladb#29206
When we build a materialized view we read the entire base table from start to
end to generate all required view udpates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view udpates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")
If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`
In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.
Fixes https://github.com/scylladb/scylladb/issues/26523Closesscylladb/scylladb#26635
(cherry picked from commit 0a22ac3c9e)
Closesscylladb/scylladb#26880
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
node A is missing data, digest mismatches and data reconciliation
is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
failing the entire read query
When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...
> await finish_writes()
test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
worker_id = 1
...
> rows = await cql.run_async(rd_stmt, [pk])
E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```
Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.
However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.
This change:
- Retries the verification read in start_writes() on failure to mitigate
races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs
Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529
Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125Closesscylladb/scylladb#28140
(cherry picked from commit e07fe2536e)
Closesscylladb/scylladb#28827
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
(cherry picked from commit f83f911bae)
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
(cherry picked from commit a256ba7de0)
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
(cherry picked from commit c5239edf2a)
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
(cherry picked from commit ac4af5f461)
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
(cherry picked from commit 628e74f157)
The test can currently fail like this:
```
> await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
because node A performs tablet migrations at the the same time.
What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.
The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.
Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.
The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.
Fixes#25938Closesscylladb/scylladb#28632
(cherry picked from commit 0693091aff)
Closesscylladb/scylladb#28670
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`.
We also extend `test_maintenance_mode` to provide a reproducer for #27988
and to avoid similar bugs in the future.
Fixes#27988
This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.
- (cherry picked from commit a08c53ae4b)
- (cherry picked from commit 9d4a5ade08)
- (cherry picked from commit c92962ca45)
- (cherry picked from commit 408c6ea3ee)
- (cherry picked from commit 53f58b85b7)
- (cherry picked from commit 867a1ca346)
- (cherry picked from commit 6c547e1692)
- (cherry picked from commit 7e7b9977c5)
Parent PR: #28322Closesscylladb/scylladb#28496
* https://github.com/scylladb/scylladb:
test: test_maintenance_mode: enable maintenance mode properly
test: test_maintenance_mode: shutdown cluster connections
test: test_maintenance_mode: run with different keyspace options
test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
test: test_maintenance_mode: get rid of the conditional skip
test: test_maintenance_mode: remove the redundant value from the query result
storage_proxy: skip validate_read_replica in maintenance mode
storage_service: set up topology properly in maintenance mode
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.
(cherry picked from commit 7e7b9977c5)
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.
The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.
(cherry picked from commit 867a1ca346)
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.
Additionally, we check if the error message is as expected.
(cherry picked from commit 53f58b85b7)
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.
We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).
It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.
(cherry picked from commit 408c6ea3ee)
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.
(cherry picked from commit 92dbde54a5)
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.
A reproducer is included which fails before and passes after the fix.
Fixes: #26847Closesscylladb/scylladb#28163
(cherry picked from commit a53f989d2f)
Closesscylladb/scylladb#28277
…played
Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.
Update _last_replay only if all batches were replayed.
Fixes: https://github.com/scylladb/scylladb/issues/24415.
Needs backport to all live versions.
- (cherry picked from commit 4d0de1126f)
- (cherry picked from commit e3dcb7e827)
Parent PR: #26793Closesscylladb/scylladb#27088
* github.com:scylladb/scylladb:
test: extend test_batchlog_replay_failure_during_repair
db: batchlog_manager: update _last_replay only if all batches were replayed
test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits
for the first node that logs info about repair_writer using asyncio.wait.
The done group is never awaited, so we never learn about the error.
The test itself is incorrect and the log about repair_writer is never
printed. We never learn about that and tests finishes successfully
after 10 minutes timeout.
Fix the test:
- disable hinted handoff;
- repair tablets of the whole table:
- new table is added so that concurrent migration is possible;
- use wait_for_first_completed that awaits done group;
- do some cleanups.
Remove nightly mark.
Fixes: #26148.
Closesscylladb/scylladb#26209
(cherry picked from commit 48bbe09c8b)
Closesscylladb/scylladb#26217
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.
The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.
Fixes#27990Closesscylladb/scylladb#28009
(cherry picked from commit bee5f63cb6)
Closesscylladb/scylladb#28174
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.
The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.
We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.
Fixes#28253
The fix addresses CI flakiness and only changes the test, so it
should be backported.
Closesscylladb/scylladb#28274
(cherry picked from commit 1f0f694c9e)
Closesscylladb/scylladb#28302
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
(cherry picked from commit e3dcb7e827)
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources aquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk for
deadlock, like it is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits blockings each other from admission.
Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanism: it marks its permits as
inactive and later it also uses release_base_resources(). This partice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.
Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
reader_concurrency_semaphore::signal()
* all places where base resources are released now call
release_base_resources()
A reproducer unit test is added, which fails before and passes after the
fix.
Fixes: #28083Closesscylladb/scylladb#28155
(cherry picked from commit b7bc48e7b7)
Closesscylladb/scylladb#28238
It is observed that:
repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error
(repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])
This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.
To fix, we can relax the check of (start, end] for the min max range.
Fixes#27220
Backport to all active branches.
(cherry picked from commit e97a504)
Parent PR: #27357Closesscylladb/scylladb#27458
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:
```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
Fixes#28195.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#28197
(cherry picked from commit 8aca7b0eb9)
Closesscylladb/scylladb#28207
Note: only the fix for replace with different IP has been backported to
2025.1. The IP-based gossiper in 2025.1 has made backporting the fix
for replace with the same IP too difficult.
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
require IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth to complicate the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).
Fixes#28057
Backport this PR to all branches as it fixes a problematic bug.
- (cherry picked from commit fc4c2df2ce)
- (cherry picked from commit 4526dd93b1)
- (cherry picked from commit 749b0278e5)
- (cherry picked from commit 0fed9f94f8)
Manually cherry-picked:
- 90b5b2c5f5
- 92b165b8c0
Parent PR: #27435Closesscylladb/scylladb#28096
* github.com:scylladb/scylladb:
test: introduce test_full_shutdown_during_replace
utils: error_injection: allow aborting wait_for_message
raft topology: preserve IP -> ID mapping of a replacing node on restart
pylib/rest_client.py: encode injection name
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.
Fixes#25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#25563
(cherry picked from commit 149f9d8448)
Closesscylladb/scylladb#25630
Fixes#27992
When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeep here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).
Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.
The same problem applies later when we try to sanity check the return of our permits.
Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)
Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).
Closesscylladb/scylladb#27998
(cherry picked from commit a7cdb602e1)
Closesscylladb/scylladb#28095
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.
The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.
We fix the problem by making the function wait until the schema versions
match.
Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.
Fixes#23803Closesscylladb/scylladb#28114
(cherry picked from commit 6b5923c64e)
Closesscylladb/scylladb#28172
Sometimes it's convenient to use slashes in injection names,
for example my_component/my_method/my_condition. Without quote()
we get 'handler not found' error from Scylla.
(cherry picked from commit 92b165b8c0)
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.
This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).
Also, change database_test::do_with_some_data is use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.
Fixes#27639
* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions
- (cherry picked from commit ec4069246d)
- (cherry picked from commit 5be6b80936)
- (cherry picked from commit 0342a24ee0)
- (cherry picked from commit 02ee341a03)
- (cherry picked from commit 2a803d2261)
- (cherry picked from commit 93b827c185)
- (cherry picked from commit ebd667a8e0)
Parent PR: #27643Closesscylladb/scylladb#28067
* https://github.com/scylladb/scylladb:
test: database_test: do_with_some_data: randomize keys
database: truncate_table_on_all_shards: drop outdated TODO comment
database: truncate_table_on_all_shards: consider can_flush on all shards
memtable_list: unify can_flush and may_flush
test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
replica: table, storage_group, compaction_group: add needs_flush
test: database_test: do_with_some_data_in_thread: accept void callback function
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:
```
@pytest.mark.asyncio
async def test_not_enough_token_owners(manager: ManagerClient):
"""
Test that:
- the first node in the cluster cannot be a zero-token node
- removenode and decommission of the only token owner fail in the presence of zero-token nodes
- removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
owners would fall below the RF of some keyspace using tablets
"""
logging.info('Trying to add a zero-token server as the first server in the cluster')
await manager.server_add(config={'join_ring': False},
property_file={"dc": "dc1", "rack": "rz"},
expected_error='Cannot start the first node in the cluster as zero-token')
logging.info('Adding the first server')
server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})
logging.info('Adding two zero-token servers')
# The second server is needed only to preserve the Raft majority.
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]
logging.info(f'Trying to decommission the only token owner {server_a}')
await manager.decommission_node(server_a.server_id,
expected_error='Cannot decommission the last token-owning node in the cluster')
logging.info(f'Stopping {server_a}')
await manager.server_stop_gracefully(server_a.server_id)
logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
await manager.remove_node(server_b.server_id, server_a.server_id,
expected_error='cannot be removed because it is the last token-owning node in the cluster')
logging.info(f'Starting {server_a}')
await manager.server_start(server_a.server_id)
logging.info('Adding a normal server')
await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})
cql = manager.get_cql()
await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)
> async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:
test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None
@asynccontextmanager
async def new_test_keyspace(manager: ManagerClient, opts, host=None):
"""
A utility function for creating a new temporary keyspace with given
options. It can be used in a "async with", as:
async with new_test_keyspace(ManagerClient, '...') as keyspace:
"""
keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
try:
yield keyspace
except:
logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
raise
else:
> await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})
test/cluster/util.py:544: NoHostAvailable
```
Fixes#28011Closesscylladb/scylladb#28040
(cherry picked from commit 34df158605)
Closesscylladb/scylladb#28063
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" of thin air. This
kind of leaks looks benign at first sight, a few extra resources won't
hurt anyone so long as this is a small amount. But turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
all available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't now how this negative leak happens (the code
doesn't confess), with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitely configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clams the _resources to _initial_resources, to
prevent any damage from the negativae leak.
I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.
Fixes: SCYLLADB-163
Closesscylladb/scylladb#27764
(cherry picked from commit e4da0afb8d)
Closesscylladb/scylladb#28000
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
Test that table::flush waits on outstanding flushes, even if the active memtable is empty
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
Many test cases already assume `func` is being called a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre existing issue. Backport is required to all recent 2025.x versions.
- (cherry picked from commit 669286b1d6)
- (cherry picked from commit 67f1c6d36c)
- (cherry picked from commit 6163fedd2e)
Parent PR: #27413Closesscylladb/scylladb#27425
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog reply fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
- (cherry picked from commit 502b03dbc6)
- (cherry picked from commit 904183734f)
- (cherry picked from commit 7f20b66eff)
- (cherry picked from commit e1b2180092)
- (cherry picked from commit d436233209)
- (cherry picked from commit 1935268a87)
- (cherry picked from commit 6fc43f27d0)
Parent PR: #26319Closesscylladb/scylladb#26752
* github.com:scylladb/scylladb:
repair: throw if flush failed in get_flush_time
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.
The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.
By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.
The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.
Fixes: scylladb/scylladb#27307
Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).
- (cherry picked from commit 3af5183633)
- (cherry picked from commit 4ba3e90f33)
Parent PR: #28010Closesscylladb/scylladb#28036
* https://github.com/scylladb/scylladb:
test/raft: use valid sentinel in liveness check to prevent digest errors
test/raft: improve debugging in randomized_nemesis_test
test/raft: improve reporting in the randomized_nemesis_test digest functions