vector_search: fix race condition on connection timeout
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness.
- (cherry picked from commit 3107d9083e)
Parent PR: #29031Closesscylladb/scylladb#29360
* github.com:scylladb/scylladb:
vector_search: test: fix flaky test
vector_search: fix race condition on connection timeout
`data_value::to_parsable_string()` crashes with a null pointer dereference when called on a `null` data_value. Return `"null"` instead.
Added tests after the fix. Manually checked that tests fail without the fix.
Fixes SCYLLADB-1350
This is a fix that prevents format crash. No known occurrence in production, but backport is desirable.
Closesscylladb/scylladb#29262
* github.com:scylladb/scylladb:
test: boost: test null data value to_parsable_string
cql3: fix null handling in data_value formatting
(cherry picked from commit 816f2bf163)
Closesscylladb/scylladb#29384Closesscylladb/scylladb#29434
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128
Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side
Closesscylladb/scylladb#29110
* github.com:scylladb/scylladb:
encryption: fix deadlock in encrypted_data_source::get()
Fix formatting after previous patch
Fix indentation after previous patch
(cherry picked from commit 3b9398dfc8)
Closesscylladb/scylladb#29198Closesscylladb/scylladb#29359
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.
Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.
We also provide a regression test. It takes 6s in dev mode.
Fixes SCYLLADB-959
Closesscylladb/scylladb#29266
(cherry picked from commit b9f82f6f23)
Closesscylladb/scylladb#29291Closesscylladb/scylladb#29308
The test assumes that the sleep duration will be at least the value of
the sleep parameter. However, the actual sleep time can be slightly less
than requested (e.g., a 100ms sleep request might result in a 99ms
sleep).
This commit adjusts the test's time comparison to be more lenient,
preventing test flakiness.
When a `with_connect` operation timed out, the underlying connection
attempt continued to run in the reactor. This could lead to a crash
if the connection was established/rejected after the client object had
already been destroyed. This issue was observed during the teardown
phase of a upcoming high-availability test case.
This commit fixes the race condition by ensuring the connection attempt
is properly canceled on timeout.
Additionally, the explicit TLS handshake previously forced during the
connection is now deferred to the first I/O operation, which is the
default and preferred behavior.
Fixes: SCYLLADB-832
Use get_cql_exclusive(node1) so the driver only connects to node1 and
never attempts to contact the stopped node2. The test was flaky because
the driver received `Host has been marked down or removed` from node2.
Fixes: SCYLLADB-1227
Closesscylladb/scylladb#29268
(cherry picked from commit ab43420d30)
Closesscylladb/scylladb#29278Closesscylladb/scylladb#29355
The test was using time.sleep(1) (a blocking call) to wait after
scheduling the stop_compaction task, intending to let it register on
the server before releasing the sstable_cleanup_wait injection point.
However, time.sleep() blocks the asyncio event loop entirely, so the
asyncio.create_task(stop_compaction) task never gets to run during the
sleep. After the sleep, the directly-awaited message_injection() runs
first, releasing the injection point before stop_compaction is even
sent. By the time stop_compaction reaches Scylla, the cleanup has
already completed successfully -- no exception is raised and the test
fails.
Fix by replacing time.sleep(1) with await asyncio.sleep(1), which
yields control to the event loop and allows the stop_compaction task
to actually send its HTTP request before message_injection is called.
Fixes: SCYLLADB-834
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29202
(cherry picked from commit 068a7894aa)
Closesscylladb/scylladb#29277Closesscylladb/scylladb#29356
Currently, the view_update_generator::mutate_MV function acquires a
reference to the keyspace relevant to the operation, then it calls
max_concurrent_for_each and uses that reference inside the lambda passed
to that function. max_concurrent_for_each can preempt and there is no
mechanism that makes sure that the keyspace is alive until the view
updates are generated, so it is possible that the keyspace is freed by
the time the reference is used.
Fix the issue by precomputing the necessary information based on the
keyspace reference right away, and then passing that information by
value to the other parts of the code. It turns out that we only need to
know the replication factor of the datacenter and whether the keyspace
uses a network topology strategy.
Fixes: scylladb/scylladb#28925Closesscylladb/scylladb#28928
(cherry picked from commit 42d70baad3)
Closesscylladb/scylladb#28968Closesscylladb/scylladb#29095
During raft-topology upgrade in 2026.1, service_level_controller::migrate_to_v2() returns early when system_distributed.service_levels is empty. This skips the service_level_version = 2 write, so the cluster is never marked as upgraded to service levels v2 even though there is no data to migrate. Subsequent upgrades may then fail the startup check which requires service_level_version == 2.
Remove the early return and let the migration commit the version marker even when there are no legacy service levels rows to copy.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1198
backport: should be backported to all versions that can be upgraded to 2026.2
Closesscylladb/scylladb#29333
* github.com:scylladb/scylladb:
test/auth_cluster: cover empty legacy table in service level upgrade
service_levels: mark v2 migration complete on empty legacy table
(cherry picked from commit 95e422db48)
Closesscylladb/scylladb#29352
The tests in test_out_of_space_prevention.py are flaky. Three issues contribute:
1. After creating/removing the blob file that simulates disk pressure,
the tests immediately checked derived state (e.g., "compaction_manager
- Drained") without first confirming the disk space monitor had detected
the utilization change. Fix: explicitly wait for "Reached/Dropped below
critical disk utilization level" right after creating/removing the blob
file, before checking downstream effects.
2. Several tests called `manager.driver_connect()` or omitted reconnection
entirely after `server_restart()` / `server_start()`. The pre-existing
driver session can silently reconnect multiple times, causing subsequent
CQL queries to fail. Fix: call `reconnect_driver()` after every node restart.
Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward,
to ensure all connection pools are established.
3. Some log assertions used marks captured before a restart, so they could
match pre-restart messages or miss messages emitted in the correct post-restart
window. Fix: refresh marks at the right points.
Apart from that, the patch fixes a typo: autotoogle -> autotoggle.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655Closesscylladb/scylladb#28626
(cherry picked from commit 826fd5d6c3)
Closesscylladb/scylladb#28967Closesscylladb/scylladb#29197
The test intentionally creates huge index pages.
But since 5e7fb08bf3,
the index reader allocates a block of memory for a whole index page,
instead of incrementally allocating small pieces during index parsing.
This giant allocation causes the test to fail spuriously in CI sometimes.
Fix this by disabling sstable compression on the test table,
which puts a hard cap of 2000 keys per index page.
Fixes: SCYLLADB-1152
Closesscylladb/scylladb#29152
(cherry picked from commit f29525f3a6)
Closesscylladb/scylladb#29172Closesscylladb/scylladb#29259
The removenove initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.
Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.
This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103
Backport to all supported branches (other than 2026.1), as the test can fail
there.
Closesscylladb/scylladb#29108
(cherry picked from commit 1398a55d16)
Closesscylladb/scylladb#29205
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
(cherry picked from commit 2e68f48068)
Closesscylladb/scylladb#29006Closesscylladb/scylladb#29038
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
Fixes: SCYLLADB-996
- (cherry picked from commit 01ddc17ab9)
Parent PR: #28839Closesscylladb/scylladb#28977
* github.com:scylladb/scylladb:
Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
mv: remove dead code in view_updates::can_skip_view_updates
Closesscylladb/scylladb#29094
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).
Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.
A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913Closesscylladb/scylladb#28998
(cherry picked from commit 526e5986fe)
Closesscylladb/scylladb#29068Closesscylladb/scylladb#29097
Set enable_schema_commitlog for each group0 tables.
Assert that group0 tables use schema commitlog in ensure_group0_schema
(per each command).
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.
Needs backport to all live releases as all are vulnerable
Closesscylladb/scylladb#28876
* github.com:scylladb/scylladb:
test: add test_group0_tables_use_schema_commitlog
db: service: remove group0 tables from schema commitlog schema initializer
service: ensure that tables updated via group0 use schema commitlog
db: schema: remove set_is_group0_table param
(cherry picked from commit b90fe19a42)
Closesscylladb/scylladb#28916Closesscylladb/scylladb#28986
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547
Fixes: SCYLLADB-802
Fixes: SCYLLADB-825
Fixes: SCYLLADB-826
Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.
- (cherry picked from commit bf369326d6)
Parent PR: #28879Closesscylladb/scylladb#28895
* github.com:scylladb/scylladb:
vector_search: test: include ANN error in assertion
vector_search: test: fix HTTPS client test flakiness
When the test fails, the assertion message does not include
the error from the ANN request.
This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.
This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.
Fixes: VECTOR-547, SCYLLADB-802
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closesscylladb/scylladb#28829
(cherry picked from commit ba7f314cdc)
Closesscylladb/scylladb#28953
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
from the previous step
4. Coordinator sends data/digest requests to the nodes A,B. Since the
node A is missing data, digest mismatches and data reconciliation
is triggered
5. The node A or B fails, becomes unavailable, etc
6. During reconciliation, data requests are sent to node A,B and fail
failing the entire read query
When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...
> await finish_writes()
test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
worker_id = 1
...
> rows = await cql.run_async(rd_stmt, [pk])
E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```
Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.
However, the same speculative read is cancelled the moment, the read
query reaches CL which may trigger a read-repair.
This change:
- Retries the verification read in start_writes() on failure to mitigate
races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs
Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529
Note that this change test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125Closesscylladb/scylladb#28140
(cherry picked from commit e07fe2536e)
Closesscylladb/scylladb#28825
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610Closesscylladb/scylladb#28801
(cherry picked from commit bb57b0f3b7)
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204Closesscylladb/scylladb#28746
(cherry picked from commit cd4caed3d3)
Closesscylladb/scylladb#28792
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.
The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.
This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.
The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:
get() return_all — returns 1 unit
put() return_all — returns 0 units
get() consume_units — takes back 1 unit
put() consume_units — takes back 0 units
Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.
Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd
- (cherry picked from commit 0376d16ad3)
- (cherry picked from commit 3b98451776)
Parent PR: #28530Closesscylladb/scylladb#28715
* github.com:scylladb/scylladb:
test: decrease strain in test_startup_response
test: auth_cluster: add test for hanged AUTHENTICATING connections
transport: fix connection code to consume only initially taken semaphore units
transport: remove redundant futurize_invoke from counted data sink and source
For 2025.3 and 2025.4 this test runs order of magnitude
slower in debug mode. Potentially due to passwords::check
running in alien thread and overwhelming the CPU (this is
fixed in newer versions).
Decreasing the number of connections in test makes it fast
again, without breaking reproducibility.
As additional measure we double the timeout.
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters
- (cherry picked from commit 289e910cec)
- (cherry picked from commit 6280cb91ca)
- (cherry picked from commit 960adbb439)
Parent PR: #28592Closesscylladb/scylladb#28696
* github.com:scylladb/scylladb:
s3_client: add more constrains to the calc_part_size
s3_client: add tests for calc_part_size
s3_client: correct multipart part-size logic to respect 10k limit
Paxos state tables are internal tables fully managed by Scylla
and they shouldn't be exposed to the user nor they shouldn't be backed up.
This commit hides those kind of tables from all listings and if such table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
withing a comment and a note for the user is added.
Fixes https://github.com/scylladb/scylladb/issues/28183
LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.
- (cherry picked from commit f89a8c4ec4)
- (cherry picked from commit 9baaddb613)
Parent PR: #28230Closesscylladb/scylladb#28507
* github.com:scylladb/scylladb:
test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
cql3/statements/describe_statement: hide paxos state tables
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
acquires a read lock on the vector of listeners, and calls the
callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
write lock on the vector of listeners. it waits because the lock is
held.
A: the callback function calls another notification and calls
thread_for_each which tries to acquire the read lock again. but it
waits since there is a waiter.
Currently we have such concrete scenario when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.
Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
Fixesscylladb/scylladb#27364Closesscylladb/scylladb#27637
(cherry picked from commit 55f4a2b754)
Closesscylladb/scylladb#28557
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
---
Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.
Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.
---
The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):
Before:
CPU utilization: 0.9%
real 2m7.811s
user 0m25.446s
sys 0m16.733s
After:
CPU utilization: 1.1%
real 1m40.288s
user 0m25.218s
sys 0m16.566s
---
Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203
Backport: This improves the stability of our CI, so let's
backport it to all supported versions.
- (cherry picked from commit 628e74f157)
- (cherry picked from commit ac4af5f461)
- (cherry picked from commit c5239edf2a)
- (cherry picked from commit a256ba7de0)
- (cherry picked from commit f83f911bae)
Parent PR: #28602Closesscylladb/scylladb#28622
* github.com:scylladb/scylladb:
test: cluster: Reduce wait time in test_sync_point
test: cluster: Fix test_sync_point
test: cluster: Await sync points asynchronously
test: cluster: Create sync points asynchronously
test: cluster: Fetch hint metrics asynchronously
The test can currently fail like this:
```
> await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
because node A performs tablet migrations at the the same time.
What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.
The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.
Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.
The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.
Fixes#25938Closesscylladb/scylladb#28632
(cherry picked from commit 0693091aff)
Closesscylladb/scylladb#28672
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.
(cherry picked from commit 6280cb91ca)
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.
The PR introduces two changes:
* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.
Fixes: #28012
Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.
- (cherry picked from commit aef5ff7491)
- (cherry picked from commit 079fe17e8b)
Parent PR: #28617Closesscylladb/scylladb#28642
* github.com:scylladb/scylladb:
vector_search: Fix missing timeout on TLS handshake
vector_search: test: Fix flaky cert rewrite test
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.
However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.
To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.
Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).
Fixes: scylladb/scylladb#28204
(cherry picked from commit 9ffa62a986)
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.
This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.
Fixes: #28012
(cherry picked from commit aef5ff7491)
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.
There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.
(cherry picked from commit f83f911bae)
The test had a few shortcomings that made it flaky or simply wrong:
1. We were verifying that hints were written by checking the size of
in-flight hints. However, that could potentially lead to problems
in rare situations.
For instance, if all of the hints failed to be written to disk, the
size of in-flight hints would drop to zero, but creating a sync point
would correspond to the empty state.
In such a situation, we should fail immediately and indicate what
the cause was.
2. A sync point corresponds to the hints that have already been written
to disk. The number of those is tracked by the metric `written`.
It's a much more reliable way to make sure that hints have been
written to the commitlog. That ensures that the sync point we'll
create will really correspond to those hints.
3. The auxiliary function `wait_for` used in the test works like this:
it executes the passed callback and looks at the result. If it's
`None`, it retries it. Otherwise, the callback is deemed to have
finished its execution and no further retries will be attempted.
Before this commit, we simply returned a bool, and so the code was
wrong. We improve it.
Note that this fixesscylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.
Refs scylladb/scylladb#25879Fixesscylladb/scylladb#28203
(cherry picked from commit a256ba7de0)
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.
(cherry picked from commit c5239edf2a)
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.
(cherry picked from commit ac4af5f461)
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.
(cherry picked from commit 628e74f157)
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.
The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).
The commit also contains another test, which verifies if direct
description of paxos state table is wrapped in comment.
Refs #28183.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9baaddb613)
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.
Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request
Fixes: CUSTOMER-96
Fixes: CUSTOMER-139
Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3
- (cherry picked from commit bd9d5ad75b)
- (cherry picked from commit 359d0b7a3e)
- (cherry picked from commit ce0c7b5896)
- (cherry picked from commit 5b3e513cba)
- (cherry picked from commit 66a33619da)
- (cherry picked from commit 6eb7dba352)
- (cherry picked from commit a05a4593a6)
- (cherry picked from commit 3a31380b2c)
- (cherry picked from commit 912c48a806)
Parent PR: #27891Closesscylladb/scylladb#28404
* https://github.com/scylladb/scylladb:
connection_factory: includes cleanup
dns_connection_factory: refine the move constructor
connection_factory: retry on failure
connection_factory: introduce TTL timer
connection_factory: get rid of shared_future in dns_connection_factory
connection_factory: extract connection logic into a member
connection_factory: remove unnecessary `else`
connection_factory: use all resolved DNS addresses
s3_test: remove client double-close
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.
Fixes: SCYLLADB-522
Closesscylladb/scylladb#28519
(cherry picked from commit 6b9fcc6ca3)
Closesscylladb/scylladb#28537
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
locator::natural_endpoints_tracker::natural_endpoints_tracker(
const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.
This bug basically made maintenance mode unusable in customer clusters.
We fix it by changing the node state to `normal`.
We also extend `test_maintenance_mode` to provide a reproducer for
Fixes#27988
This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.
- (cherry picked from commit a08c53ae4b)
- (cherry picked from commit 9d4a5ade08)
- (cherry picked from commit c92962ca45)
- (cherry picked from commit 408c6ea3ee)
- (cherry picked from commit 53f58b85b7)
- (cherry picked from commit 867a1ca346)
- (cherry picked from commit 6c547e1692)
- (cherry picked from commit 7e7b9977c5)
Parent PR: #28322Closesscylladb/scylladb#28498
* https://github.com/scylladb/scylladb:
test: test_maintenance_mode: enable maintenance mode properly
test: test_maintenance_mode: shutdown cluster connections
test: test_maintenance_mode: run with different keyspace options
test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
test: test_maintenance_mode: get rid of the conditional skip
test: test_maintenance_mode: remove the redundant value from the query result
storage_proxy: skip validate_read_replica in maintenance mode
storage_service: set up topology properly in maintenance mode
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call
(cherry picked from commit bd9d5ad75b)