53651 Commits

Author SHA1 Message Date
Aleksandra Martyniuk
101237d7cf tablets: fix hint re-adding tokens after full partition read decision
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone and/or
static row while subsequent mutations contain only clustered rows.

The tablet metadata change hint logic would correctly clear tokens (marking
a full partition read) upon seeing the tombstone in the first mutation,
but then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.

Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens.
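
The effect of the flag can be sketched as a small Python model. This is not the actual C++ code: `TableHint`, `apply_mutation`, and the boolean argument standing in for "tombstone or static row seen" are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableHint:
    """Hypothetical model of a per-table change hint."""
    tokens: set = field(default_factory=set)
    full_read: bool = False  # once set, the whole partition is re-read


def apply_mutation(hint, forces_full_read, row_tokens):
    """Fold one (possibly split) mutation into the hint."""
    if hint.full_read:
        # A full partition read was already decided: never re-add tokens.
        return
    if forces_full_read:
        hint.tokens.clear()
        hint.full_read = True
        return
    hint.tokens.update(row_tokens)


hint = TableHint()
apply_mutation(hint, True, [])         # first split mutation carries the tombstone
apply_mutation(hint, False, [10, 20])  # later row-only mutations are ignored
```

Without the `full_read` check at the top, the second call would repopulate `tokens` and turn the full read back into a point update.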

(cherry picked from commit d6c1707a04)
2026-05-29 15:32:00 +00:00
Yehuda Lebi
d4bbc144a4 fix: raise scylla-helper.slice CPUWeight from 10 to 100 to prevent node_exporter CPU starvation
Closes scylladb/scylladb#29839

(cherry picked from commit 6307e17795)

Closes scylladb/scylladb#29949
2026-05-29 13:55:49 +03:00
Botond Dénes
b617cd8c02 Merge '[Backport 2026.2] alternator: Graduate Alternator Streams from experimental' from Scylladb[bot]
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.

Fixes SCYLLADB-1680
Fixes #16367

The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.

This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.

- (cherry picked from commit 870013b437)
- (cherry picked from commit 9a86044c63)

Parent PR: #29604

Closes scylladb/scylladb#29817

* github.com:scylladb/scylladb:
  test: Stop providing alternator-streams experimental flag
  alternator: Graduate Alternator Streams from experimental
2026-05-29 13:54:54 +03:00
Patryk Jędrzejczak
6fa12c6419 Merge '[Backport 2026.2] storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Scylladb[bot]
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).

Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.

The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed: the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True`, so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.

Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.

Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.

Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked

Fixes: SCYLLADB-2163
Refs: scylladb/scylladb#23665

backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1

- (cherry picked from commit bc4dc13e94)
- (cherry picked from commit 324a08295d)
- (cherry picked from commit c88120abca)
- (cherry picked from commit fa01f74ae6)
- (cherry picked from commit 32002f6443)
- (cherry picked from commit a093be9ca9)
- (cherry picked from commit 5bc3e84d1e)
- (cherry picked from commit 2927f0dd21)

Parent PR: #29882

Closes scylladb/scylladb#30008

* https://github.com/scylladb/scylladb:
  storage_proxy: only cancel write handlers with pending remote targets during drain
  storage_service: cancel write handlers during drain to prevent shutdown deadlock
  test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
  test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
  test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
  test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
  test: scylla_cluster: allow stop() to bypass start_stop_lock
  error_injection: add non-shared mode to wait_for_message
  error_injection: release waiters when injection is disabled
2026-05-28 12:49:15 +02:00
Nadav Har'El
79492c9662 Merge '[Backport 2026.2] cql: rewrite CassIO SAI metadata index to regular secondary index' from Scylladb[bot]
CassIO (the library backing LangChain's `langchain_community.vectorstores.Cassandra` integration) issues the following DDL during schema setup to create a metadata index:

```sql
CREATE CUSTOM INDEX IF NOT EXISTS eidx_metadata_s_<table>
ON <keyspace>.<table> (ENTRIES(metadata_s))
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
```

ScyllaDB does not support Cassandra's StorageAttachedIndex (SAI) for non-vector columns and previously rejected this statement with:

```
StorageAttachedIndex (SAI) is only supported on vector columns; use a secondary index for non-vector columns
```

This blocks seamless migration of existing LangChain/CassIO applications from Cassandra to ScyllaDB — applications fail during initialization before any application-level workaround can run, even when metadata filtering is not used (`metadata_indexing="none"`).

CassIO is no longer actively maintained but remains the only official LangChain integration path for Apache Cassandra over CQL, meaning existing applications will continue using this setup pattern.

Instead of rejecting the CassIO metadata-map SAI DDL, detect the pattern and rewrite it to a standard ScyllaDB secondary index on collection entries:

- **Detection**: SAI class name + single `ENTRIES` target on a non-frozen `map` column
- **Rewrite**: Clear the custom class so the index is created through the standard secondary index path (which already fully supports indexing map entries)
- **Warning**: Emit a CQL warning informing the user that SAI is not supported by ScyllaDB, a regular secondary index was created instead, and metadata filtering behavior may differ from Cassandra SAI

The rewrite is placed early in `validate_while_executing()`, before the rf-rack-validity check, so the standard secondary index code path handles all subsequent validation naturally — no code duplication.
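
The detection rule can be sketched as a predicate. This is a simplified Python model rather than the actual C++ check: the function name, argument shapes, and exact string comparison of class names are assumptions, and the `cassio_compat_enabled` argument models the `enable_cassio_compatibility` flag added in the accompanying commit.

```python
# Class names mentioned in this description; matching rules are an assumption.
SAI_CLASS_NAMES = {
    "org.apache.cassandra.index.sai.StorageAttachedIndex",
    "sai",
}


def should_rewrite_to_secondary_index(custom_class, targets, column_type,
                                      is_frozen, cassio_compat_enabled):
    """Return True when the DDL matches the CassIO metadata-index pattern."""
    if not cassio_compat_enabled:
        return False
    if custom_class not in SAI_CLASS_NAMES:
        return False
    if len(targets) != 1:
        return False
    kind, _column = targets[0]
    return kind == "ENTRIES" and column_type == "map" and not is_frozen


cassio_ddl = should_rewrite_to_secondary_index(
    "sai", [("ENTRIES", "metadata_s")], "map", False, True)
scalar_sai = should_rewrite_to_secondary_index(
    "sai", [("VALUES", "name")], "text", False, True)
```

Here `cassio_ddl` is true and `scalar_sai` is false, which corresponds to SAI on regular scalar columns still being rejected.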

After this change, the CassIO schema setup succeeds on ScyllaDB:
- `CREATE CUSTOM INDEX ... USING 'sai'` on `ENTRIES(metadata_s)` creates a real secondary index
- The index is functional and can accelerate metadata filtering queries
- A CQL warning makes the rewrite transparent to operators
- SAI on non-vector, non-map-entries columns is still rejected as before
- Vector SAI indexes continue to be rewritten to `vector_index` as before

- `test_sai_entries_on_map_creates_regular_index` — verifies the index is created and the warning is emitted (fully-qualified SAI class name)
- `test_sai_entries_on_map_short_name` — same with the `'sai'` short alias
- `test_sai_on_regular_column_rejected` — confirms SAI on regular scalar columns is still rejected

All 148 tests in `test_vector_index.py` and `test_secondary_index.py` pass with no regressions (125 passed, 22 xfailed, 1 skipped).

Fixes: SCYLLADB-2234
Backport: 2026.2 as this is the version where the support for SAI class needed by LangChain was added.

- (cherry picked from commit 242eb96b16)
- (cherry picked from commit 5ee339b11d)

Parent PR: #29981

Closes scylladb/scylladb#30084

* github.com:scylladb/scylladb:
  cql: rewrite CassIO SAI metadata index to regular secondary index
  db/config: add enable_cassio_compatibility flag
2026-05-27 14:59:49 +03:00
Petr Gusev
e7d94d9927 storage_proxy: only cancel write handlers with pending remote targets during drain
The previous fix (cancel_all_write_response_handlers in do_drain)
was too aggressive — it killed all handlers including ones used by
group0 for raft commits. Since group0 is still running at that point
(before wait_for_group0_stop), this caused group0 operations to fail
(SCYLLADB-2168).

The actual problem is only with handlers that have pending remote
targets: after stop_transport() their MUTATION_DONE responses can
never arrive via messaging. Handlers whose only pending targets are
local can still complete via apply_locally and should be left alone.

Add cancel_nonlocal_write_response_handlers() which checks each
handler's remaining targets against the local host ID. Only handlers
with at least one remote pending target are cancelled. Use it in
do_drain instead of cancel_all_write_response_handlers. The latter
remains unchanged for drain_on_shutdown (final proxy shutdown where
all handlers must be killed).
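
The selection rule can be illustrated with a short Python sketch. It is a model only: the handler names, host IDs, and the `handlers_to_cancel` helper are hypothetical and do not mirror the C++ data structures.

```python
def handlers_to_cancel(pending_targets_by_handler, local_host_id):
    """Return IDs of handlers with at least one pending remote target."""
    return [
        handler_id
        for handler_id, pending in pending_targets_by_handler.items()
        if any(target != local_host_id for target in pending)
    ]


handlers = {
    "remote_write": {"host-a", "host-b"},  # still waits for host-b
    "local_only_write": {"host-a"},        # can finish locally
    "no_pending_targets": set(),
}
cancelled = handlers_to_cancel(handlers, local_host_id="host-a")
```

Only `remote_write` is selected; the handler whose sole pending target is the local host is left alone.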

Fixes: SCYLLADB-2168
(cherry picked from commit 2ff30ee6f0)
2026-05-27 13:35:30 +02:00
Patryk Jędrzejczak
ae88f7209d Merge '[Backport 2026.2] compaction_manager: unregister compaction module on early shutdown' from Scylladb[bot]
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.

Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.
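
The ordering problem can be modelled in a few lines of Python. The classes below are hypothetical stand-ins for `task_manager` and `compaction_manager`, reduced to the state transitions described above.

```python
class TaskManager:
    def __init__(self):
        self.modules = set()

    def stop(self):
        if self.modules:
            raise RuntimeError("modules still registered: %s" % sorted(self.modules))


class CompactionManager:
    def __init__(self, task_manager):
        self.task_manager = task_manager
        self.state = "none"
        task_manager.modules.add("compaction")  # registered in the constructor

    def enable(self):
        self.state = "running"                  # happens later during startup

    def do_stop(self):
        # Modelled fix: unregister regardless of state, including "none".
        self.task_manager.modules.discard("compaction")
        if self.state != "none":
            self.state = "stopped"              # stands in for the rest of the stop path


task_manager = TaskManager()
compaction_manager = CompactionManager(task_manager)
compaction_manager.do_stop()  # early shutdown: enable() was never called
task_manager.stop()           # no longer raises
```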

Fixes: SCYLLADB-2226

Backport to all supported branches, as the bug is there and it has
already caused a failure in 2026.1 CI.

- (cherry picked from commit 6cde390e21)
- (cherry picked from commit b7400d20dd)

Parent PR: #30015

Closes scylladb/scylladb#30082

* https://github.com/scylladb/scylladb:
  test: add test_stop_before_starting_compaction_manager
  compaction_manager: unregister compaction module on early shutdown
2026-05-26 10:06:51 +02:00
Szymon Wasik
c30c9f3a82 cql: rewrite CassIO SAI metadata index to regular secondary index
When CassIO creates a SAI ENTRIES index on a map column,
ScyllaDB now rewrites it to a regular secondary index and emits
a CQL warning. This allows LangChain/CassIO applications to work
without DDL errors.

The rewrite is gated behind the enable_cassio_compatibility flag
(disabled by default).

Refs: SCYLLADB-2113
(cherry picked from commit 5ee339b11d)
2026-05-25 23:37:32 +00:00
Szymon Wasik
4251357118 db/config: add enable_cassio_compatibility flag
Add a new live-updatable boolean configuration option
'enable_cassio_compatibility' (default: false).

When enabled, it allows ScyllaDB to rewrite CassIO's SAI index DDL
on map entries to a regular secondary index, so that LangChain/CassIO
applications can run without DDL errors.

The flag is disabled by default to avoid affecting users who don't
need CassIO compatibility.

(cherry picked from commit 242eb96b16)
2026-05-25 23:37:31 +00:00
Patryk Jędrzejczak
d1e7eaad13 test: add test_stop_before_starting_compaction_manager
(cherry picked from commit b7400d20dd)
2026-05-25 17:59:17 +00:00
Patryk Jędrzejczak
bbde5a64b2 compaction_manager: unregister compaction module on early shutdown
The compaction module is registered with task_manager in the compaction_manager
constructor, and unregistered in compaction_manager::really_do_stop(), which
was gated behind `_state != state::none` in compaction_manager::do_stop().
Since enable() -- which transitions _state from none to running -- is called
later during startup (from database::start() or the disk space monitor callback)
than the compaction_manager constructor, an early shutdown could leave the
compaction module registered after compaction_manager::do_stop() returned.
task_manager::stop() then aborted with 'Tried to stop task manager while
some modules were not unregistered'.

Fix compaction_manager::do_stop() to call _task_manager_module->stop() even
when `_state == state::none`, so that the compaction module is always properly
unregistered.

Fixes: SCYLLADB-2226
(cherry picked from commit 6cde390e21)
2026-05-25 17:59:16 +00:00
Gleb Natapov
81dc11557c storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution
Otherwise if a table is dropped in the middle of a scan the object may
disappear.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137

Closes scylladb/scylladb#29988

(cherry picked from commit 0bf050d175)

Closes scylladb/scylladb#30067
2026-05-25 12:48:42 +02:00
Avi Kivity
0606b71893 Merge '[Backport 2026.2] cql: request-side custom payload parsing' from Scylladb[bot]
When a CQL client sends a request with the `CUSTOM_PAYLOAD` flag (`0x04`) set, the frame body starts with a [bytes map] before the message. Scylla never implemented parsing of this map on the request side. This caused it to fail parsing with protocol errors such as `"truncated frame: expected 65546 bytes"`.

This was discovered through DataStax Java Driver 4.19.x tests that attach a `request-id` to queries via custom payload. The same issue affects any CQL client that sets the `CUSTOM_PAYLOAD` flag.

Fix this by skipping over the custom payload [bytes map] from the frame body before dispatching to opcode-specific handlers. The payload contents are discarded since Scylla has no pluggable `QueryHandler`. Cassandra's default `QueryHandler` also discards them.
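
For illustration, skipping a [bytes map] can be sketched in Python. The layout follows the CQL native protocol notation (a [short] entry count, then per entry a [string] key and a [bytes] value whose negative length means null); the function name and the sample frame are made up for the example, and this is not Scylla's implementation.

```python
import struct


def skip_bytes_map(body: bytes, offset: int = 0) -> int:
    """Return the offset just past a CQL [bytes map] starting at `offset`."""
    (count,) = struct.unpack_from(">H", body, offset)  # [short] number of entries
    offset += 2
    for _ in range(count):
        (key_len,) = struct.unpack_from(">H", body, offset)  # [string]: [short] length
        offset += 2 + key_len
        (value_len,) = struct.unpack_from(">i", body, offset)  # [bytes]: [int] length
        offset += 4
        if value_len > 0:  # a negative length encodes null, so no bytes follow
            offset += value_len
    if offset > len(body):
        raise ValueError("truncated frame")
    return offset


payload = (struct.pack(">H", 1)
           + struct.pack(">H", 10) + b"request-id"
           + struct.pack(">i", 3) + b"abc")
frame_body = payload + b"QUERY..."
message_start = skip_bytes_map(frame_body)
```

`frame_body[message_start:]` is the part that would be handed to the opcode-specific handler.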

Fixes SCYLLADB-745

Reported on 2026.2, backport.

- (cherry picked from commit 8e6d2d0631)
- (cherry picked from commit f9e8518776)

Parent PR: #30005

Closes scylladb/scylladb#30026

* github.com:scylladb/scylladb:
  cql: fix request-side custom payload parsing
  test/cqlpy: add tests for request-side custom payload handling
2026-05-24 22:11:41 +03:00
Łukasz Paszkowski
d66b65d6d4 tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
   topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
   registered on the CV. The loop kept spinning, holding the HTTP
   server's task gate open and preventing http_server::stop() from
   completing. After ~15 minutes, systemd killed the process with
   SIGABRT.

The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.
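
The register-then-check pattern can be sketched with Python's asyncio. This is an analogy, not the Seastar code: `BroadcastEvent` is a hypothetical minimal condition variable, and `wait_until_finished` stands in for the wait loop.

```python
import asyncio


class BroadcastEvent:
    """Hypothetical broadcast-only condition variable."""

    def __init__(self):
        self._waiters = []
        self._broken = False

    def wait(self):
        # Registration is synchronous: the waiter exists before the caller yields.
        waiter = asyncio.get_running_loop().create_future()
        if self._broken:
            waiter.set_exception(RuntimeError("broken condition variable"))
        else:
            self._waiters.append(waiter)
        return waiter

    def _release(self, exc=None):
        waiters, self._waiters = self._waiters, []
        for waiter in waiters:
            if waiter.done():
                continue
            if exc is None:
                waiter.set_result(None)
            else:
                waiter.set_exception(exc)

    def broadcast(self):
        self._release()

    def broken(self):
        self._broken = True
        self._release(RuntimeError("broken condition variable"))


async def wait_until_finished(event, task_finished):
    while True:
        waiter = event.wait()      # register before the async check
        if await task_finished():  # may yield; broadcasts still reach `waiter`
            waiter.cancel()
            return
        await waiter               # really suspends; raises if the event is broken


async def demo():
    event = BroadcastEvent()

    async def never_finished():
        await asyncio.sleep(0)
        return False

    waiting = asyncio.create_task(wait_until_finished(event, never_finished))
    await asyncio.sleep(0.01)  # let the loop block on its waiter
    event.broken()             # shutdown path
    try:
        await asyncio.wait_for(waiting, timeout=1)
    except RuntimeError as exc:
        return str(exc)
    return "not interrupted"


outcome = asyncio.run(demo())
```

Because a waiter is always registered while the loop is idle, breaking the event interrupts the loop instead of leaving it spinning.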

Fixes: SCYLLADB-2181

Closes scylladb/scylladb#29485

(cherry picked from commit 96a992002c)

Closes scylladb/scylladb#30042
2026-05-24 22:10:47 +03:00
Gleb Natapov
74a057bee0 schema: ensure committed_by_group0 is set for all non-system tables on boot
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.

Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):

1. Queries system_schema.scylla_tables for rows where committed_by_group0
   is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard.
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
   cell (from the in-memory schema). The committed_by_group0 = true flag
   is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.

On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.

This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.

Closes scylladb/scylladb#29911

(cherry picked from commit cc034f84c5)

Closes scylladb/scylladb#30000
2026-05-24 22:10:19 +03:00
Piotr Dulikowski
2d33cbd644 Merge '[Backport 2026.2] db/view/view_building_coordinator: add flag to mark if any remote work was finished' from Michał Jadwiszczak
There is a small window just after the view building coordinator releases
the group0 guard and before it waits on view_building_state_machine's CV,
during which the coordinator may miss a CV broadcast triggered by finished
remote work.

To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV and is checked before awaiting on the CV.
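
The flag-before-broadcast idea can be sketched with Python's asyncio. The `Coordinator` class and its method names are hypothetical and only model the wait logic, not the real coordinator.

```python
import asyncio


class Coordinator:
    """Hypothetical stand-in for the coordinator's wait logic."""

    def __init__(self):
        self.cv = asyncio.Condition()
        self.remote_work_finished = False

    async def on_remote_work_done(self):
        async with self.cv:
            self.remote_work_finished = True  # set the flag before broadcasting
            self.cv.notify_all()

    async def wait_for_work(self):
        async with self.cv:
            # The flag is checked first, so an earlier broadcast is not lost.
            await self.cv.wait_for(lambda: self.remote_work_finished)
            self.remote_work_finished = False


async def demo():
    coordinator = Coordinator()
    await coordinator.on_remote_work_done()  # broadcast while nobody is waiting
    await asyncio.wait_for(coordinator.wait_for_work(), timeout=1)
    return True


result = asyncio.run(demo())
```

Without the flag, the broadcast in `demo()` would have no waiter to wake and the later wait would block until the next notification.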

Fixes SCYLLADB-2029

The problem is not critical, but the fix should be backported to 2025.4 and newer versions, all of which contain the view building coordinator.

(cherry picked from commit c7f65131bf)
(cherry picked from commit c767ac7ef3)

Parent PR: https://github.com/scylladb/scylladb/pull/27313

Closes scylladb/scylladb#30029

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer
  db/view/view_building_coordinator: add flag to mark if any remote work was finished
2026-05-23 10:22:37 +02:00
Dario Mirovic
1af677dff5 cql: fix request-side custom payload parsing
When a CQL client sends a request with the CUSTOM_PAYLOAD flag (0x04)
set, the frame body starts with a [bytes map] before the message.
Scylla never implemented parsing of this map on the request side.
This caused it to fail parsing with protocol errors such as
"truncated frame: expected 65546 bytes".

Fix this by skipping over the custom payload [bytes map] from the frame
body before dispatching to opcode-specific handlers. The payload
contents are discarded since Scylla has no pluggable QueryHandler.
Cassandra's default QueryHandler also discards them.

Fixes SCYLLADB-745

(cherry picked from commit f9e8518776)
2026-05-22 18:17:34 +02:00
Avi Kivity
b259295366 Update seastar submodule (coroutine generator fix)
* seastar 74f19b81ca...5462d881a6 (1):
  > coroutine/generator: fix setup of generator's waiting task

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798

Ref 6df04c9e5b
2026-05-22 18:01:44 +03:00
Michał Jadwiszczak
2a3904ec43 test/cluster/test_view_building_coordinator: add reproducer
Add a test that reproduces scylladb/scylladb#27298

(cherry picked from commit c767ac7ef3)
2026-05-22 15:59:27 +02:00
Michał Jadwiszczak
b3e345d680 db/view/view_building_coordinator: add flag to mark if any remote work was finished
In the main coordinator loop (`view_building_coordinator::run()`),
there is a small window just after the view building coordinator releases
the group0 guard and before it waits on view_building_state_machine's CV,
during which the coordinator may miss a CV broadcast triggered by finished
remote work (`view_building_coordinator::work_on_tasks()`).

To fix it, this patch adds a boolean flag, which is set to true before
broadcasting the CV by finished/failed RPC call
and is checked before awaiting on the CV.

Fixes scylladb/scylladb#27298

(cherry picked from commit c7f65131bf)
2026-05-22 15:58:45 +02:00
Dario Mirovic
829b04fb21 test/cqlpy: add tests for request-side custom payload handling
Add tests that verify Scylla's handling of CQL native protocol
requests with the CUSTOM_PAYLOAD flag (0x04) set. Each test asserts
the specific parse error that the unfixed server produces.

A separate CQL session is used for each test. The protocol error
kills the driver connection, and we need to catch it properly.

Refs SCYLLADB-745

(cherry picked from commit 8e6d2d0631)
2026-05-22 13:46:02 +00:00
Michael Litvak
846ff3ce7f test: wait for others_not_see_server before exclude
Between stopping a server and excluding it, wait for other nodes to see
the server as down, otherwise exclude may see the server as alive and
fail.

Fixes SCYLLADB-2110

Closes scylladb/scylladb#29966

(cherry picked from commit eecbead541)

Closes scylladb/scylladb#29975
2026-05-22 15:11:25 +03:00
Szymon Malewski
17a61e0015 vector_search: fix decimal/varint precision loss in filter value_to_json()
value_to_json() converts CQL values to JSON for vector search filters.
For decimal and varint types, it used rjson::parse() on the JSON string,
which parses through a double and silently loses precision for values
exceeding ~15 significant digits — producing wrong filter results.

Additionally, for decimal type we need an exact string representation
that preserves the original (unscaled, scale) pair, because partition
keys use byte-level identity: different serialized representations of
the same numeric value are distinct rows, so the filter must reproduce
the exact representation stored in the key.

Add big_decimal::to_string_canonical() which follows the Java BigDecimal
toString() spec (JDK 8+), producing a bijective string representation
that uses exponential notation for extreme scales instead of expanding
trailing zeros (which could cause OOM). This could replace to_string(),
but doing so has wider consequences (e.g. hash/equality contract for
decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for
decimal_type, and use rjson::from_string() for varint_type, both
bypassing the lossy double parse path.
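
Both effects are easy to reproduce outside Scylla. The Python snippet below is only an analogy: Python's `json` module parses numbers through a double by default, similar to the lossy path described above, and `Decimal` keeps the (unscaled, scale) pair visible.

```python
import json
from decimal import Decimal

text = "12345678901234567890.123456789"

via_double = json.loads(text)                   # parsed through a double
exact = json.loads(text, parse_float=Decimal)   # parsed without a double

# Same numeric value, different (unscaled, scale) representation.
one_scale_1 = Decimal("1.0")
one_scale_2 = Decimal("1.00")
same_value = one_scale_1 == one_scale_2
same_representation = str(one_scale_1) == str(one_scale_2)
```

`exact` round-trips to the original text while `via_double` does not, and the two decimals compare equal numerically even though their string forms differ, which is why a filter has to reproduce the stored representation exactly.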

Tests cover the new to_string_canonical() and the filter fix, as well as
existing decimal type behavior (key representation, clustering order,
toJson) that we rely on and must not break. The CQL decimal type tests
(test_type_decimal.py) also pass against Cassandra.

Fixes: SCYLLADB-2107
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574

Closes scylladb/scylladb#29505

(cherry picked from commit 15493872b2)

Closes scylladb/scylladb#29957
2026-05-22 15:09:54 +03:00
Botond Dénes
f3245c933d Merge '[Backport 2026.2] load_balancer: apply balance threshold to intranode shard balancing' from Scylladb[bot]
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing

The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).

The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.

Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.

The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
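
The two scenarios can be checked with a few lines of arithmetic. The formula below, (max - min) / max, is an assumption chosen because it reproduces the percentages quoted above; the real `is_balanced()` may compute the relative difference differently, and the function names here are hypothetical.

```python
def relative_load_diff(max_load, min_load):
    """Relative difference between the most and least loaded shard."""
    return (max_load - min_load) / max_load


def within_threshold(max_load, min_load, threshold=0.01):
    """True when the imbalance is small enough to skip migrations."""
    return relative_load_diff(max_load, min_load) <= threshold


balanced_diff = relative_load_diff(257, 255)    # about 0.0078
unbalanced_diff = relative_load_diff(307, 205)  # about 0.33
```

With equally sized tablets, tablet counts stand in for shard load, so `within_threshold(257, 255)` holds and `within_threshold(307, 205)` does not.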

Fixes: SCYLLADB-2006

This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2

- (cherry picked from commit aaead10e5d)
- (cherry picked from commit 6856f51097)

Parent PR: #29756

Closes scylladb/scylladb#29895

* github.com:scylladb/scylladb:
  test: add test for intranode balance threshold in size-based mode
  tablet_allocator: apply balance threshold to intranode shard balancing
2026-05-22 15:07:34 +03:00
Petr Gusev
a2b2a42936 storage_service: cancel write handlers during drain to prevent shutdown deadlock
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.

If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
  (the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
  deadlock (the barrier_and_drain RPC handler holds the gate open).

Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().

Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.

Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.

Fixes: SCYLLADB-2163
Refs: scylladb/scylladb#23665
(cherry picked from commit 2927f0dd21)
2026-05-21 18:58:06 +00:00
Petr Gusev
1268ab6f92 test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
The existing test only covers the case where the shutting-down node is
NOT the topology coordinator (deadlocks in uninit_messaging_service).
When the node IS the coordinator, the deadlock manifests differently:
the topology coordinator fiber calls barrier_and_drain on itself
(without messaging), and do_drain -> wait_for_group0_stop blocks
because the coordinator can't stop while stale_versions_in_use is
waiting on the uncancelled write handler.

Run the test twice on the same 2-node cluster (RF=2):
- Run 1: target is a non-coordinator
- Restore cluster state (restart target, decommission added node)
- Run 2: target is the topology coordinator

Use CL=ONE so the write completes from the local replica even with
the other server's response paused.

Mark as xfail since this reproduces bugs not yet fixed on this branch.

Refs: SCYLLADB-1842
(cherry picked from commit 5bc3e84d1e)
2026-05-21 18:58:06 +00:00
Petr Gusev
b147ab4418 test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
The test was written for another case, and was not supposed to
reproduce the issue that was fixed in this PR.

Fix the test to reproduce the real scenario:

1. Use one_shot=False for pause_before_barrier_and_drain so the
   injection fires on every barrier_and_drain RPC, not just the first.

2. Let the first barrier_and_drain through (at this point the write
   handler's ERM version matches the current token_metadata version).

3. Wait for the second barrier_and_drain. Between the two calls,
   topology_state_load installs a new token_metadata version. The
   write handler still holds the old version's ERM — now stale.

4. After stop_transport completes, disable the injection (rather than
   sending a single message) to release the paused handler and any
   subsequent ones that arrived during stop_transport. The 'disabled'
   flag in injection_shared_data ensures all waiters wake up.

With these changes the test reliably fails (shutdown deadlock within
15s) on the unfixed code and passes on the fixed version from
e0dc73f52a ('Cancel all write requests on storage_proxy shutdown').

Refs: scylladb/scylladb#23665
(cherry picked from commit a093be9ca9)
2026-05-21 18:58:05 +00:00
Petr Gusev
8489493323 test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
asyncio cancel() only affects the client-side coroutine. The
server-side addserver handler in the cluster manager continues
running. If it can't complete (e.g. no raft quorum because the
target node is shut down), the orphaned handler blocks _after_test
cleanup for 120s.

Await the task instead so it completes cleanly (we restart the
target node first to restore quorum).

(cherry picked from commit 32002f6443)
2026-05-21 18:58:05 +00:00
Petr Gusev
8addbed0dc test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
Add a 15s timeout around the shutdown_task await. If the timeout
fires, the deadlock is reproduced (shutdown hung because
stale_versions_in_use blocks on a write handler holding a stale
token_metadata version).

When the timeout fires, explicitly kill the node via
server_stop() so that the manager's _after_test handler does not
wait 120s for the stuck stop_gracefully request. Then fail the
test with a clear message.

(cherry picked from commit fa01f74ae6)
2026-05-21 18:58:04 +00:00
Petr Gusev
de55e0472a test: scylla_cluster: allow stop() to bypass start_stop_lock
Remove the @stop_event and @start_stop_lock decorators from
ScyllaServer.stop() so it can SIGKILL a server even while
stop_gracefully() holds the lock (e.g. the node is deadlocked
during shutdown and stop_gracefully is blocked on cmd.wait()).

A local copy of self.cmd is used because there are await points
after which another coroutine (stop_gracefully) may set self.cmd
to None. The concurrent stop_gracefully() unblocks once the
process dies from SIGKILL since its cmd.wait() returns.
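
A rough sketch of why the local copy matters, using hypothetical FakeProcess and FakeServer classes rather than the real ScyllaServer code:

```python
import asyncio


class FakeProcess:
    """Hypothetical stand-in for the server's subprocess handle."""

    def __init__(self) -> None:
        self.killed = False
        self._exited = asyncio.Event()

    def kill(self) -> None:
        self.killed = True
        self._exited.set()

    async def wait(self) -> None:
        await self._exited.wait()


class FakeServer:
    def __init__(self) -> None:
        self.cmd = FakeProcess()

    async def stop_gracefully(self) -> None:
        await self.cmd.wait()  # blocks while the node is deadlocked
        self.cmd = None

    async def stop(self) -> None:
        cmd = self.cmd  # local copy taken before any await point
        if cmd is None:
            return
        cmd.kill()
        # Await point: another coroutine may set self.cmd to None here.
        await asyncio.sleep(0)
        # Using the local copy is safe even if self.cmd is already None.
        await cmd.wait()


async def demo():
    server = FakeServer()
    proc = server.cmd
    graceful = asyncio.create_task(server.stop_gracefully())
    await asyncio.sleep(0)  # let stop_gracefully block in wait()
    await server.stop()
    await graceful
    return proc.killed, server.cmd
```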

Also make shutdown_control_connection a plain (non-async) function
since it contains no await points — this makes it obvious that no
coroutine interleaving is possible inside it.

(cherry picked from commit c88120abca)
2026-05-21 18:58:04 +00:00
Petr Gusev
09aca68f71 error_injection: add non-shared mode to wait_for_message
Add a 'share' parameter to wait_for_message (default true, preserving
existing behavior). When share=false, each handler invocation requires
its own dedicated message to proceed — a message consumed by one
handler is not visible to others.

Use share=false for the pause_before_barrier_and_drain injection in
raft_topology_cmd_handler. The topology coordinator sends multiple
barrier_and_drain RPCs during a single topology transition (one per
state change). With share=true a single message_injection call
releases all handlers. With share=false the test can release them
one at a time, controlling exactly which topology state the write
handler's ERM captures.
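
The two modes can be modelled roughly as below. This is a toy asyncio model, not the C++ error_injection implementation; the Injection class and its counters are assumptions made for illustration.

```python
import asyncio


class Injection:
    """Toy model of a message-based injection point."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._sent = 0      # messages delivered so far
        self._consumed = 0  # messages taken by non-shared waiters

    async def send_message(self) -> None:
        async with self._cond:
            self._sent += 1
            self._cond.notify_all()

    async def wait_for_message(self, share: bool = True) -> None:
        async with self._cond:
            if share:
                # Any message sent after this point releases the waiter.
                seen = self._sent
                await self._cond.wait_for(lambda: self._sent > seen)
            else:
                # Each waiter consumes one message of its own.
                await self._cond.wait_for(
                    lambda: self._sent > self._consumed)
                self._consumed += 1


async def released_by_one_message(share: bool, handlers: int = 3) -> int:
    inj = Injection()
    released = []

    async def handler(i: int) -> None:
        await inj.wait_for_message(share=share)
        released.append(i)

    tasks = [asyncio.create_task(handler(i)) for i in range(handlers)]
    await asyncio.sleep(0.01)  # let all handlers block
    await inj.send_message()
    await asyncio.sleep(0.01)  # let released handlers run
    count = len(released)
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return count
```

In this model one message releases all three handlers when share is true and exactly one handler when it is false.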

(cherry picked from commit 324a08295d)
2026-05-21 18:58:04 +00:00
Petr Gusev
3655879f48 error_injection: release waiters when injection is disabled
When an error injection is disabled (via disable() or disable_all()),
any handlers currently suspended in wait_for_message() must be woken up
so they can proceed instead of hanging until timeout.

Add a 'disabled' flag to injection_shared_data. When disable() or
disable_all() is called, set the flag and broadcast the condition
variable. The wait_for_message() predicate checks the flag and returns
true immediately, letting the handler continue.

This makes disable() atomic with respect to releasing waiters: it both
wakes up blocked handlers and removes the injection from the enabled
map in one call. This avoids races that would occur with separate
message_injection() + disable() calls — message_injection() after
disable() fails because the injection is already gone, and
disable() after message_injection() risks a new handler hitting the
injection between the two calls.
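
A minimal sketch of the idea in asyncio terms. InjectionRegistry and its dict-based shared data are hypothetical stand-ins for the C++ structures named above; the point is only that disable() sets the flag, wakes waiters, and removes the entry in one step.

```python
import asyncio


class InjectionRegistry:
    """Toy model of an injection registry; not the real API."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._enabled = {}  # injection name -> shared data

    def enable(self, name: str) -> None:
        self._enabled[name] = {"disabled": False, "messages": 0}

    async def message(self, name: str) -> None:
        async with self._cond:
            # Raises KeyError once the injection has been disabled.
            self._enabled[name]["messages"] += 1
            self._cond.notify_all()

    async def disable(self, name: str) -> None:
        async with self._cond:
            data = self._enabled.pop(name, None)
            if data is not None:
                data["disabled"] = True
                self._cond.notify_all()

    async def wait_for_message(self, name: str) -> str:
        data = self._enabled.get(name)
        if data is None:
            return "not enabled"  # new handlers pass straight through
        async with self._cond:
            await self._cond.wait_for(
                lambda: data["disabled"] or data["messages"] > 0)
            if data["disabled"]:
                return "released by disable"
            return "released by message"


async def demo():
    name = "pause_before_barrier_and_drain"
    reg = InjectionRegistry()
    reg.enable(name)
    waiters = [asyncio.create_task(reg.wait_for_message(name))
               for _ in range(2)]
    await asyncio.sleep(0.01)  # let both handlers block
    await reg.disable(name)
    results = await asyncio.gather(*waiters)
    late = await reg.wait_for_message(name)  # handler arriving afterwards
    return results, late
```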

Concrete example: test_unfinished_writes_during_shutdown pauses
barrier_and_drain RPC handlers via wait_for_message. During shutdown,
the test calls disable_injection() to simultaneously release the
paused handler and prevent any new barrier_and_drain RPCs from
getting stuck.

(cherry picked from commit bc4dc13e94)
2026-05-21 18:58:03 +00:00
Patryk Jędrzejczak
d5a27e1cf1 test: fix flaky test_raft_snapshot_truncation by waiting for async log truncation
Snapshot creation and raft log truncation happen asynchronously in the
IO fiber after a schema change completes. The test was querying
system.raft immediately after the schema change returned, racing with
the IO fiber's store_snapshot_descriptor call.

Replace immediate assertions with wait_for polling loops:
- log_size == 0: wait for log truncation after drop keyspace
- new_snap_id != original_snap_id: wait for new snapshot to be persisted
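
A polling loop of this kind can be sketched as follows. poll_until is a hypothetical helper written for illustration and may differ from the wait_for helper the test actually uses; the fake log sizes stand in for queries against system.raft.

```python
import asyncio
import time


async def poll_until(pred, deadline: float, period: float = 0.01):
    """Call `pred` until it returns a non-None value or the deadline passes."""
    while True:
        value = await pred()
        if value is not None:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before deadline")
        await asyncio.sleep(period)


async def demo():
    # Simulated raft log sizes observed on successive queries.
    sizes = iter([5, 2, 0])

    async def log_truncated():
        return True if next(sizes) == 0 else None

    return await poll_until(log_truncated, time.monotonic() + 1.0)
```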

Fixes: SCYLLADB-2157

Closes scylladb/scylladb#29967

(cherry picked from commit cbadc3d675)

Closes scylladb/scylladb#29999
2026-05-21 16:06:24 +02:00
Avi Kivity
0afe3dcfd5 Update seastar submodule (default DMA alignment)
* seastar 4d268e0ef5...74f19b81ca (1):
  > file: fix default DMA alignment

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043

Ref 6df04c9e5b
2026-05-20 18:57:19 +03:00
Jenkins Promoter
c4c38aeeda Update pgo profiles - aarch64 2026-05-20 15:41:19 +03:00
Jenkins Promoter
11c7df5510 Update pgo profiles - x86_64 2026-05-20 14:35:23 +03:00
Jenkins Promoter
bd3335221e Update ScyllaDB version to: 2026.2.0-rc3 2026-05-19 23:20:28 +03:00
Patryk Jędrzejczak
6466cded43 Merge '[Backport 2026.2] cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce' from Scylladb[bot]
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041

backport: need to backport to all versions with LWT over tablets

- (cherry picked from commit 167a3c9c50)
- (cherry picked from commit 8a76ec7e65)
- (cherry picked from commit 738b7b4a86)
- (cherry picked from commit 9e3209e4a3)

Parent PR: #29910

Closes scylladb/scylladb#29948

* https://github.com/scylladb/scylladb:
  cql: refactor add_tablet_info to take tablet_routing_info directly
  cql: fix UB dereference of nullopt tablet_info in execute_with_condition
  test/boost: add regression test for missing tablet routing after CAS bounce
  cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
2026-05-19 10:24:49 +02:00
Ernest Zaslavsky
8173885012 test: fix use-after-free in start_docker_service retry path
start_docker_service is a coroutine that took docker_args and
image_args by const reference. Its caller start_fake_gcs_server
is a regular function that passes temporaries (initializer lists)
and immediately returns a future. The temporaries are destroyed
when the caller returns, leaving the coroutine holding dangling
references.

On the first loop iteration this works by luck (memory not yet
reused), but on retry (after "address already in use") the
params.append_range(image_args) reads freed memory, causing
use-after-free that manifests as std::bad_alloc or broken_promise
in non-sanitizer builds.

Fix by taking docker_args and image_args by value so the coroutine
frame owns the vectors for its entire lifetime.

Fixes: SCYLLADB-2081

Closes scylladb/scylladb#29932

(cherry picked from commit 834eed10d9)

Closes scylladb/scylladb#29943
2026-05-18 19:24:55 +03:00
Petr Gusev
1c3fde8abb cql: refactor add_tablet_info to take tablet_routing_info directly
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.

Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.

(cherry picked from commit 9e3209e4a3)
2026-05-18 13:39:09 +00:00
Petr Gusev
b8b08ea89f cql: fix UB dereference of nullopt tablet_info in execute_with_condition
When check_locality() returns nullopt (correctly routed LWT), the
optional tablet_info was unconditionally dereferenced in the lambda
capture list: tablet_info->tablet_replicas, tablet_info->token_range.

The code previously masked this by initializing tablet_info with an
empty-but-present value, so the dereference happened to work but
only because the empty tablet_replicas made add_tablet_info() a no-op.
After check_locality() overwrites it with nullopt, the dereference
is UB.

Fix by initializing tablet_info as empty (nullopt) and guarding the
dereference.

(cherry picked from commit 738b7b4a86)
2026-05-18 13:39:09 +00:00
Petr Gusev
377cb5bc53 test/boost: add regression test for missing tablet routing after CAS bounce
Add test_tablet_routing_info_after_cas_shard_bounce that verifies
TABLETS_ROUTING_V1 payload is returned after an internal CAS shard
bounce.

The test simulates the transport-layer bounce: it creates a table whose
single tablet replica lands on a shard different from the test thread,
executes an LWT (which bounces), then transfers client_state via
client_state_for_another_shard (preserving _original_shard) and
re-executes on the tablet shard. The test asserts that check_locality()
correctly detects the misrouting and returns tablet routing info.

Refs SCYLLADB-2041

(cherry picked from commit 8a76ec7e65)
2026-05-18 13:39:08 +00:00
Petr Gusev
c66f8fa6fd cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.
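
The effect of the fix can be illustrated with a simplified model. RoutingInfo and this check_locality signature are assumptions for the sketch, not the C++ interfaces; shard numbers are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingInfo:
    tablet_shard: int


def check_locality(tablet_shard: int,
                   request_shard: int) -> Optional[RoutingInfo]:
    # None means "correctly routed": no routing payload is attached.
    if request_shard == tablet_shard:
        return None
    return RoutingInfo(tablet_shard)


# The client sent the LWT to shard 0, the tablet lives on shard 2,
# and the request was bounced internally to shard 2.
original_shard, tablet_shard = 0, 2
current_shard = tablet_shard  # shard executing after the bounce

# Before the fix: comparing against the post-bounce shard hides the misroute.
before = check_locality(tablet_shard, current_shard)

# After the fix: comparing against the original entry shard exposes it.
after = check_locality(tablet_shard, original_shard)
```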

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041

(cherry picked from commit 167a3c9c50)
2026-05-18 13:39:08 +00:00
Jenkins Promoter
172cd07bb7 Update ScyllaDB version to: 2026.2.0-rc2 scylla-2026.2.0-rc2-candidate-20260518093053 scylla-2026.2.0-rc2 2026-05-18 14:22:29 +03:00
Andrzej Jackowski
fb33abaee6 test: storage: retry fusermount3 unmount on teardown
After stopping scylla server processes, the FUSE daemon
(fuse2fs) may still be processing file handle closures.
An immediate fusermount3 -u can fail with 'device busy',
causing spurious test failures on teardown.

Retry the unmount up to 10 times with 0.5s delay between
attempts, and capture stderr for diagnostics.
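
A minimal sketch of such a retry loop. unmount_with_retry and its injectable run and sleep parameters are hypothetical and exist only to keep the sketch testable; the actual teardown code may be structured differently.

```python
import subprocess
import time


def unmount_with_retry(mountpoint: str, attempts: int = 10,
                       delay: float = 0.5,
                       run=subprocess.run, sleep=time.sleep) -> int:
    """Retry `fusermount3 -u`; return the attempt number that succeeded."""
    last_stderr = ""
    for attempt in range(1, attempts + 1):
        result = run(["fusermount3", "-u", mountpoint],
                     capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        last_stderr = result.stderr  # kept for diagnostics
        if attempt < attempts:
            sleep(delay)
    raise RuntimeError(
        f"unmount of {mountpoint} failed after {attempts} attempts: "
        f"{last_stderr}")
```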

Fixes: SCYLLADB-2066

Closes scylladb/scylladb#29920

(cherry picked from commit 61e5ec9888)

Closes scylladb/scylladb#29930
2026-05-18 12:25:22 +03:00
Avi Kivity
c1c0c96643 Merge '[Backport 2026.2] QOS: self-heal stale V1-to-V2 migration state on upgrade' from Scylladb[bot]
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.

Fixes: SCYLLADB-2038

backport: 2026.2, 2026.1; this functionality is needed when upgrading older servers

- (cherry picked from commit ac0a19aab8)
- (cherry picked from commit c2014f7e50)
- (cherry picked from commit 6188bf3e01)

Parent PR: #29749

Closes scylladb/scylladb#29905

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test When skip_service_levels_v2_initialization is used, write an explicit v1 service level version marker while skipping v2 initialization. This lets the restart test exercise self-healing from v1 to v2.
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-17 19:33:58 +03:00
Avi Kivity
b924f66425 Merge '[Backport 2026.2] test: limits: optimize test_max_cells to avoid large allocations and fragmentation' from Scylladb[bot]
The `test_max_cells` test was flaky due to `std::bad_alloc` caused by Seastar buddy allocator fragmentation. The root causes are:
1. The doubling loop with 24 iterations of CREATE/INSERT/DROP fragmented the allocator
2. The test built the whole batch as a single string that takes contiguous memory

Some iterations also inserted zero rows but still ran CREATE/DROP TABLE, which added to the fragmentation.

This patch series:
- Skips iterations that insert zero rows
- Creates the table once, truncates it after each test iteration
- Switches to prepared statements

Investigation results are presented in detail in https://scylladb.atlassian.net/browse/SCYLLADB-1645

Fixes SCYLLADB-1645

CI stability improvement. Backport to versions that have this test.

- (cherry picked from commit 3debae9a37)
- (cherry picked from commit 0fd6f6f292)
- (cherry picked from commit 0574055b73)

Parent PR: #29759

Closes scylladb/scylladb#29926

* github.com:scylladb/scylladb:
  test: prepare max cells inserts
  test: reuse max cells schema
  test: limits: skip empty max cells iterations
2026-05-17 19:31:32 +03:00
Botond Dénes
c1a5fea937 docs: expand OCI Object Storage configuration section
The existing OCI section in admin.rst was a minimal stub that only showed
a config snippet without explaining how to actually set up connectivity.

Add documentation for:
- The OCI S3-compatible endpoint URL format (namespace + region)
- That credentials must be set explicitly via AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY using OCI Customer Secret Keys (unlike AWS,
  OCI has no instance metadata fallback compatible with STS/EC2)
- A note that iam_role_arn is AWS-specific and should be omitted for OCI

Fixes: SCYLLADB-2047

Closes scylladb/scylladb#29689

(cherry picked from commit 8a305dd6c7)

Closes scylladb/scylladb#29916
2026-05-16 19:34:25 +03:00
Piotr Szymaniak
e87c4d80aa alternator/doc: update Streams compatibility docs
Alternator Streams graduated from experimental in #29604.  Update the
compatibility and FAQ docs accordingly:

- Replace the "Experimental API features" section with a new
  "Alternator Streams" section that lists known differences without
  the experimental framing.
- Expand the alternator_streams_increased_compatibility paragraph to
  explain both consequences of leaving it off (spurious no-op events
  and inaccurate INSERT/MODIFY distinction) and the performance cost
  of enabling it (LWT path for every write).
- Drop the stale ShardFilter limitation (now implemented).
- Replace the alternator-streams FAQ example with
  strongly-consistent-tables so the multi-feature syntax example
  remains useful.

Fixes SCYLLADB-462

Closes scylladb/scylladb#29695

(cherry picked from commit ac3fff897a)

Closes scylladb/scylladb#29921
2026-05-16 19:34:01 +03:00
Aleksandra Martyniuk
4b8672256d service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: SCYLLADB-2056

Closes scylladb/scylladb#29622

(cherry picked from commit d874d355c2)

Closes scylladb/scylladb#29927
2026-05-16 19:18:18 +03:00