6371 Commits

Author SHA1 Message Date
Piotr Dulikowski
8dfd455001 Merge 'strong consistency: fix drop table blocking on stuck writes and handle timeout in update()' from Petr Gusev
- Fix table drop blocking for the full client timeout when in-flight writes can't reach quorum
- Handle unhandled timeout exception in the wait-for-leader loop during group startup

When a strongly consistent table is dropped, `schedule_raft_group_deletion()` calls `g->close()`, which waits for all in-flight operations to release their gate holders. But other nodes may have already destroyed their raft servers for this group, so an in-flight write on the leader cannot reach quorum and hangs until the client timeout expires (~seconds), unnecessarily delaying group deletion.

Additionally, the wait-for-leader loop in groups_manager::update() uses abort_on_expiry with a 60-second timeout but never catches the exception if it fires, leaving the group in an indeterminate state.

SCYLLADB-2080 fix:
- Reorder `schedule_raft_group_deletion`: initiate gate close (prevents new operations), then abort the raft server (unblocks stuck writes by causing `raft::stopped_error`), then await the gate future (resolves immediately since holders are released).
- Handle `raft::stopped_error` in the coordinator's top-level catch blocks (both write and read paths): if the table no longer exists, return `no_such_column_family` (CQL layer converts to InvalidRequest: unconfigured table). Otherwise fall through to the default timeout handling.
- Replace gate->hold() with try_hold() + on_internal_error in acquire_server, with a comment explaining why the gate can never be closed at that point (table removal in `schema_applier::commit_on_shard` precedes gate closure, with no scheduling point in between).

Timeout handling fix:
- Use `coroutine::as_future` in the wait-for-leader loop to catch timeout exceptions gracefully — log a warning and break out instead of propagating unhandled.

Includes a cluster test reproducer (test_drop_table_unblocks_stuck_write) that:
1. Pauses a write on the leader before add_entry
2. Drops the table (follower destroys its group immediately)
3. Resumes the write — verifies it fails promptly with InvalidRequest ("unconfigured table") instead of hanging for 15 seconds

backport: no need, strong consistency is not released yet

Fixes: SCYLLADB-2080

Closes scylladb/scylladb#30105

* github.com:scylladb/scylladb:
  strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
  strong consistency: abort raft server before gate close when dropping a table
  test/cluster: rewrite test_queries_while_dropping_table for SCYLLADB-2080
2026-05-28 09:59:20 +02:00
Emil Maskovsky
f845918861 raft: don't block replace when group0 leader is unknown
The join_node_request_handler rejects replace requests when the node
being replaced is still seen as the group0 leader. It loops for up to
10s waiting for the leader to change. However, the loop condition also
blocked when current_leader() returned empty (no leader known):

    while (!g0_server.current_leader() || *params.replaced_id == g0_server.current_leader())

This is incorrect: if current_leader() is empty, it means the old
leader is already gone (election in progress). The replaced node is
no longer the leader, so the safety check is satisfied and the replace
should be allowed to proceed.

Remove the !current_leader() check so the loop only continues while the
replaced node is positively identified as the current leader.
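
The difference between the two loop conditions can be sketched as a
small Python model. The function names and the use of None for "no
leader known" are illustrative assumptions, not the C++ handler code.

```python
from typing import Optional


def old_should_wait(current_leader: Optional[str], replaced_id: str) -> bool:
    """Old condition: also keeps waiting while no leader is known."""
    return current_leader is None or current_leader == replaced_id


def new_should_wait(current_leader: Optional[str], replaced_id: str) -> bool:
    """New condition: waits only while the replaced node is the known leader."""
    return current_leader == replaced_id
```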

No backport needed: the failure rate is 2/17K in CI (dev mode only,
caused by reactor stalls under extreme resource contention) and the
code path only affects replace-after-kill scenarios where the replaced
node was the group0 leader.

Refs: SCYLLADB-2125

Closes scylladb/scylladb#30098
2026-05-27 14:56:30 +02:00
Petr Gusev
f2b1cbe998 strong consistency/groups_manager: handle timeout in update() wait-for-leader loop
The wait-for-leader loop in groups_manager::update() uses abort_on_expiry
with a 60-second timeout. If the timeout fires, co_await w->future throws
an exception that propagates unhandled out of the server_control_op
coroutine, leaving the group in an indeterminate state.

Use coroutine::as_future to catch the exception, log a warning, and break
out of the loop gracefully. The group will still be reported as started
(allowing other operations to proceed) even if the leader wasn't found
within the timeout.
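
The pattern of turning a timeout into a logged warning can be sketched
in Python with asyncio. This is only an analogy for the
coroutine::as_future approach: wait_for_leader, the logger name and the
Event are hypothetical, and the sketch returns a flag where the real
loop breaks.

```python
import asyncio
import logging

log = logging.getLogger("groups_manager_sketch")


async def wait_for_leader(leader_known: asyncio.Event, timeout: float) -> bool:
    """Wait for a leader; on timeout, log a warning instead of raising."""
    try:
        await asyncio.wait_for(leader_known.wait(), timeout)
        return True
    except asyncio.TimeoutError:
        log.warning("no leader found within %.2fs, continuing", timeout)
        return False


async def timeout_demo() -> bool:
    # The event is never set, so the wait times out.
    return await wait_for_leader(asyncio.Event(), timeout=0.01)


found = asyncio.run(timeout_demo())
```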
2026-05-27 12:06:46 +02:00
Petr Gusev
d922c43358 strong consistency: abort raft server before gate close when dropping a table
When a strongly consistent table is dropped, schedule_raft_group_deletion()
used to call g->close() first, which waits for all in-flight operations to
release their gate holders. But other nodes may have already destroyed their
raft servers for this group, so an in-flight write on the leader cannot
reach quorum and hangs until the client timeout expires, unnecessarily
delaying group deletion.

Fix: initiate gate close (prevents new operations from entering), then
abort the raft server (causes in-flight add_entry/read_barrier to throw
raft::stopped_error, releasing their gate holders), then await the gate
future (resolves immediately since holders are now released).
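
The ordering above (start the close, abort, then await the close) can
be modeled with asyncio. Gate, StoppedError and the `aborted` event are
hypothetical stand-ins for the Seastar gate, raft::stopped_error and
the raft server abort; the real code is C++ and differs in detail.

```python
import asyncio
from typing import Tuple


class StoppedError(Exception):
    """Stand-in for raft::stopped_error."""


class Gate:
    """Minimal gate: counts holders and rejects new ones once closed."""

    def __init__(self) -> None:
        self._holders = 0
        self._closed = False
        self._drained = asyncio.Event()

    def try_hold(self) -> bool:
        if self._closed:
            return False
        self._holders += 1
        return True

    def release(self) -> None:
        self._holders -= 1
        if self._closed and self._holders == 0:
            self._drained.set()

    def close(self) -> "asyncio.Future[bool]":
        # Closing only starts here; the returned future resolves once
        # every holder has been released.
        self._closed = True
        if self._holders == 0:
            self._drained.set()
        return asyncio.ensure_future(self._drained.wait())


async def drop_demo() -> Tuple[str, bool]:
    gate = Gate()
    aborted = asyncio.Event()

    async def stuck_write() -> None:
        if not gate.try_hold():
            raise RuntimeError("gate already closed")
        try:
            await aborted.wait()   # quorum never arrives; only abort wakes us
            raise StoppedError()
        finally:
            gate.release()

    write = asyncio.create_task(stuck_write())
    await asyncio.sleep(0)                      # let the write take its holder
    closing = gate.close()                      # 1. initiate gate close
    new_op_admitted = gate.try_hold()           # new operations are rejected
    aborted.set()                               # 2. abort the "raft server"
    await asyncio.wait_for(closing, timeout=1)  # 3. await the gate future
    try:
        await write
        outcome = "completed"
    except StoppedError:
        outcome = "stopped"
    return outcome, new_op_admitted


result = asyncio.run(drop_demo())
```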

Handle raft::stopped_error in the coordinator's top-level catch blocks
(both write and read paths): if the table no longer exists, return
no_such_column_family (which the CQL layer converts to InvalidRequest
'unconfigured table'). Otherwise fall through to the default timeout
handling.

Also replace gate->hold() with try_hold() + on_internal_error in
acquire_server, and handle the timeout exception in the wait-for-leader
loop in update() gracefully (log + break instead of propagating).

Fixes: SCYLLADB-2080
2026-05-27 12:06:46 +02:00
Wojciech Mitros
515faaf1d0 strong_consistency: cleanup forwarding reads to leader
When forwarding reads to the raft group leader was introduced, we
didn't use the methods allowing us to cache the leader after
completing requests - we fix it in this commit by using the
redirect_to_leader method prepared for this case.
Also remove a duplicated consecutive 'if'.

Closes scylladb/scylladb#30102
2026-05-27 09:49:06 +02:00
Botond Dénes
555cfbcd38 Merge 'treewide: replace deprecated smp::count and smp::all_cpus() with new APIs' from Avi Kivity
Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads.
- schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.

No backport: the Seastar commit that deprecated these functions hasn't (and won't) make its way into any release branches (and the warnings are cosmetic anyway).

Closes scylladb/scylladb#29990

* github.com:scylladb/scylladb:
  treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
  scylla-gdb: read shard count from smp::_this_smp instead of smp::count
  schema_builder: make shard_count an explicit constructor parameter
2026-05-27 09:42:06 +03:00
Avi Kivity
8010e408a2 treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
Replace all uses of the deprecated seastar::smp::count with
this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards()
across the ScyllaDB codebase (seastar submodule untouched).

Both replacement functions require a reactor thread context. All call
sites were verified to run on reactor threads.

Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default
  parameter value. This is safe since all callers are on reactor threads,
  but the expression is now evaluated at each call site rather than being
  a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh,
  ent/encryption/encryption.cc: used in default member initializers and
  constructor member-init-lists. Objects are always constructed on reactor
  threads.

Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
2026-05-26 17:35:20 +03:00
Avi Kivity
f165b396fd schema_builder: make shard_count an explicit constructor parameter
A recent Seastar update deprecated smp::count and introduced
this_smp_shard_count() as a replacement. One difference is that
this_smp_shard_count() wants to run on a reactor thread.

This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE)
that nevertheless use a schema, as the schema_builder constructor
references smp::count. If we replace it with this_smp_shard_count()
then it will crash when running without a reactor.

To fix, remove the implicit this_smp_shard_count() call from raw_schema's
constructor and require callers to pass shard_count explicitly to
schema_builder. This allows tests that don't run on a reactor thread
to construct schemas without crashing.

Production code and reactor-based tests pass this_smp_shard_count().
Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test,
wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test)
pass a fixed shard count of 1.

Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE)
but also contains one plain BOOST_AUTO_TEST_CASE
(test_empty_key_view_comparison) that constructs a schema_builder without
a reactor context. This test also receives a fixed shard count of 1.
2026-05-26 11:55:56 +03:00
Nikos Dragazis
54cb6d4608 test: Order task-wait before finalization in test_migration_wait_task
The purpose of this test is to verify that the task manager's "wait" API
works correctly for vnodes-to-tablets migration virtual tasks. It starts
a `wait_task` HTTP request concurrently with a finalize (or rollback)
operation, and asserts that the wait returns the correct final state
("done" or "suspended").

The test uses `asyncio.create_task()` to wrap the wait request into a
task, and then immediately calls finalize. With asyncio's lazy task
scheduling, the wait coroutine does not start until the event loop
yields, so the finalization request reaches the server before wait, and
therefore may also complete before it. Once finalization completes, the
virtual migration task is no longer discoverable, causing a
"task not found" error.

Add a log message in Scylla's wait handler and a synchronization point
in the test to ensure that the wait request lands on the server before
finalization. This follows the same pattern used in
`test_tablet_tasks.py::check_and_abort_repair_task`.
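
The lazy task scheduling described above can be reproduced in a few
lines. This is a standalone sketch: wait_request stands in for the
wait_task HTTP request and no server is involved.

```python
import asyncio
from typing import List


async def scheduling_demo() -> List[str]:
    events: List[str] = []

    async def wait_request() -> None:
        # Stands in for the wait_task HTTP request.
        events.append("wait started")

    # create_task() only schedules the coroutine; it has not run yet.
    task = asyncio.create_task(wait_request())
    events.append("finalize sent")  # recorded before wait_request() starts

    await task  # yielding to the event loop lets wait_request() run
    return events


order = asyncio.run(scheduling_demo())
```

Here `order` ends up with "finalize sent" first, which is the ordering
the synchronization point in the test is meant to rule out.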

Fixes SCYLLADB-2077

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#29973
2026-05-26 10:43:22 +03:00
Botond Dénes
722efb4d8f storage_proxy: avoid large allocation in data_read_resolver::resolve
The versions collection in data_read_resolver::resolve() is a
std::vector<std::vector<version>>. This contains one entry per unique
partition in the union of all results from each replica.
The vector's size is reserved to the size of partitions in the first
replica's response. Later, new entries are added via `emplace_back()`
for partitions found only in other replicas' responses.
This can become really large if there are a lot of small partitions, and
especially when there are big differences between the partition set
returned by individual replicas.

With small partitions (e.g. Alternator items with TTL, typically 150-200
bytes each), a single 1 MB read page can carry thousands of partitions,
easily pushing this vector past 2730 entries -- the point at which a
std::vector doubling reallocation exceeds the 128 KB seastar
large-allocation warning threshold:

    2 * 2731 * sizeof(std::vector<version>=24) > 131072
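
The arithmetic can be checked with a short script. It assumes exact
doubling on growth and uses the 24-byte element size and the 128 KB
threshold quoted above; the names are illustrative.

```python
ELEMENT_SIZE = 24        # sizeof(std::vector<version>) as stated above
THRESHOLD = 128 * 1024   # 131072 bytes


def doubled_allocation_bytes(entries: int) -> int:
    """Bytes requested when a vector holding `entries` elements doubles."""
    return 2 * entries * ELEMENT_SIZE


# Smallest entry count whose doubling exceeds the threshold.
first_over = next(n for n in range(1, 10_000)
                  if doubled_allocation_bytes(n) > THRESHOLD)
```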

Switching to utils::chunked_vector caps every individual allocation at
128 KB by design, regardless of the number of partitions or how much
the replicas diverge.  The four internal helper functions that receive
this container (find_short_partitions, get_last_row,
got_incomplete_information_across_partitions, got_incomplete_information)
are updated to accept the new type; their logic is unchanged.

Fixes: SCYLLADB-460

Closes scylladb/scylladb#29325
2026-05-25 21:09:36 +03:00
Piotr Dulikowski
3a5dd2e5be Merge 'strong_consistency: forward reads to the raft leader' from Wojciech Mitros
Strongly consistent reads currently call read_barrier() on whichever
replica happens to process the request. When a follower runs
read_barrier(), it sends an RPC to the leader to get the current read
index, then waits for its local apply index to catch up. If the follower
is behind, this wait can be significant.

By forwarding linearizable reads to the leader, we don't need an RPC from replica to leader to get the index to wait for apply -- it's available locally.

Note that read_barrier() is still required on the leader to confirm it
is still the leader and guarantee linearizability. A future optimization
would be to implement leases in the raft library, which could eliminate
read_barrier() on the leader entirely.

The CL-to-behavior mapping is isolated in a single parse_consistency_level()
function:
- CL=(LOCAL_)QUORUM -> linearizable: forwarded to the raft leader
- CL=(LOCAL_)ONE -> non-linearizable: existing behavior (no read_barrier()/forwarding, may return stale results)
- All other CLs -> invalid request

Read forwarding reuses the same CQL-layer bounce_to_node() mechanism
that write forwarding already uses. The transport layer's existing
requests_forwarded_* metrics automatically count forwarded reads.
Coordinator-level metrics (linearizable_reads, non_linearizable_reads,
writes) are added for visibility into the strong consistency workload.

Fixes: SCYLLADB-1157

Closes scylladb/scylladb#29575

* github.com:scylladb/scylladb:
  strong_consistency: test read forwarding to leader
  strong_consistency: skip read_barrier() for non-linearizable reads
  strong_consistency: split coordinator-level read latency metrics
  strong_consistency: forward linearizable reads to raft leader
  strong_consistency: classify reads by consistency level
  strong_consistency: add begin_read() to raft_server
2026-05-25 10:55:00 +02:00
Gleb Natapov
0bf050d175 storage_proxy: hold shared pointer to a table object during entire query_partition_key_range_concurrent execution
Otherwise if a table is dropped in the middle of a scan the object may
disappear.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2137

Closes scylladb/scylladb#29988
2026-05-24 21:54:08 +03:00
Petr Gusev
954426407e storage_proxy: only cancel write handlers with pending remote targets during drain
The previous fix (cancel_all_write_response_handlers in do_drain)
was too aggressive — it killed all handlers including ones used by
group0 for raft commits. Since group0 is still running at that point
(before wait_for_group0_stop), this caused group0 operations to fail
(SCYLLADB-2168).

The actual problem is only with handlers that have pending remote
targets: after stop_transport() their MUTATION_DONE responses can
never arrive via messaging. Handlers whose only pending targets are
local can still complete via apply_locally and should be left alone.

Add cancel_nonlocal_write_response_handlers() which checks each
handler's remaining targets against the local host ID. Only handlers
with at least one remote pending target are cancelled. Use it in
do_drain instead of cancel_all_write_response_handlers. The latter
remains unchanged for drain_on_shutdown (final proxy shutdown where
all handlers must be killed).
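
The selection rule can be sketched as a filter over pending targets.
The dict of handler ids to host names is a hypothetical simplification
of the handler table in storage_proxy.

```python
from typing import Dict, List, Set


def select_handlers_to_cancel(pending_targets: Dict[int, Set[str]],
                              local_host: str) -> List[int]:
    """Return ids of handlers with at least one remote pending target."""
    return sorted(
        handler_id
        for handler_id, targets in pending_targets.items()
        if any(t != local_host for t in targets)
    )
```

Handlers whose only pending target is the local host, or that have no
pending targets at all, are left out of the result.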

Fixes: SCYLLADB-2168

Closes scylladb/scylladb#30020
2026-05-23 13:37:34 +02:00
Wojciech Mitros
afa2ef6816 strong_consistency: skip read_barrier() for non-linearizable reads
Non-linearizable reads (CL=ONE) no longer call read_barrier() before
querying the local replica. This is safe because state_machine::apply()
only writes to the table after raft commit, so a local read without
read_barrier cannot see uncommitted data — just potentially stale data
which is acceptable for CL=ONE semantics.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
d07692a7ff strong_consistency: split coordinator-level read latency metrics
Split the latency metrics for strongly consistent reads into two
categories: linearizable and non-linearizable. They replace the
existing metrics for both types combined - this shouldn't cause
issues because the feature is still experimental and both the
initial introduction of latency metrics and the split will be
a part of the same release.

Also fix a test that was using the old metric.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
297094c08f strong_consistency: forward linearizable reads to raft leader
For linearizable reads (CL=QUORUM), check leadership via begin_read()
before proceeding. If this node is not the leader, redirect the request
to the leader via need_redirect (handled by bounce_to_node() in the CQL
layer). If the leader is unknown, wait and retry. When this node is the
leader, perform read_barrier() locally. This avoids sending an RPC from
the replica to the leader to get the index to wait for apply - it's
available locally. Also, linearizable reads can use and fill the cache
of leaders that we store for strongly consistent tablet groups.

Non-linearizable reads (CL=ONE) retain the existing behavior:
create_operation_ctx() redirects if not a replica, then read_barrier()
is performed on the local replica. This will be changed in the following
commit.

Also fix a copy-paste typo in the unknown exception log message that said
"mutate()" instead of "query()"

Fixes: SCYLLADB-1157
2026-05-23 11:35:37 +02:00
Wojciech Mitros
c0ea98f922 strong_consistency: classify reads by consistency level
Introduce a read_type enum (linearizable vs non_linearizable) and transform
the existing "validate" function into a "parse" method - instead of checking
if the consistency level is one of the accepted ones, we now also return the
corresponding read type for strong consistency.
The "parse" function maps CQL consistency levels to the following read types:
- CL=(LOCAL_)QUORUM -> linearizable (this is the default CL)
- CL=(LOCAL_)ONE -> non_linearizable
- all others -> throw
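
A minimal Python model of this mapping, assuming consistency levels are
passed as plain strings; ReadType and the ValueError are illustrative
choices, not the C++ types used in the commit.

```python
from enum import Enum


class ReadType(Enum):
    LINEARIZABLE = "linearizable"
    NON_LINEARIZABLE = "non_linearizable"


def parse_consistency_level(cl: str) -> ReadType:
    """Map a CQL consistency level name to a read type, or raise."""
    if cl in ("QUORUM", "LOCAL_QUORUM"):
        return ReadType.LINEARIZABLE
    if cl in ("ONE", "LOCAL_ONE"):
        return ReadType.NON_LINEARIZABLE
    raise ValueError(f"unsupported consistency level: {cl}")
```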

The classification is performed in the CQL layer (select_statement) to
keep the coordinator free of CL concepts.
2026-05-23 11:35:37 +02:00
Wojciech Mitros
1f91524547 strong_consistency: add begin_read() to raft_server
Add begin_read() method to raft_server that checks leadership for read
operations. Unlike begin_mutate(), it does not need to compute a
timestamp or interact with leader_info. It simply checks current_leader()
and returns one of three dispositions:

  - ok: this node is the leader, proceed with read_barrier() locally
  - raft::not_a_leader: redirect to the indicated leader
  - need_wait_for_leader: leader unknown, caller must wait and retry

This will be used by the read forwarding logic in subsequent commits.
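
The three dispositions can be sketched as a tagged result. The class
names and the string node ids are hypothetical; the real method lives
on the C++ raft_server.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Ok:
    """This node is the leader; run read_barrier() locally."""


@dataclass(frozen=True)
class NotALeader:
    """Another node is the leader; redirect to it."""
    leader: str


@dataclass(frozen=True)
class NeedWaitForLeader:
    """No leader is known; the caller waits and retries."""


Disposition = Union[Ok, NotALeader, NeedWaitForLeader]


def begin_read(my_id: str, current_leader: Optional[str]) -> Disposition:
    if current_leader is None:
        return NeedWaitForLeader()
    if current_leader == my_id:
        return Ok()
    return NotALeader(leader=current_leader)
```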
2026-05-23 11:35:36 +02:00
Łukasz Paszkowski
cf0ad2bde9 tablet_allocator: use chunked_vector in cluster_resize_load to avoid oversized allocations
In make_resize_plan(), the tables_need_resize vector in cluster_resize_load
accumulates all tables that require a resize decision before the downstream
heap-based logic selects the top-N most urgent ones to emit.

In clusters with thousands of tables and aggressive tablets-per-shard scaling
(e.g., 5000 empty tables with scaling factors of 0.04-0.12), nearly all tables
satisfy the merge condition (scaled target < current tablet count), causing
the vector to grow to thousands of entries. With ~100 bytes per element,
std::vector's doubling strategy triggers contiguous allocations exceeding
256KB, producing seastar oversized allocation warnings.

Replace std::vector with utils::chunked_vector in cluster_resize_load for
both tables_need_resize and tables_being_resized. chunked_vector caps
individual allocations at 128KB, splitting into multiple chunks when needed.
For normal workloads (fewer than ~1300 resize candidates), behavior is
identical to std::vector: a single contiguous chunk with the same performance.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1955

Closes scylladb/scylladb#29946
2026-05-22 16:52:12 +03:00
Łukasz Paszkowski
96a992002c tasks: fix busy-spin and shutdown hang in tablet_virtual_task::wait() for repair tasks
The condition variable predicate for repair tasks unconditionally
returned true (introduced in e5928497ce), which meant event.wait(pred)
never actually suspended: do_until checks the predicate first, and if
it's already satisfied, returns immediately without calling the inner
wait(). This caused two problems:
1. The while(true) loop busy-spun, polling without blocking between
   topology changes.
2. During shutdown, event.broken() had no effect because no waiter was
   registered on the CV. The loop kept spinning, holding the HTTP
   server's task gate open and preventing http_server::stop() from
   completing. After ~15 minutes, systemd killed the process with
   SIGABRT.

The fix replaces the synchronous predicate with an async task_finished()
helper that dispatches on the task type. Since the repair check is async
(for_each_tablet scans every tablet), we cannot use event.wait(Pred).
Instead, we register a waiter via event.wait() *before* running the async
check, ensuring no broadcast is missed during the check. event.broken()
during shutdown propagates broken_condition_variable to the registered
waiter and unblocks the loop promptly.
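
Why the waiter must be registered before the async check can be shown
with an asyncio model. Broadcast is a hypothetical stand-in for the
Seastar condition variable, and the shutdown path via event.broken() is
not modeled.

```python
import asyncio
from typing import Awaitable, Callable, List


class Broadcast:
    """Tiny stand-in for a condition variable with explicit waiters."""

    def __init__(self) -> None:
        self._waiters: List[asyncio.Future] = []

    def wait(self) -> asyncio.Future:
        # Registers the waiter immediately, before the caller awaits it.
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        return fut

    def broadcast(self) -> None:
        waiters, self._waiters = self._waiters, []
        for fut in waiters:
            if not fut.done():
                fut.set_result(None)


async def wait_until_finished(cv: Broadcast,
                              finished: Callable[[], Awaitable[bool]]) -> int:
    """Return the number of checks performed before `finished()` held."""
    checks = 0
    while True:
        waiter = cv.wait()   # register before the async check
        checks += 1
        if await finished():
            waiter.cancel()
            return checks
        await waiter         # a broadcast sent during the check wakes us


async def waiter_demo() -> int:
    cv = Broadcast()
    state = {"done": False}

    async def finished() -> bool:
        result = state["done"]   # snapshot taken before yielding
        await asyncio.sleep(0)   # the check itself yields
        return result

    waiter_task = asyncio.create_task(wait_until_finished(cv, finished))
    await asyncio.sleep(0)       # let the first check start and yield
    state["done"] = True         # state changes while the check is in flight
    cv.broadcast()
    return await asyncio.wait_for(waiter_task, timeout=1)


checks_needed = asyncio.run(waiter_demo())
```

If the waiter were created only after the check returned, the broadcast
in waiter_demo would find no registered waiter and the loop would block
until the next notification.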

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1532

Closes scylladb/scylladb#29485
2026-05-22 16:47:48 +03:00
Avi Kivity
305346a3ec Merge 'Don't materialize collections into intermediate representations' from Botond Dénes
Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again.
This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using it. Alternator is also a heavy user of collections.

This PR aims to solve this problem once and for all.  The plan is as follows:
* Promote direct use of the serialized collection format:
    - Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`.
    - Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`).
* Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure.
* Drop the old infrastructure, to avoid accidental regressions.

Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion.

To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |   After | Change     |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps)  | 315,760 | 332,021 | **+5.1%**  |
| Instructions/op (median) |  53,776 |  48,681 | **-9.5%**  |
| CPU cycles/op (median)   |  17,365 |  16,471 | **-5.1%**  |
| Allocations/op           |    85.1 |    82.1 | **-3.5%**  |

**Significant improvement.** Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```

| Metric                   |    Before |    After | Change    |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps)  |   150,823 |  149,678 | **-0.8%** |
| Instructions/op (median) |   108,388 |  103,858 | **-4.2%** |
| CPU cycles/op (median)   |    34,860 |   35,371 | **+1.5%** |
| Allocations/op           | ~105–108  | ~102–103 | **-3.0%** |

**Mixed, mostly neutral.** Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.

---

**Command:**
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```

| Metric                   |  Before |  After | Change    |
|--------------------------|--------:|-------:|-----------|
| Throughput (median tps)  |  55,777 | 56,051 | **+0.5%** |
| Instructions/op (median) | 246,215 |246,610 | **+0.2%** |
| CPU cycles/op (median)   |  77,641 | 77,020 | **-0.8%** |
| Allocations/op           |   340.4 |  335.4 | **-1.5%** |

**Essentially neutral.** All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.

---

The change has a **clear, substantial positive effect on reads** (~5% throughput gain, ~9.5% fewer instructions per op).
The write and alternator paths are **unaffected in practice** — changes there are within measurement noise. No regressions are apparent.
This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.

Fixes: #3602

Improvement, no backport.

Closes scylladb/scylladb#29127

* github.com:scylladb/scylladb:
  mutation/collection_mutation: make collection_mutation::_data private
  mutation_collection: drop collection_mutation_description and friends
  test: move away from collection_mutation_description
  tree: move away from collection_mutation_description
  test: move away from collection_mutation_view::with_deserialized()
  tree: move away from collection_mutation_view::with_deserialized()
  types: fix indendation, left broken by previous commit
  types: move away from collection_mutation_view::with_deserialized()
  types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT()
  schema: column_computation: move away from collection_mutation_view::with_deserialized()
  mutation: move away from collection_mutation_view::with_deserialized()
  alternator: move away from collection_mutation_view::with_deserialized()
  cdc: move away from collection_mutation_view::with_deserialized()
  mutation/collection_mutation: printer: don't deserialize collections
  mutation/collection_mutation: difference(): don't deserialize collections
  mutation/collection_mutation: merge(): don't deserialize collections
  mutation/collection_mutation: extract compact_and_expire() to free function
  mutation/collection_mutation: refactor empty(), is_any_live() and last_update()
  compaction_garbage_collector: pass collection_mutation to collect()
  test/boost/mutation_test: add tests for collection_mutation_{view,writer}
  mutation/collaction_mutation: collection_mutation_view: add methods to inspect content
  mutation/collection_mutation: add collection_mutation_writer
  mutation/collection_mutation: collection_mutation(): generate valid collection
  mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
  mutation/atomic_cell: drop unused type param from from_bytes()
2026-05-21 17:10:40 +03:00
Patryk Jędrzejczak
1ed3f5c4af Merge 'storage_service: cancel write handlers during drain to prevent shutdown deadlock' from Petr Gusev
Fixes a shutdown deadlock where a node hangs because `stale_versions_in_use()` blocks on stale `token_metadata` versions held by write handlers whose `MUTATION_DONE` responses can never arrive (transport is already stopped).

Two manifestations depending on whether the shutting-down node is the topology coordinator:
- Coordinator: do_drain → wait_for_group0_stop deadlocks because the topology coordinator fiber is stuck in barrier_and_drain → stale_versions_in_use().
- Non-coordinator: ss::stop → uninit_messaging_service deadlocks because the barrier_and_drain RPC handler holds the gate open.

The non-coordinator case was fixed in PR #24714 (cancel all write requests on storage_proxy shutdown), but its test never actually failed: the write handler always captured the current token_metadata version because `pause_before_barrier_and_drain` used `one_shot=True`, so only the first `barrier_and_drain` was paused. The topology state hadn't advanced by that point, meaning the write handler's ERM version matched the current version and `stale_versions_in_use()` returned immediately. The coordinator case was not covered at all.

Cancel all write response handlers on all shards right after `stop_transport()` in `do_drain()`. This releases their ERMs and the associated stale token_metadata versions, unblocking `stale_versions_in_use()`.

Fixed the test to ensure the write handler holds a stale version: use one_shot=False, let the first barrier_and_drain through (version still current), then wait for the second one (version now stale). Extended to cover both coordinator and non-coordinator shutdown on the same 2-node cluster.

Also includes supporting changes:
- error_injection: release wait_for_message waiters on disable() so the test can atomically unblock paused handlers
- error_injection: add non-shared mode to wait_for_message for per-invocation message semantics
- scylla_cluster.py: allow stop() to bypass start_stop_lock so SIGKILL works while stop_gracefully is blocked

Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665

backports: SCYLLADB-1842 reported a failure in 2025.1, so we need to backport to all versions starting from 2025.1

Closes scylladb/scylladb#29882

* https://github.com/scylladb/scylladb:
  storage_service: cancel write handlers during drain to prevent shutdown deadlock
  test_unfinished_writes_during_shutdown: extend to cover coordinator shutdown
  test_unfinished_writes_during_shutdown: fix to reproduce the shutdown deadlock
  test_unfinished_writes_during_shutdown: await add_last_node_task instead of cancelling it
  test_unfinished_writes_during_shutdown: add timeout and deadlock detection for shutdown_task
  test: scylla_cluster: allow stop() to bypass start_stop_lock
  error_injection: add non-shared mode to wait_for_message
  error_injection: release waiters when injection is disabled
2026-05-21 15:43:36 +02:00
Wojciech Mitros
13c043903d strong_consistency: cache leader location for non-replica nodes
When a non-replica node handles a strongly consistent write, it must
forward the request to a replica. If the closest replica is not the
leader, the request gets redirected again, causing an extra roundtrip.

Add a leader location cache in groups_manager, keyed by raft group_id.
After a write request is forwarded, the CQL transport layer records the
final node as the leader in the cache. Subsequent write requests from
the same node for the same group are forwarded directly to the cached
leader, eliminating the extra roundtrip.

The cache is only used for writes. Reads can be served by any replica,
so they skip the cache and use proximity-based routing instead.

Cache entries are validated at use time: if the cached leader is no
longer a replica (e.g. after tablet migration), the entry is evicted
and the normal closest-replica path is taken. This prevents a scenario
where two nodes keep redirecting to each other because each thinks the
other is the leader while both are in fact non-replicas; such a loop
is broken as soon as the tablet maps are updated.

On token_metadata updates, entries for groups that no longer exist
(e.g. table dropped, tablet merged) are evicted. Entries for groups
that still exist are kept — use-time validation handles staleness.

An on_node_resolved callback is propagated through the redirect/bounce
path so the transport layer can update the cache generically without
coupling to the strong-consistency coordinator. The coordinator creates
the callback only for writes (capturing the groups_manager and
group_id) and attaches it to the bounce message; the transport layer
invokes it once the final node is known, keeping the forwarding
infrastructure subsystem-agnostic.

We also add a test which verifies that after the initial redirect,
following requests to the same node avoid the extra redirect and
forward directly to the leader.
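
The cache behaviour described above can be sketched roughly as follows. This is an illustrative model only, not the groups_manager code: `LeaderLocationCache`, `pick_write_target`, and the string host and group ids are hypothetical names chosen for the example.

```python
from typing import Dict, Optional, Set


class LeaderLocationCache:
    """Hypothetical per-node cache: raft group id -> last known leader host."""

    def __init__(self) -> None:
        self._leaders: Dict[str, str] = {}

    def record(self, group_id: str, host: str) -> None:
        # Called once the final node of a forwarded write is known.
        self._leaders[group_id] = host

    def lookup(self, group_id: str, replicas: Set[str]) -> Optional[str]:
        # Use-time validation: a cached leader that is no longer a replica
        # is evicted, and the caller falls back to closest-replica routing.
        host = self._leaders.get(group_id)
        if host is not None and host not in replicas:
            del self._leaders[group_id]
            return None
        return host

    def retain_groups(self, live_groups: Set[str]) -> None:
        # On a topology update, drop entries only for groups that are gone.
        self._leaders = {g: h for g, h in self._leaders.items() if g in live_groups}


def pick_write_target(cache: LeaderLocationCache, group_id: str,
                      replicas: Set[str], closest_replica: str) -> str:
    # Writes prefer the cached leader; reads would skip the cache entirely.
    cached = cache.lookup(group_id, replicas)
    return cached if cached is not None else closest_replica
```

Entries for groups that still exist are deliberately left alone by `retain_groups`; staleness is handled lazily in `lookup`.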

Fixes: SCYLLADB-1064

Closes scylladb/scylladb#29392
2026-05-21 10:32:56 +02:00
Gleb Natapov
cc034f84c5 schema: ensure committed_by_group0 is set for all non-system tables on boot
Tables created before the GROUP0_SCHEMA_VERSIONING feature was enabled
have committed_by_group0 = null in system_schema.scylla_tables. This
causes maybe_delete_schema_version() to delete their version cell,
forcing the legacy hash-based schema version computation path.

Add ensure_committed_by_group0() which runs on boot and fixes up any
non-system tables where committed_by_group0 is not true (null or false):

1. Queries system_schema.scylla_tables for rows where committed_by_group0
   is null or false, skipping system keyspaces (system, system_schema).
2. Takes a group0 guard
3. Re-checks after the raft barrier in case another node already fixed it.
4. For each table needing fixup, creates a mutation writing the version
   cell (from the in-memory schema). The committed_by_group0 = true flag
   is stamped by add_committed_by_group0_flag() inside announce().
5. Announces via raft group0.
6. Retries with a small random delay on group0_concurrent_modification.
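
The control flow of the steps above might look roughly like the sketch below. The callables `find_tables`, `take_guard`, and `announce`, as well as the `ConcurrentModification` exception, are hypothetical stand-ins for the real group0 machinery; only the check, guard, re-check, announce, retry ordering is taken from the description.

```python
import random
import time


class ConcurrentModification(Exception):
    """Stand-in for group0_concurrent_modification (name is illustrative)."""


def fix_up_with_retry(find_tables, take_guard, announce,
                      max_delay=0.05, sleep=time.sleep):
    """Return the number of tables fixed up, retrying on concurrent changes."""
    while True:
        if not find_tables():        # step 1: anything to fix?
            return 0
        guard = take_guard()         # step 2: take the guard
        tables = find_tables()       # step 3: re-check, another node may have fixed it
        if not tables:
            return 0
        try:
            announce(guard, tables)  # steps 4-5: build mutations and announce
            return len(tables)
        except ConcurrentModification:
            # step 6: back off for a small random delay, then start over
            sleep(random.uniform(0, max_delay))
```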

On other nodes, schema_applier will detect these as "altered" tables
(scylla_tables mutation changed), but since the actual table definition
is unchanged, update_column_family is effectively a no-op.

This is a prerequisite for eventually removing the legacy hash-based
schema versioning code path.

Closes scylladb/scylladb#29911
2026-05-21 10:22:07 +02:00
Botond Dénes
636e2877e2 tree: move away from collection_mutation_description
Use collection_mutation_writer instead.

Add to_managed_bytes() to cql3::raw_value to help avoid some copies.

A special note for sstables/kl/reader.cc: this conversion is not
straightforward, so we accumulate a list of cells and feed them to the
writer at the end. This is sub-optimal, but the code is rarely used, so
it is best to be conservative.
2026-05-21 10:23:29 +03:00
Botond Dénes
24fdfa34dd mutation/collection_mutation: collection_mutation(): remove unused abstract_type param 2026-05-21 08:34:21 +03:00
Petr Gusev
2927f0dd21 storage_service: cancel write handlers during drain to prevent shutdown deadlock
When a node shuts down, do_drain() calls stop_transport() which tears
down the messaging service. After this point, MUTATION_DONE responses
from replicas can no longer reach the coordinator, so any in-flight
write_response_handlers will never complete naturally. These handlers
hold ERMs referencing stale token_metadata versions.

If the topology coordinator calls barrier_and_drain (either on itself
or via RPC), it blocks in stale_versions_in_use() waiting for these
stale versions to be released. This causes:
- On the coordinator node: do_drain -> wait_for_group0_stop deadlock
  (the topology coordinator fiber is stuck in barrier_and_drain).
- On non-coordinator nodes: ss::stop -> uninit_messaging_service
  deadlock (the barrier_and_drain RPC handler holds the gate open).

Fix: cancel all write response handlers on all shards right after
stop_transport() in do_drain(). This releases their ERMs and the
associated stale token_metadata versions, unblocking
stale_versions_in_use().

Heap-allocate _write_handlers_gate and add an allow_new parameter to
cancel_all_write_response_handlers(). When allow_new=true (used by
do_drain), the gate is closed and swapped with a fresh one — existing
handlers are waited on while new handlers can still be created. This
avoids blocking internal writes (paxos learn, compaction history
updates) that still need to create handlers during the remainder of
the drain sequence. When allow_new=false (used by drain_on_shutdown),
the gate is closed permanently — no new handlers can be created after
final shutdown.
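
The close-and-swap idea can be illustrated with a small asyncio model. This is a sketch under assumptions, not the Seastar gate or the storage_proxy code: `Gate` and `WriteHandlers` are hypothetical classes, and the objects are expected to be created inside a running event loop.

```python
import asyncio


class Gate:
    """Minimal gate: counts holders; close() waits until all have left."""

    def __init__(self) -> None:
        self._holders = 0
        self._closed = False
        self._drained = asyncio.Event()
        self._drained.set()

    def enter(self) -> None:
        if self._closed:
            raise RuntimeError("gate closed")
        self._holders += 1
        self._drained.clear()

    def leave(self) -> None:
        self._holders -= 1
        if self._holders == 0:
            self._drained.set()

    async def close(self) -> None:
        self._closed = True
        await self._drained.wait()


class WriteHandlers:
    def __init__(self) -> None:
        self.gate = Gate()

    async def cancel_all(self, allow_new: bool) -> None:
        old = self.gate
        if allow_new:
            # Swap first, so handlers created from now on use a fresh gate
            # while the old holders are being waited on.
            self.gate = Gate()
        # With allow_new=False the only gate stays closed for good.
        await old.close()
```

Because the gate object is replaced rather than reopened, it has to live behind a pointer, which matches the need to heap-allocate it.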

Update test_lwt_shutdown to wait for 'Stop transport: done' instead
of 'Shutting down storage proxy RPC verbs'. The latter message is
now only logged after do_drain() completes, but do_drain() blocks
in cancel_all_write_response_handlers() waiting for the background
paxos learn handler — which is exactly what the test needs to release
before shutdown can proceed.

Fixes: SCYLLADB-1842
Refs: scylladb/scylladb#23665
2026-05-20 22:21:45 +02:00
Petr Gusev
324a08295d error_injection: add non-shared mode to wait_for_message
Add a 'share' parameter to wait_for_message (default true, preserving
existing behavior). When share=false, each handler invocation requires
its own dedicated message to proceed — a message consumed by one
handler is not visible to others.
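
The difference between the two modes can be shown with a toy model. It is not the error_injection implementation: the `Injection` class and its methods are hypothetical, and handlers are represented by plain strings.

```python
from collections import deque


class Injection:
    """Toy model of a pausing injection point (illustrative only)."""

    def __init__(self, share: bool) -> None:
        self.share = share
        self.waiting = deque()   # handlers currently paused
        self.released = []       # handlers allowed to proceed, in order

    def wait_for_message(self, handler: str) -> None:
        self.waiting.append(handler)

    def message(self) -> None:
        if self.share:
            # Shared mode: one message is visible to every waiting handler.
            self.released.extend(self.waiting)
            self.waiting.clear()
        elif self.waiting:
            # Non-shared mode: the message is consumed by a single handler.
            self.released.append(self.waiting.popleft())
```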

Use share=false for the pause_before_barrier_and_drain injection in
raft_topology_cmd_handler. The topology coordinator sends multiple
barrier_and_drain RPCs during a single topology transition (one per
state change). With share=true a single message_injection call
releases all handlers. With share=false the test can release them
one at a time, controlling exactly which topology state the write
handler's ERM captures.
2026-05-20 17:05:54 +02:00
Andrzej Jackowski
c810bb48f4 audit: rebuild rule caches on group0 snapshot and role changes
Nodes can join or reload snapshots after roles and tables
already exist, so the cache cannot rely only on
incremental notifications.

Bulk-load all known roles and tables into the rule cache
on Raft state reload and snapshot transfer. Detect
incremental role creates and drops in reload_modules() by
comparing the loaded roles against the auth cache, and
forward the changes to every shard.

Each shard rebuilds the fnmatch cache locally from its own
rules to avoid cross-shard races when rules are updated
concurrently with entity sync.

Refs SCYLLADB-1430
2026-05-20 06:55:15 +02:00
Patryk Jędrzejczak
c9592a495e Merge 'cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce' from Petr Gusev
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.

Fixes SCYLLADB-2041

backport: need to backport to all versions with LWT over tablets

Closes scylladb/scylladb#29910

* https://github.com/scylladb/scylladb:
  cql: refactor add_tablet_info to take tablet_routing_info directly
  cql: fix UB dereference of nullopt tablet_info in execute_with_condition
  test/boost: add regression test for missing tablet routing after CAS bounce
  cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
2026-05-18 11:19:04 +02:00
Yaniv Michael Kaul
34aac2030c paxos: enable paging for internal paxos state queries
The paxos state queries (load_paxos_state, save_paxos_promise, etc.)
were using page_size=-1 (no paging). While each query returns at most
one row and paging never actually kicks in, the lack of paging causes
these internal queries to be counted as non-paged reads in the metrics,
which can be confusing to users monitoring their cluster.

Add LIMIT 1 to the SELECT query so that may_need_paging() short-circuits
to false (row_limit <= 1), avoiding pager allocation overhead entirely.
Set page_size=1000 so these queries are no longer reported as non-paged
reads.

Refs: https://scylladb.atlassian.net/browse/CUSTOMER-372
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Backport: no, improvement

Closes scylladb/scylladb#29852
2026-05-18 11:35:55 +03:00
Aleksandra Martyniuk
d874d355c2 service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.

Closes scylladb/scylladb#29622
2026-05-15 17:46:28 +02:00
Petr Gusev
167a3c9c50 cql: fix missing TABLETS_ROUTING_V1 payload after CAS shard bounce
After an internal CAS shard bounce, check_locality() was evaluating
against this_shard_id() of the post-bounce shard — which is the correct
tablet shard — so it returned nullopt, and LWT/SERIAL responses omitted
the tablets-routing-v1 custom payload. The client never learned the
correct tablet map.

Fix by recording the original entry shard in client_state (initialized
to this_shard_id() at construction, preserved across shard bounces via
client_state_for_another_shard) and passing it to check_locality() so
it compares against the client's actual routing decision.

No host_id tracking or forwarded_client_state IDL changes are needed
because CAS shard bounces are always intra-node.
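
A reduced model of the bug and the fix is sketched below. The function and class here only mirror the names used in the message; their signatures and the returned dictionary are assumptions made for the example.

```python
from typing import Optional


def check_locality(compared_shard: int, tablet_shard: int) -> Optional[dict]:
    # Routing info is returned only when the compared shard is not the
    # shard that owns the tablet.
    if compared_shard == tablet_shard:
        return None
    return {"correct_shard": tablet_shard}


class ClientState:
    def __init__(self, this_shard: int, entry_shard: Optional[int] = None) -> None:
        self.this_shard = this_shard
        # Recorded at construction and preserved across shard bounces.
        self.entry_shard = this_shard if entry_shard is None else entry_shard

    def for_another_shard(self, shard: int) -> "ClientState":
        return ClientState(shard, entry_shard=self.entry_shard)


client = ClientState(this_shard=0)            # request arrived on shard 0
bounced = client.for_another_shard(3)         # CAS bounce to tablet shard 3
before_fix = check_locality(bounced.this_shard, 3)    # None: payload omitted
after_fix = check_locality(bounced.entry_shard, 3)    # routing info returned
```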

Fixes SCYLLADB-2041
2026-05-15 11:56:14 +02:00
Piotr Dulikowski
0c016cecc3 Merge 'QOS: self-heal stale V1-to-V2 migration state on upgrade' from Alex Dathskovsky
service_levels: self-heal stale v1 marker after raft topology upgrade

This PR handles an upgrade corner case where a node may already be using
raft topology, while `system.scylla_local` still marks service levels as v1.

The problem was introduced by commit 2917ec5d51
("service:qos: service levels migration"), which added the service-levels
migration from `system_distributed.service_levels` to
`system.service_levels_v2` as part of the raft topology upgrade.

However, if the cluster had no service levels configured, there was no data
to migrate. In that case, the migration path could leave the local version
marker unchanged, so the node would later observe an inconsistent state:

  * raft topology is already enabled;
  * service levels are still marked as v1 in `system.scylla_local`.

Such clusters can be left in a stale state and fail startup during upgrade to
2026.2.

This PR makes the upgrade path self-healing.

The first commit restores `service_level_controller::migrate_to_v2()`, giving
us a group0-based path for writing the service-levels v2 state even after raft
topology is already in use.

The second commit wires this path into startup. When the node detects the
stale raft-topology + service-levels-v1 state, it retries the migration a
bounded number of times and updates the version marker to v2 instead of
failing startup.

With this change, clusters that were left in this stale state can recover
automatically during upgrade to 2026.
Fixes: SCYLLADB-1807

backport: 2026.2 2026.1 we need this functionality when we are upgrading older servers

Closes scylladb/scylladb#29749

* github.com:scylladb/scylladb:
  test/auth_cluster: simulate v1 state in self-heal test
  qos: self-heal stale service levels version on startup
  qos: reintroduce service levels v2 migration self-heal
2026-05-14 10:32:43 +02:00
Alex
6188bf3e01 test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
2026-05-13 17:55:20 +03:00
Alex
c2014f7e50 qos: self-heal stale service levels version on startup
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.

In that stale state, migrate service levels to v2 through group0 instead
of failing startup.
2026-05-13 17:55:20 +03:00
Piotr Dulikowski
f3ac35f9d2 Merge 'strong_consistency: wait for raft servers to start in create table' from Michael Litvak
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807

no backport - strong consistency is still experimental

Closes scylladb/scylladb#28843

* github.com:scylladb/scylladb:
  strong_consistency: wait for leader when starting a group
  strong_consistency: change wait for groups to start on startup
  strong_consistency: optimize wait_for_groups_to_start
  strong_consistency: wait for raft servers to start in create table
2026-05-13 16:42:05 +02:00
Piotr Dulikowski
3c2c814215 Merge 'db/view/view_building: replace system keyspace functions with mutation builder' from Michał Jadwiszczak
`system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.

This PR removes all `make_..._mutation()` system keyspace functions related to view building tasks and replaces them with mutation builder.

Refs https://github.com/scylladb/scylladb/issues/25929

This patch doesn't fix any bug; it only reduces the number of generated mutations, so there is no need to backport it.

Closes scylladb/scylladb#26557

* github.com:scylladb/scylladb:
  db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder
  db/view/view_building_task_mutation_builder: make uuid generator optional
  db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder
  db/view/view_building_task_mutation_builder: add helper method
2026-05-13 16:10:55 +02:00
Tomasz Grabiec
66439bb753 Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing

The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).

The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.

Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.

The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
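
The percentages in the two scenarios can be reproduced with a short calculation. The formula below, the gap between the most-loaded and least-loaded shard divided by the larger load, is an assumption that happens to match the quoted 0.78% and 33% figures; the exact definition used by `is_balanced()` in the balancer may differ.

```python
def relative_load_diff(max_load: float, min_load: float) -> float:
    # Assumed definition: gap between the extremes relative to the larger load.
    return (max_load - min_load) / max_load if max_load > 0 else 0.0


def is_balanced(max_load: float, min_load: float, threshold: float = 0.01) -> bool:
    # Stop balancing once the relative difference is within the threshold.
    return relative_load_diff(max_load, min_load) <= threshold


balanced_pct = relative_load_diff(257, 255) * 100     # about 0.78
unbalanced_pct = relative_load_diff(307, 205) * 100   # about 33.2
```

With equally sized tablets, tablet counts stand in for shard load, so 257 vs 255 stays under the 1% threshold while 307 vs 205 does not.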

Fixes: SCYLLADB-1775

This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2

Closes scylladb/scylladb#29756

* github.com:scylladb/scylladb:
  test: add test for intranode balance threshold in size-based mode
  tablet_allocator: apply balance threshold to intranode shard balancing
2026-05-13 13:09:52 +02:00
Patryk Jędrzejczak
3f2ff5a13f Merge 'Remove raft_group0::finish_setup_after_join' from Gleb Natapov
The function does nothing useful now.

No backport needed. Removes code.

Closes scylladb/scylladb#29828

* https://github.com/scylladb/scylladb:
  raft_group0: remove finish_setup_after_join function
  raft_group0: fix indentation after the last change
  raft_group: drop unneeded checks
2026-05-13 10:53:37 +02:00
Michał Jadwiszczak
1a32ccd8f6 db/system_keyspace: replace make_remove_view_building_task_mutation() with mutation builder
Again, get rid of the system keyspace method in favor of the mutation
builder, because `system.view_building_tasks` is a single-partition table.
2026-05-13 10:06:18 +02:00
Alex
ac0a19aab8 qos: reintroduce service levels v2 migration self-heal
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.

However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.
2026-05-13 10:16:02 +03:00
Michael Litvak
80bfc445a8 strong_consistency: wait for leader when starting a group
When starting the raft server for a group, wait for the leader before
completing the start operation. We want the group to be ready to accept
writes by the time the start is reported as completed, so that the first
writes do not pay the additional latency of waiting for a leader.
2026-05-13 08:43:26 +02:00
Michael Litvak
5f8322a820 strong_consistency: change wait for groups to start on startup
On startup, groups_manager::start() previously waited for the groups to
start. Change it to only start the raft servers in the background,
without waiting for them to be fully started, and wait for the servers
explicitly at a later stage of startup, after starting the messaging
service.

The reason is that fully starting the servers may require communication
over the messaging service. This is not required currently, but it will
be after the next commit.
2026-05-13 08:43:26 +02:00
Michael Litvak
e568ca2bd8 strong_consistency: optimize wait_for_groups_to_start
Instead of iterating over all raft groups in wait_for_groups_to_start
and checking whether we need to wait for each of them, maintain a list
of only the raft groups that are starting and need to be waited for.
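
A minimal sketch of the bookkeeping, assuming a hypothetical `GroupsManager` with string group ids (the real class tracks considerably more state):

```python
class GroupsManager:
    """Illustrative model: track starting groups separately from all groups."""

    def __init__(self) -> None:
        self.groups = {}        # group_id -> "starting" | "started"
        self.starting = set()   # only the groups that still need waiting

    def begin_start(self, group_id: str) -> None:
        self.groups[group_id] = "starting"
        self.starting.add(group_id)

    def finish_start(self, group_id: str) -> None:
        self.groups[group_id] = "started"
        self.starting.discard(group_id)

    def groups_to_wait_for(self) -> set:
        # Proportional to the number of starting groups, not all groups.
        return set(self.starting)
```
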
2026-05-13 08:43:26 +02:00
Michael Litvak
5a5c7c6241 strong_consistency: wait for raft servers to start in create table
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to all the relevant hosts that executes a group0 barrier, in
order to ensure the table and raft groups are created, then waits for
all raft groups on the host to finish starting and be ready.

Fixes SCYLLADB-807
2026-05-13 08:43:24 +02:00
Michał Jadwiszczak
e002665aa7 db/system_keyspace: replace make_view_building_task_mutation() with mutation builder
`system.view_building_tasks` is a single partition table, so it makes
more sense to use a mutation builder and generate 1 mutation per group0
command instead of generating multiple mutations.
2026-05-12 21:49:18 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables
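
The per-tablet filtering step could look something like the sketch below. It assumes that an sstable belongs to a tablet when its token range is fully contained in the tablet's range; that rule, the `SstableInfo` fields, and the function name are hypothetical, and the per-node part of the filter is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SstableInfo:
    name: str
    first_token: int
    last_token: int


def sstables_for_tablet(infos: List[SstableInfo],
                        tablet_first: int, tablet_last: int) -> List[SstableInfo]:
    # Keep only sstables whose [first_token, last_token] range lies entirely
    # inside the tablet's inclusive token range (assumed ownership rule).
    return [s for s in infos
            if tablet_first <= s.first_token and s.last_token <= tablet_last]
```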

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-invoking the API for a restore re-downloads everything)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming scheduling group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Avi Kivity
ddb1181103 Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1803

Closes scylladb/scylladb#29791

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-12 12:38:25 +03:00