Commit Graph

6230 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
2fbba4a071 raft, service, locator: create raft_fwd.hh and reduce heavy header includes
Create raft/raft_fwd.hh with lightweight type aliases (server_id, group_id,
term_t, index_t) backed only by raft/internal.hh, avoiding the heavy
raft/raft.hh (832 lines with futures, abort_source, bytes_ostream).

Replace raft/raft.hh with raft/raft_fwd.hh in headers that only need the
basic ID types: tablets.hh, topology_state_machine.hh,
topology_coordinator.hh, storage_service.hh, group0_fwd.hh,
view_building_coordinator.hh, view_building_worker.hh.

Also remove gossiper.hh and tablet_allocator.hh from storage_service.hh
(forward declarations suffice), and remove unused reactor.hh from
tablets.hh. Add explicit includes in .cc files that lost transitive
availability.
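
A rough sketch of what such a forward header might contain (illustrative only; it assumes the tagged-id helpers provided by raft/internal.hh, and the exact alias definitions may differ):

```
// raft/raft_fwd.hh -- illustrative sketch, not the verbatim header.
#pragma once

#include "raft/internal.hh"   // lightweight: id/tag helpers only

namespace raft {

// Basic ID types, so includers no longer need the heavy raft/raft.hh.
using term_t = internal::tagged_uint64<struct term_tag>;
using index_t = internal::tagged_uint64<struct index_tag>;
using server_id = internal::tagged_id<struct server_id_tag>;
using group_id = internal::tagged_id<struct group_id_tag>;

} // namespace raft
```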
2026-04-17 01:08:04 +03:00
Yaniv Michael Kaul
be5fa64d36 db: break gossiper.hh include from system_keyspace.hh
Extract loaded_endpoint_state into a standalone lightweight header to
avoid pulling the heavy gossiper.hh (and transitively query-result-set.hh)
into every includer of system_keyspace.hh. Add explicit includes where
the full definitions are actually needed.

Reduces clean dev build time by ~2 minutes (-8%).
2026-04-16 23:27:55 +03:00
Yaniv Michael Kaul
5c918d29cc service: remove unused storage_service.hh include from storage_proxy.hh
storage_proxy.hh included storage_service.hh but never referenced any
symbol from it. storage_service.hh costs 3.7s to parse per file, and
storage_proxy.hh has 75 direct includers. While most of those also
include database.hh (which shares transitive deps), removing this
unnecessary include still reduces total parse work.

Speedup: part of a series measured at -5.8% wall-clock improvement
(same-session A/B: 16m14s -> 15m17s at -j16, 16 cores).
2026-04-16 18:22:56 +03:00
Yaniv Michael Kaul
8ad8e76c3b cql3, service, test: add explicit includes for headers losing transitive availability
Add explicit #include directives for headers that are currently
available transitively through cql3/query_processor.hh but will stop
being available after a subsequent refactoring that removes the
loading_cache include chain.

Files changed:
- cql3/statements/drop_keyspace_statement.cc: add unimplemented.hh
- cql3/statements/truncate_statement.cc: add unimplemented.hh
- cql3/statements/batch_statement.cc: add result_message.hh
- cql3/statements/broadcast_modification_statement.cc: add result_message.hh
- service/paxos/paxos_state.cc: add result_message.hh
- test/lib/cql_test_env.cc: add result_message.hh
- table_helper.cc: add result_message.hh

No functional change. Prepares for subsequent query_processor.hh cleanup.
2026-04-15 04:20:49 +03:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Avi Kivity
22949bae52 Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak
Implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.

* Tablet merge simply merges the histograms of segments of one compaction group with another.
* For tablet split we take the segments from the source compaction group, read them, write all live records to separate segments according to the split classifier, and move the separated segments to the target compaction groups.
* For tablet migration we use stream_blob, similarly to file streaming of sstables. We add a new op type for streaming a logstor segment. On the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* We also make some improvements to recovery and loading of segments. We add a segment header that contains useful information for non-mixed segments, such as the table and token range (a hypothetical sketch follows this list).
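
A purely hypothetical illustration of the kind of information such a header could carry (field names and layout are made up here; the actual on-disk format is described in docs/dev):

```
#include <cstdint>

// Hypothetical sketch of a non-mixed logstor segment header; not the real format.
struct segment_header {
    uint32_t version;       // on-disk format version
    uint32_t crc;           // checksum guarding the header
    uint64_t table_id_msb;  // owning table UUID, high 64 bits (non-mixed segments only)
    uint64_t table_id_lsb;  // owning table UUID, low 64 bits
    int64_t  first_token;   // first token of the range covered by the segment
    int64_t  last_token;    // last token of the range covered by the segment
};
```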

Refs SCYLLADB-770

no backport - still a new and experimental feature

Closes scylladb/scylladb#29207

* github.com:scylladb/scylladb:
  test: logstor: additional logstor tests
  docs/dev: add logstor on-disk format section
  logstor: add version and crc to buffer header
  test: logstor: tablet split/merge and migration
  logstor: enable tablet balancing
  logstor: streaming of logstor segments using stream_blob
  logstor: add take_logstor_snapshot
  logstor: segment input/output stream
  logstor: implement compaction_group::cleanup
  logstor: tablet split
  logstor: tablet merge
  logstor: add compaction reenabler
  logstor: add segment header
  logstor: serialize writes to active segment
  replica: extend compaction_group functions for logstor
  replica: add compaction_group_for_logstor_segment
  logstor: code cleanup
2026-04-12 16:11:12 +03:00
Avi Kivity
8ccee6803e Merge 'Remove upgrade view builder' from Gleb Natapov
Since we no longer support upgrades from versions that do not support
v2 of the "view building status" code (building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version.

v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.

No backport needed since this is code removal.

Closes scylladb/scylladb#29105

* github.com:scylladb/scylladb:
  view: drop unused v1 builder code
  view: remove upgrade to raft code
2026-04-12 00:39:26 +03:00
Piotr Dulikowski
32e3a01718 Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek
Motivation
----------

Since strongly consistent tables are based on the concept of Raft
groups, operations on them can get stuck for indefinite amounts of
time. That may be problematic, and so we'd like to implement a way
to cancel those operations at suitable times.

Description of solution
-----------------------

The situations we focus on are the following:

* Timed-out queries
* Leader changes
* Tablet migrations
* Table drops
* Node shutdowns

We handle each of them and provide validation tests.

Implementation strategy
-----------------------

1. Auxiliary commits.
2. Abort operations on timeout.
3. Abort operations on tablet removal.
4. Extend `client_state`.
5. Abort operation on shutdown.
6. Help `state_machine` be aborted as soon as possible.

Tests
-----

We provide tests that validate the correctness of the solution.

The total time spent on `test_strong_consistency.py`
(measured on my local machine, dev mode):

Before:
```
real    0m31.809s
user    1m3.048s
sys     0m21.812s
```

After:
```
real    0m34.523s
user    1m10.307s
sys     0m27.223s
```

The incremental differences in time can be found in the commit messages.

Fixes SCYLLADB-429

Backport: not needed. This is an enhancement to an experimental feature.

Closes scylladb/scylladb#28526

* github.com:scylladb/scylladb:
  service: strong_consistency: Abort state_machine::apply when aborting server
  service: strong_consistency: Abort ongoing operations when shutting down
  service: client_state: Extend with abort_source
  service: strong_consistency: Handle abort when removing Raft group
  service: strong_consistency: Abort Raft operations on timeout
  service: strong_consistency: Use timeout when mutating
  service: strong_consistency: Fix indentation
  service: strong_consistency: Enclose coordinator methods with try-catch
  service: strong_consistency: Crash at unexpected exception
  test: cluster: Extract default config & cmdline in test_strong_consistency.py
2026-04-10 11:11:21 +02:00
Patryk Jędrzejczak
751bf31273 Merge 'More gossiper cleanups' from Gleb Natapov
The PR contains more code cleanups, mostly in the gossiper. It drops more gossiper states, leaving only NORMAL and SHUTDOWN; all other states are now checked against the topology state. Those two are kept because the SHUTDOWN state is propagated through the gossiper only, and when a node is not in SHUTDOWN it should be in some other state.

No need to backport. Cleanups.

Closes scylladb/scylladb#29129

* https://github.com/scylladb/scylladb:
  storage_service: cleanup unused code
  storage_service: simplify get_peer_info_for_update
  gossiper: send shutdown notifications in parallel
  gms: remove unused code
  virtual_tables: no need to call gossiper if we already know that the node is in shutdown
  gossiper: print node state from raft topology in the logs
  gossiper: use is_shutdown instead of code it manually
  gossiper: mark endpoint_state(inet_address ip) constructor as explicit
  gossiper: remove unused code
  gossiper: drop last use of LEFT state and drop the state
  gossiper: drop unused STATUS_BOOTSTRAPPING state
  gossiper: rename is_dead_state to is_left since this is all that the function checks now.
  gossiper: use raft topology state instead of gossiper one when checking node's state
  storage_service: drop check_for_endpoint_collision function
  storage_service: drop is_first_node function
  gossiper: remove unused REMOVED_TOKEN state
  gossiper: remove unused advertise_token_removed function
2026-04-10 09:56:20 +02:00
Michał Hudobski
c8b9fde828 auth: allow VECTOR_SEARCH_INDEXING permission to access system.tablets
Add system.tablets to the set of system resources that can be
accessed with the VECTOR_SEARCH_INDEXING permission.

Fixes: VECTOR-605

Closes scylladb/scylladb#29397
2026-04-09 21:53:07 +03:00
Gleb Natapov
dbaba7ab8a storage_service: cleanup unused code
Remove an unused definition and duplicate includes.
2026-04-09 13:31:41 +03:00
Gleb Natapov
b050b593b3 storage_service: simplify get_peer_info_for_update
It does nothing for fields managed in raft, so drop their processing.
2026-04-09 13:31:41 +03:00
Gleb Natapov
67102496c8 gossiper: drop last use of LEFT state and drop the state
Decommission sets the LEFT gossiper state only to prevent a shutdown
notification from being issued by the node during shutdown. Since the
notification code now checks the state in the raft topology, this is no
longer needed.
2026-04-09 13:31:39 +03:00
Gleb Natapov
7c895ced19 gossiper: rename is_dead_state to is_left since this is all that the function checks now. 2026-04-09 13:31:38 +03:00
Gleb Natapov
c17c4806a1 storage_service: drop check_for_endpoint_collision function
All the checks it performs are also done by the join coordinator, and the
join coordinator uses the more reliable raft state instead of the gossiper one.
2026-04-09 13:31:37 +03:00
Gleb Natapov
1ac8edb22b storage_service: drop is_first_node function
It makes no sense now since the first node to bootstrap is determined by
the discover_group0 algorithm.
2026-04-09 13:31:37 +03:00
Gleb Natapov
681aa9ebe1 gossiper: remove unused REMOVED_TOKEN state 2026-04-09 13:31:37 +03:00
Dawid Mędrek
f0dfe29d88 service: strong_consistency: Abort state_machine::apply when aborting server
The state machine used by strongly consistent tablets may need a read
barrier if the local schema is insufficient to resolve pending
mutations [1]. That read barrier may block for a long time.

When a strongly consistent tablet is being removed, we'd like to cancel
all ongoing executions of `state_machine::apply`: the shard is no
longer responsible for the tablet, so it doesn't matter what the outcome
is.

---

In the implementation, we abort the operations by simply throwing
an exception from `state_machine::apply` and not doing anything.
That's a red flag considering that it may lead to the instance
being killed on the spot [2].

Fortunately for us, strongly consistent tables use the default Raft
server implementation, i.e. `raft::server_impl`, which actually
handles one type of an exception thrown by the method: namely,
`abort_requested_exception`, which is the default exception thrown
by `seastar::abort_source` [3]. We leverage this property.
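
A minimal sketch of the mechanism being leveraged (hypothetical names, not the actual state machine): apply() observes an abort_source and lets abort_requested_exception propagate, which raft::server_impl treats as a clean abort rather than a fatal error.

```
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>

class sc_state_machine_sketch {
    seastar::abort_source _as;   // triggered when the group is removed
public:
    seastar::future<> apply_one() {
        // Throws seastar::abort_requested_exception once abort was requested;
        // raft::server_impl handles exactly this exception type from apply().
        _as.check();
        // ... perform the read barrier and apply the mutation here ...
        return seastar::make_ready_future<>();
    }
    void abort() { _as.request_abort(); }
};
```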

---

Unfortunately, `raft::server_impl::abort` isn't perfectly suited for
us. If we look into its code, we'll see that the relevant portion of
the procedure boils down to three steps:

1. Prevent scheduling adding new entries.
2. Wait for the applier fiber.
3. Abort the state machine.

Since aborting the state machine happens only after the applier fiber
has already finished, there will no longer be anything to abort. Either
all executions of `state_machine::apply` have already finished, or they
are hanging and we cannot do anything.

That's a pre-existing problem that we won't be solving here (even
though it's possible). We hope the problem will be solved, and it seems
likely: the code suggests that the behavior is not intended. For more
details, see e.g. [4].

---

We provide two validation tests. They simulate aborting
`state_machine::apply` in two different scenarios:

* when the table is dropped (which should also cover the case of tablet
  migration),
* when the node is shutting down.

The value of the tests isn't high since they don't ensure that the
state of the group is still valid (though it should be), nor do they
perform any other check. Instead, we rely on the testing framework to
spot any anomalies or errors. That's probably the best we can do at
the moment.

Unfortunately, both tests are marked as skipped because of the current
limitations of `raft::server_impl::abort` described above and in [4].

References:
[1] 4c8dba1
[2] See the description of `raft::state_machine` in `raft/raft.hh`.
[3] See `server_impl::applier_fiber` in `raft/server.cc`.
[4] SCYLLADB-1056
2026-04-09 11:36:51 +02:00
Dawid Mędrek
ad8a263683 service: strong_consistency: Abort ongoing operations when shutting down
These changes are complementary to those from a recent commit where we
handled aborting ongoing operations during tablet events, such as
tablet migration. In this commit, we consider the case of shutting down
a node.

When a node is shutting down, we eventually close the connections. When
the client can no longer get a response from the server, it makes no
sense to continue with the queries. We'd like to cancel them at that
point.

We leverage the abort source passed via `client_state` down to
the strongly consistent coordinator. This way, the transport layer can
communicate with it and signal that the queries should be canceled.
The abort source is triggered by the CQL server (cf.
`generic_server::server::{stop,shutdown}`).

---

Note that this is not an optional change. In fact, if we don't abort
those requests, we might hang for an indefinite amount of time when
executing the following code in `main.cc`:

```
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {
    ss.local().drain_on_shutdown().get();
});
```

The problem boils down to the fact that `generic_server::server::stop`
will wait for all connections to be closed, but that won't happen until
all ongoing operations (at least those to strongly consistent tables)
are finished.

It's important to highlight that even though we hang on this, the
client can no longer get any response. Thus, it's crucial that at that
point we simply abort ongoing operations to proceed with the rest of
shutdown.

---

Two tests are added to verify that the implementation is correct:
one focusing on local operations, the other -- on a forwarded write.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

After:
```
real    0m32.024s
user    1m10.751s
sys     0m23.871s
```

Individual runs of the added tests:

test_queries_when_shutting_down:
```
real    0m12.818s
user    0m36.726s
sys     0m4.577s
```

test_abort_forwarded_write_upon_shutdown:
```
real    0m12.930s
user    0m36.622s
sys     0m4.752s
```
2026-04-09 11:36:17 +02:00
Dawid Mędrek
4a87bdc778 service: client_state: Extend with abort_source
We make `client_state` store a pointer to an `abort_source`. This will
be useful in the following commit that will implement aborting ongoing
requests to strongly consistent tables upon connection shutdowns.
It might also be useful in some other places in the code in the future.

We set the abort source for client states in relevant places.
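
A minimal sketch of the idea (member and accessor names are illustrative, not the actual client_state API):

```
#include <seastar/core/abort_source.hh>

// Illustrative only: client_state keeps a non-owning pointer to an abort_source
// owned by the transport layer (the CQL connection), which triggers it on shutdown.
class client_state_sketch {
    seastar::abort_source* _abort_source = nullptr;   // may be null for internal queries
public:
    void set_abort_source(seastar::abort_source& as) { _abort_source = &as; }
    seastar::abort_source* get_abort_source() const { return _abort_source; }
};
```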
2026-04-09 11:35:35 +02:00
Dawid Mędrek
89c049b889 service: strong_consistency: Handle abort when removing Raft group
When a strongly consistent Raft group is being removed, it means one of
the following cases:

(A) The node is shutting down and it's simply part of the shutdown
    procedure.

(B) The tablet is somehow leaving the replica. For example, due to:
    - Tablet migration
    - Tablet split/merge
    - Tablet removal (e.g. because the table is dropped)

In this commit, we focus on case (B). Case (A) will be handled in the
following one.

---

The code changes here are next to none, and there's a reason for that.

First, let's note that we've already implemented aborting timed-out
requests. There is a limit to how long a query can run and sooner or
later it will finish, regardless of what we do.

Second, we need to ask ourselves whether the case we're considering in
this commit (i.e. case (B)) is a situation where we'd like to speed up
the process. The answer is no.

Tablet migrations are effectively internal operations that are invisible
to the users. User requests are, quite obviously, the opposite of that.
Because of that, we want to patiently wait for the queries to finish or
time out, even though it's technically possible to abort them earlier.

Lastly, the changes in the code that actually appear in this commit are
not completely irrelevant either. We consider the important case of
the `leader_info_updater` fiber and argue that it's safe to not pass
any abort source to the Raft methods used by it.

---

Unfortunately, we don't have tablet migrations implemented yet [1],
so our testing capabilities are limited. Still, we provide a new test
that corresponds to case (B) described above. We simulate a tablet
migration by dropping a table and observe how reads and writes behave
in such a situation. There's no extremely careful validation involved
there, but that's what we can have for the time being.

Difference in time spent on the whole test file
`test_strong_consistency.py` on my local machine, in dev mode:

Before:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

After:
```
real    0m31.775s
user    1m4.475s
sys     0m22.615s
```

The time spent on the new test only:
```
real    0m5.264s
user    0m34.646s
sys     0m3.374s
```

References:
[1] SCYLLADB-868
2026-04-09 11:35:31 +02:00
Dawid Mędrek
7dcc3e85b9 service: strong_consistency: Abort Raft operations on timeout
If a query to a strongly consistent table, either a write or a read,
times out, we immediately abort the operation and throw an exception.

Unfortunately, due to the inconsistency in exception types thrown
on timeout by the many methods we use in the code, this results in
pretty messy `try-catch` clauses. Perhaps there's a better alternative
to this, but it's beyond the scope of this work, so we leave it as-is.
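
A rough sketch of the pattern (hypothetical helper, not the coordinator's actual code): arm a timer for the query timeout, let it trigger an abort_source that the Raft calls observe, and translate the resulting exception for the caller.

```
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/timer.hh>
#include <seastar/util/noncopyable_function.hh>
#include <chrono>
#include <stdexcept>

seastar::future<> with_query_timeout(seastar::abort_source& as,
                                     std::chrono::milliseconds timeout,
                                     seastar::noncopyable_function<seastar::future<>()> op) {
    seastar::timer<> t([&as] { as.request_abort(); });
    t.arm(timeout);
    try {
        co_await op();
    } catch (const seastar::abort_requested_exception&) {
        // Stand-in for the operation-specific timeout exception thrown to the client.
        throw std::runtime_error("strongly consistent operation timed out");
    }
    t.cancel();
}
```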

We provide a validation test that consists of three cases corresponding
to reads, writes, and waiting for the leader. They verify that the code
works as expected in all affected places.

A comparison of time spent on the whole `test_strong_consistency.py` on
my local machine, in dev mode:

Before:
```
real    0m32.185s
user    0m55.391s
sys     0m15.745s
```

After:
```
real  0m30.841s
user  1m3.294s
sys   0m21.091s
```

The time spent on the new test only:
```
real  0m7.077s
user  0m35.359s
sys   0m3.717s
```
2026-04-09 11:35:04 +02:00
Dawid Mędrek
2243e0ffea service: strong_consistency: Use timeout when mutating
We remove the inconsistency between reads and writes to strongly
consistent tables. Before the commit, only reads used a timeout.
Now, writes do as well.

Although the parameter isn't used yet, that will change in the following
commit. This is a prerequisite for it.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
fd9c907be1 service: strong_consistency: Fix indentation 2026-04-09 11:25:57 +02:00
Dawid Mędrek
ca7f24516e service: strong_consistency: Enclose coordinator methods with try-catch
We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They
do nothing at the moment, but we'll use them later. We do this now to
avoid noise in the upcoming commits.

We'll fix the indentation in the following commit.
2026-04-09 11:25:57 +02:00
Dawid Mędrek
e9ea9e7259 service: strong_consistency: Crash at unexpected exception
The loop shouldn't throw any other exception than the ones already
covered by the `catch` clauses. Crash, at least when
`abort_on_internal_error` is set, if we catch any other type since
that may be a sign of a bug.
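
A small sketch of that policy (assuming seastar's on_internal_error helper, which aborts when abort-on-internal-error is enabled and otherwise logs and throws; the wrapper and logger names here are made up):

```
#include <seastar/core/on_internal_error.hh>
#include <seastar/util/log.hh>

static seastar::logger sc_logger("strong_consistency");

// Anything not matched by the explicit catch clauses is treated as a potential bug.
void crash_on_unexpected_exception() {
    seastar::on_internal_error(sc_logger,
        "unexpected exception in the strong-consistency apply loop");
}
```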
2026-04-09 11:25:57 +02:00
Botond Dénes
76c8794f4f Merge 'Strong consistency: allow taking snapshots (but not transfer) and make them less likely' from Piotr Dulikowski
While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely.

While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking:

- The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log above which a snapshot is created from 2MB to 10MB.
- The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots.

I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally.

Refs: SCYLLADB-1115

Strong consistency is experimental, no need for backport.

Closes scylladb/scylladb#29189

* github.com:scylladb/scylladb:
  strong_consistency: fake taking and dropping snapshots
  strong_consistency: adjust limits for snapshots
2026-04-09 11:44:03 +03:00
Petr Gusev
7750d5737c strong consistency: replace local consistency with global
Currently we don't support 'local' consistency, which would
imply maintaining a separate raft group for each DC. What we
support is actually 'global' consistency -- one raft group
per tablet replica set. We don't plan to support local
consistency for the first GA.

Closes scylladb/scylladb#29221
2026-04-08 12:52:32 +02:00
Botond Dénes
aeefbda304 Merge 'Simplify and improve API describe_ring code flow' from Pavel Emelyanov
The endpoint in question has some places worth fixing, in particular

- the keyspace parameter is not validated
- the validated table name is resolved into table_id, but the id is unused
- two ugly static helpers to stream obtained token ranges into json

Improving the API code flow, not backporting

Closes scylladb/scylladb#29154

* github.com:scylladb/scylladb:
  api: Inline describe_ring JSON handling
  storage_service: Make describe_ring_for_table() take table_id
2026-04-08 10:50:07 +03:00
Nadav Har'El
4eeb9f4120 lwt, vector: write to CDC when vector index is enabled.
The vector-search feature introduced the somewhat confusing behavior of
enabling CDC without explicitly enabling CDC: When a vector index is
enabled on a table, CDC is "enabled" for it even if the user didn't
ask to enable CDC.

For this, write-path code began to use a new cdc_enabled() function
instead of checking schema.cdc_options.enabled() directly. This
cdc_enabled() function returns true if either cdc_options.enabled() or
has_vector_index() is true.
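
In other words, the helper boils down to the following (a stand-in sketch on a dummy type; the real function operates on the schema class):

```
// Minimal stand-in illustrating the described logic; not the actual declaration.
struct schema_like {
    bool cdc_options_enabled;   // what schema.cdc_options.enabled() would return
    bool has_vector_index;      // what schema.has_vector_index() would return
};

bool cdc_enabled(const schema_like& s) {
    // CDC is effectively on either when the user enabled it explicitly or when a
    // vector index (which uses CDC under the hood) exists on the table.
    return s.cdc_options_enabled || s.has_vector_index;
}
```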

Unfortunately, LWT writes continued to use cdc_options.enabled() instead
of the new cdc_enabled(). This means that if a vector index is used and
a vector is written using an LWT write, the new value is not indexed.

This patch fixes this bug. It also adds a regression test that fails
before this patch and passes afterwards - the new test verifies that
when a table has a vector index (but no explicit CDC enabled), the CDC
log is updated both after regular writes and after successful LWT writes.

This patch was also tested in the context of the upcoming vector-search-
for-Alternator pull request, which has a test reproducing this bug
(Alternator uses LWT frequently, so this is very important there).
It will also be tested by the vector-store test suite ("validator").

Fixes SCYLLADB-1342

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29300
2026-04-08 07:55:05 +03:00
Nadav Har'El
f590ee2b7e cdc, vector: fix CDC result tracker for vector indexes
When a table has a vector index, cdc::cdc_enabled() returns true because
vector index writes are implemented via the CDC augmentation path. However,
register_cdc_operation_result_tracker() was checking only
cdc_options().enabled(), which is false for tables that have a vector index
but not traditional CDC.

As a result, the operation_result_tracker was never attached to write
response handlers for vector-indexed tables. This tracker was added in
commit 1b92cbe, and its job is to update CDC operation metrics. Since
vector search really does use CDC under the hood, these metrics could be
useful when diagnosing problems.

Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which
covers both traditional CDC and vector-indexed tables.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29343
2026-04-07 15:54:51 +03:00
Avi Kivity
00409b61f1 Merge 'Add Vnodes to Tablets Migration Procedure' from Nikos Dragazis
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.

The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
    1. Mark a node for upgrade in the topology state.
    2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
    3. Wait for the node to return online before proceeding to the next node.
3. Finalize the migration:
    1. Update the keyspace schema to mark it as tablet-based.
    2. Clear the group0 state related to the migration.

From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.

During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.

The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.

Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.

New feature, no backport is needed.

Closes scylladb/scylladb#29065

* github.com:scylladb/scylladb:
  docs: Add ops guide for vnodes-to-tablets migration
  test: cluster: Add test for migration of multiple keyspaces
  test: cluster: Add test for error conditions
  test: cluster: Add vnodes->tablets migration test (rollback)
  test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
  test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
  scylla-nodetool: Add migrate-to-tablets subcommand
  api: Add REST endpoint for vnode-to-tablet migration status
  api: Add REST endpoint for migration finalization
  topology_coordinator: Add `finalize_migration` request
  database: Construct migrating tables with tablet ERMs
  api: Add REST endpoint for upgrading nodes to tablets
  api: Add REST endpoint for starting vnodes-to-tablets migration
  topology_state_machine: Add intended_storage_mode to system.topology
  distributed_loader: Wire vnode-based resharding into table populator
  replica: Pick any compaction group for resharding
  compaction: resharding_compaction: add vnodes_resharding option
  storage_service: Preserve ERM flavor of migrating tables
  tablet_allocator: Exclude migrating tables from load balancing
  feature_service: Add vnodes_to_tablets_migrations feature
2026-04-07 14:32:22 +03:00
Michael Litvak
b02349d755 logstor: streaming of logstor segments using stream_blob
Implement tablet migration for logstor tables by streaming segments
using stream_blob, similar to file streaming of sstables.

Take a snapshot of the logstor segments and create a stream_blob_info
vector with an entry per segment, holding the input stream that reads the
segment and an op of type file_ops::stream_logstor_segments.

The stream_blob_handler creates a logstor sink that allocates a segment
on the target shard and creates an output stream that writes to it. When
the sink is closed, it loads the segment.
2026-03-31 18:45:08 +02:00
Patryk Jędrzejczak
b9f82f6f23 raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start
A joining node hung forever if the topology coordinator added it to the
group 0 configuration before the node reached `post_server_start`. In
that case, `server->get_configuration().contains(my_id)` returned true
and the node broke out of the join loop early, skipping
`post_server_start`. `_join_node_group0_started` was therefore never set,
so the node's `join_node_response` RPC handler blocked indefinitely.
Meanwhile the topology coordinator's `respond_to_joining_node` call
(which has no timeout) hung forever waiting for the reply that never came.

Fix by only taking the early-break path when not starting as a follower
(i.e. when the node is the discovery leader or is restarting). A joining
node must always reach `post_server_start`.
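
The shape of the fix, reduced to a self-contained predicate (names are illustrative, not the actual join_group0 code):

```
// Illustrative only: the early break out of the join loop is taken only when the node
// is already in the group 0 configuration AND is not starting as a follower, i.e. it is
// the discovery leader or a restarting node. A joining follower keeps going and reaches
// post_server_start.
bool may_break_join_loop_early(bool already_in_group0_config, bool starting_as_follower) {
    return already_in_group0_config && !starting_as_follower;
}
```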

We also provide a regression test. It takes 6s in dev mode.

Fixes SCYLLADB-959

Closes scylladb/scylladb#29266
2026-03-31 12:33:56 +02:00
Nikos Dragazis
2a5e6b832a api: Add REST endpoint for vnode-to-tablet migration status
If the keyspace is migrating, the endpoint reports the intended and
actual storage mode for each node.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-25 19:11:24 +02:00
Nikos Dragazis
d09196068c api: Add REST endpoint for migration finalization
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization

When called, it issues a `finalize_migration` topology request and waits
for its completion.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:21:12 +02:00
Nikos Dragazis
c88ddecfca topology_coordinator: Add finalize_migration request
Vnodes-to-tablets migration needs a finalization step to finish or
roll back the migration. Finishing the migration involves switching the
keyspace schema to tablets and clearing the `intended_storage_mode` from
system.topology. Rolling back the migration involves deleting the tablet
maps and clearing the `intended_storage_mode`.

The finalization needs to be done as a topology request so that it is
mutually exclusive with other operations such as repair and TRUNCATE.

This patch introduces the `finalize_migration` global topology request
for this purpose. The request takes a keyspace name as an argument.
The direction of the finalization (i.e., forward path vs rollback) is
inferred from the `intended_storage_mode` of all nodes (not ideal,
should be made explicit).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:39 +02:00
Nikos Dragazis
2f93ab281b api: Add REST endpoint for upgrading nodes to tablets
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes}

This endpoint is part of the vnodes-to-tablets migration process and
controls a node's intended_storage_mode in system.topology. The storage
mode represents the node-local data distribution model, i.e., how data
are organized across shards. The node will apply the intended storage
mode to migrating tables upon next restart by resharding their SSTables
(either on vnode boundaries if intended_mode=tablets, or with the static
sharder if intended_mode=vnodes).

Note that this endpoint controls the intended_storage_mode of the local
node only. This has the nice benefit that once the API call returns, the
change has not only been committed to group0 but also applied to the
local node's state machine. This guarantees that the change is part of
the node's local copy upon next restart; no additional read barrier is
needed.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:20:35 +02:00
Nikos Dragazis
c4c3a95863 api: Add REST endpoint for starting vnodes-to-tablets migration
The endpoint is the following:

    POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}

Its purpose is to start the migration of a whole keyspace from vnodes to
tablets.

When called, Scylla will synchronously create a tablet map for each
table in the specified keyspace. The tablet maps of all tables are
identical and they mirror the vnode layout; they contain one tablet per
vnode and each tablet uses the same replica hosts and token boundaries
as the corresponding vnode.

The only difference from vnodes lies in the sharding approach. Tablets
are assigned to a single shard - using a round-robin strategy in this
patch - whereas vnodes are distributed evenly across all shards. If the
tablet count per shard is low and tablet sizes are uneven, or some
shards have more tablets than others, performance may degrade during the
migration process. For example, a cluster with i8g.48xlarge (192 vCPUs),
256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet
replicas per shard during the migration. One additional tablet or a
double-sized tablet would cause 25% overcommit.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 13:19:47 +02:00
Nikos Dragazis
b7f4ae8218 topology_state_machine: Add intended_storage_mode to system.topology
Part of the vnodes-to-tablets migration is to reshard the SSTables of
each node on vnode boundaries. Resharding is a heavy operation that
runs on startup while the node is offline. Since nodes can restart
for unexpected reasons, we need a flag to do it in a controllable way.

We also need the ability to roll back the migration, which requires
resharding in the opposite direction. This means a node must be aware of
the intended migration direction.

To address both requirements, this patch introduces a new column,
intended_storage_mode, in system.topology. A non-null value indicates
that a node should perform a migration and specifies the migration
direction.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
d153a95943 storage_service: Preserve ERM flavor of migrating tables
When a table is migrating from vnodes to tablets, the cluster is in a
mixed state where some nodes use vnode ERMs and others use tablet ERMs.
The ERM flavor is a node-local property that expresses the node's
storage organization.

Preserve the flavor across token metadata changes. The flavor needs to
stay in sync with the storage, but the storage can change only on startup,
as it requires resharding all SSTables to conform to the flavor.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Nikos Dragazis
4a3e26d5e3 tablet_allocator: Exclude migrating tables from load balancing
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.

With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Pavel Emelyanov
f112e42ddd raft: Fix split mutations freeze
Commit faa0ee9844 accidentally broke the way the split snapshot mutation
was frozen -- instead of appending the sub-mutation `m`, the commit kept
the old variable name `mut`, which in the new code corresponds to the
"old" non-split mutation.

Fixes #29051

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29052
2026-03-24 08:53:50 +02:00
Piotr Dulikowski
63067f594d strong_consistency: fake taking and dropping snapshots
Snapshots are not implemented yet for strong consistency - attempting to
take, transfer or drop a snapshot results in an exception. However, the
logic of our state machine forces a snapshot transfer every
raft::server::configuration::snapshot_threshold log entries, even if there
are no lagging replicas. We have actually encountered an issue in our benchmarks
where snapshots were being taken even though the cluster was not under
any disruption, and this is one of the possible causes.

It turns out that we can safely allow taking snapshots right now -
we can just implement it as a no-op and return a random UUID.
Similarly, dropping a snapshot can also be a no-op. This is safe
because snapshot transfer still throws an exception - as long as the
taken/recovered snapshots are never attempted to be transferred.
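
A small sketch of the resulting behavior (hypothetical stand-in types; the real code implements the raft state machine's snapshot hooks):

```
#include <seastar/core/future.hh>

struct fake_snapshot_support_sketch {
    using snapshot_id = unsigned long;   // stand-in for the real UUID-based id

    seastar::future<snapshot_id> take_snapshot() {
        static snapshot_id next = 0;
        // The real code returns a freshly generated random UUID instead.
        return seastar::make_ready_future<snapshot_id>(next++);
    }
    void drop_snapshot(snapshot_id) {
        // no-op: there is no snapshot state to discard
    }
    // Snapshot transfer is intentionally left unimplemented and keeps throwing.
};
```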
2026-03-23 17:03:36 +01:00
Piotr Dulikowski
dd1d3dd1ee strong_consistency: adjust limits for snapshots
Raft snapshots are not implemented yet for strong consistency. Adjust
the current raft group config to make them much less likely to occur
(a sketch of the resulting values follows the list):

- snapshot_threshold config option decides how many log entries need to
  be applied after the last snapshot before a new one is created. Set it
  to the maximum value for size_t in order to effectively disable it.
- snapshot_threshold_log_size defines a threshold for the log memory
  usage over which a snapshot is created. Increase it from the default
  2MB to 10MB.
- max_log_size defines the threshold for the log memory usage over which
  requests stop being admitted until the log is shrunk back by a
  snapshot. Set it to 20MB, as this option is recommended to be at least
  twice as large as snapshot_threshold_log_size.
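
A sketch of the resulting values (shown on a stand-alone struct; the real fields live in raft::server::configuration as named above):

```
#include <cstddef>
#include <limits>

struct strong_consistency_snapshot_limits_sketch {
    // Effectively disable count-based snapshotting.
    std::size_t snapshot_threshold = std::numeric_limits<std::size_t>::max();
    // Take a snapshot only once the log holds ~10MB in memory (default was 2MB).
    std::size_t snapshot_threshold_log_size = 10 * 1024 * 1024;
    // Stop admitting requests at 20MB, i.e. at least twice the threshold above.
    std::size_t max_log_size = 20 * 1024 * 1024;
};
```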

Refs: SCYLLADB-1115
2026-03-23 17:03:36 +01:00
Pavel Emelyanov
9a2e583f29 storage_service: Make describe_ring_for_table() take table_id
All callers already have it. It makes no difference to the method
itself which table identifier it works with, but this will help to
simplify the flow in the API handler (next patch)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-20 19:49:24 +03:00
Botond Dénes
5573c3b18e Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec
When it deadlocks, groups stop merging and the compaction group merge
backlog runs away.

Also, graceful shutdown will be blocked on it.

Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed out in 1
in 100 runs.

Reason for deadlock:

When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.

If we accumulate more than one merge cycle for the fiber, a deadlock
occurs. The lock order will be as follows.

Initial state:

 cg0: main
 cg1: main
 cg2: main
 cg3: main

After 1st merge:

 cg0': main [locked], merging_groups=[cg0.main, cg1.main]
 cg1': main [locked], merging_groups=[cg2.main, cg3.main]

After 2nd merge:

 cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]

The merge completion fiber will try to stop cg0'.main, which will be
blocked on the compaction lock, which is held by the reenabler in
_compaction_reenablers_for_merging, hence the deadlock.

The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding the old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.

Background merge is supposed to be a relatively quick operation; it stops
compaction groups, so it may wait for active requests, but it shouldn't
prolong the barrier indefinitely.

Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.

Fixes SCYLLADB-928

Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.

Closes scylladb/scylladb#29007

* github.com:scylladb/scylladb:
  tablets: Fix deadlock in background storage group merge fiber
  replica: table: Propagate old erm to storage group merge
  test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
  storage_service: Extract local_topology_barrier()
2026-03-20 09:05:52 +02:00
Piotr Dulikowski
171504c84f Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz
This patchset migrates the query_all_directly_granted, query_all,
get_attribute, and query_attribute_for_all functions to use the cache
instead of doing CQL queries. It also includes some preparatory
work which fixes cache update order and triggering.

The main motivation behind this is to make sure that all calls
from service_level_controller::auth_integration are cached,
which we achieve here.

An alternative implementation could move the whole auth_integration
data into the auth cache, but since auth_integration also manages
lifetime and contains service-level-specific logic, such a solution
would be too complex for little (if any) gain.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159
Backport: no, not a bug

Closes scylladb/scylladb#28791

* github.com:scylladb/scylladb:
  auth: switch query_attribute_for_all to use cache
  auth: switch get_attribute to use cache
  auth: cache: add heterogeneous map lookups
  auth: switch query_all to use cache
  auth: switch query_all_directly_granted to use cache
  auth: cache: add ability to go over all roles
  raft: service: reload auth cache before service levels
  service: raft: move update_service_levels_effective_cache check
2026-03-19 14:37:22 +01:00
Botond Dénes
2e47fd9f56 Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk
During decommission, we first mark a topology request as done, then shut
down the node, and only in the following steps do we remove the node from
the topology. Thus, a finished request does not imply that the node has
been removed from the topology.

Due to that, in node_ops_virtual_task::wait, while gathering children
from the whole cluster, we may hit a connection exception, because
the node is still in the topology even though it is down.

Modify the get_children method to ignore the exception and warn
about the failure instead.

Keep token_metadata_ptr in get_children to prevent topology from changing.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867

Needs backports to all versions

Closes scylladb/scylladb#29035

* github.com:scylladb/scylladb:
  tasks: fix indentation
  tasks: do not fail the wait request if rpc fails
  tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children
2026-03-19 10:03:18 +02:00
Gleb Natapov
77d3245e02 view: remove upgrade to raft code
Since we no longer support upgrades from versions that do not support
v2 of the view building code, we can remove the upgrade code and make
sure we do not boot with the old builder version.
2026-03-18 17:45:40 +02:00