scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 08:30:35 +00:00

Author	SHA1	Message	Date
Yaniv Michael Kaul	2fbba4a071	raft, service, locator: create raft_fwd.hh and reduce heavy header includes Create raft/raft_fwd.hh with lightweight type aliases (server_id, group_id, term_t, index_t) backed only by raft/internal.hh, avoiding the heavy raft/raft.hh (832 lines with futures, abort_source, bytes_ostream). Replace raft/raft.hh with raft/raft_fwd.hh in headers that only need the basic ID types: tablets.hh, topology_state_machine.hh, topology_coordinator.hh, storage_service.hh, group0_fwd.hh, view_building_coordinator.hh, view_building_worker.hh. Also remove gossiper.hh and tablet_allocator.hh from storage_service.hh (forward declarations suffice), and remove unused reactor.hh from tablets.hh. Add explicit includes in .cc files that lost transitive availability.	2026-04-17 01:08:04 +03:00
Yaniv Michael Kaul	be5fa64d36	db: break gossiper.hh include from system_keyspace.hh Extract loaded_endpoint_state into a standalone lightweight header to avoid pulling the heavy gossiper.hh (and transitively query-result-set.hh) into every includer of system_keyspace.hh. Add explicit includes where the full definitions are actually needed. Reduces clean dev build time by ~2 minutes (-8%).	2026-04-16 23:27:55 +03:00
Yaniv Michael Kaul	5c918d29cc	service: remove unused storage_service.hh include from storage_proxy.hh storage_proxy.hh included storage_service.hh but never referenced any symbol from it. storage_service.hh costs 3.7s to parse per file, and storage_proxy.hh has 75 direct includers. While most of those also include database.hh (which shares transitive deps), removing this unnecessary include still reduces total parse work. Speedup: part of a series measured at -5.8% wall-clock improvement (same-session A/B: 16m14s -> 15m17s at -j16, 16 cores).	2026-04-16 18:22:56 +03:00
Yaniv Michael Kaul	8ad8e76c3b	cql3, service, test: add explicit includes for headers losing transitive availability Add explicit #include directives for headers that are currently available transitively through cql3/query_processor.hh but will stop being available after a subsequent refactoring that removes the loading_cache include chain. Files changed: - cql3/statements/drop_keyspace_statement.cc: add unimplemented.hh - cql3/statements/truncate_statement.cc: add unimplemented.hh - cql3/statements/batch_statement.cc: add result_message.hh - cql3/statements/broadcast_modification_statement.cc: add result_message.hh - service/paxos/paxos_state.cc: add result_message.hh - test/lib/cql_test_env.cc: add result_message.hh - table_helper.cc: add result_message.hh No functional change. Prepares for subsequent query_processor.hh cleanup.	2026-04-15 04:20:49 +03:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	22949bae52	Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine. * tablet merge simply merges the histograms of segments of one compaction group with another. * for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups. * for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it. * we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range. Refs SCYLLADB-770 no backport - still a new and experimental feature Closes scylladb/scylladb#29207 * github.com:scylladb/scylladb: test: logstor: additional logstor tests docs/dev: add logstor on-disk format section logstor: add version and crc to buffer header test: logstor: tablet split/merge and migration logstor: enable tablet balancing logstor: streaming of logstor segments using stream_blob logstor: add take_logstor_snapshot logstor: segment input/output stream logstor: implement compaction_group::cleanup logstor: tablet split logstor: tablet merge logstor: add compaction reenabler logstor: add segment header logstor: serialize writes to active segment replica: extend compaction_group functions for logstor replica: add compaction_group_for_logstor_segment logstor: code cleanup	2026-04-12 16:11:12 +03:00
Avi Kivity	8ccee6803e	Merge 'Remove upgrade view builder' from Gleb Natapov Since we no longer support upgrades from versions that do not support v2 of the "view building status" code (building status is managed by raft), we can remove the v1 code and the upgrade code and make sure we do not boot with the old "builder status" version. The v2 version was introduced by `8d25a4d678`, which is included in scylla-2025.1.0. No backport needed since this is code removal. Closes scylladb/scylladb#29105 * github.com:scylladb/scylladb: view: drop unused v1 builder code view: remove upgrade to raft code	2026-04-12 00:39:26 +03:00
Piotr Dulikowski	32e3a01718	Merge 'service: strong_consistency: Allow for aborting operations' from Dawid Mędrek Motivation ---------- Since strongly consistent tables are based on the concept of Raft groups, operations on them can get stuck for indefinite amounts of time. That may be problematic, and so we'd like to implement a way to cancel those operations at suitable times. Description of solution ----------------------- The situations we focus on are the following: * Timed-out queries * Leader changes * Tablet migrations * Table drops * Node shutdowns We handle each of them and provide validation tests. Implementation strategy ----------------------- 1. Auxiliary commits. 2. Abort operations on timeout. 3. Abort operations on tablet removal. 4. Extend `client_state`. 5. Abort operation on shutdown. 6. Help `state_machine` be aborted as soon as possible. Tests ----- We provide tests that validate the correctness of the solution. The total time spent on `test_strong_consistency.py` (measured on my local machine, dev mode): Before: ``` real 0m31.809s user 1m3.048s sys 0m21.812s ``` After: ``` real 0m34.523s user 1m10.307s sys 0m27.223s ``` The incremental differences in time can be found in the commit messages. Fixes SCYLLADB-429 Backport: not needed. This is an enhancement to an experimental feature. 
Closes scylladb/scylladb#28526 * github.com:scylladb/scylladb: service: strong_consistency: Abort state_machine::apply when aborting server service: strong_consistency: Abort ongoing operations when shutting down service: client_state: Extend with abort_source service: strong_consistency: Handle abort when removing Raft group service: strong_consistency: Abort Raft operations on timeout service: strong_consistency: Use timeout when mutating service: strong_consistency: Fix indentation service: strong_consistency: Enclose coordinator methods with try-catch service: strong_consistency: Crash at unexpected exception test: cluster: Extract default config & cmdline in test_strong_consistency.py	2026-04-10 11:11:21 +02:00
Patryk Jędrzejczak	751bf31273	Merge 'More gossiper cleanups' from Gleb Natapov The PR contains more code cleanups, mostly in gossiper. Dropping more gossiper state leaving only NORMAL and SHUTDOWN. All other states are checked against topology state. Those two are left because SHUTDOWN state is propagated through gossiper only and when the node is not in SHUTDOWN it should be in some other state. No need to backport. Cleanups. Closes scylladb/scylladb#29129 * https://github.com/scylladb/scylladb: storage_service: cleanup unused code storage_service: simplify get_peer_info_for_update gossiper: send shutdown notifications in parallel gms: remove unused code virtual_tables: no need to call gossiper if we already know that the node is in shutdown gossiper: print node state from raft topology in the logs gossiper: use is_shutdown instead of code it manually gossiper: mark endpoint_state(inet_address ip) constructor as explicit gossiper: remove unused code gossiper: drop last use of LEFT state and drop the state gossiper: drop unused STATUS_BOOTSTRAPPING state gossiper: rename is_dead_state to is_left since this is all that the function checks now. gossiper: use raft topology state instead of gossiper one when checking node's state storage_service: drop check_for_endpoint_collision function storage_service: drop is_first_node function gossiper: remove unused REMOVED_TOKEN state gossiper: remove unused advertise_token_removed function	2026-04-10 09:56:20 +02:00
Michał Hudobski	c8b9fde828	auth: allow VECTOR_SEARCH_INDEXING permission to access system.tablets Add system.tablets to the set of system resources that can be accessed with the VECTOR_SEARCH_INDEXING permission. Fixes: VECTOR-605 Closes scylladb/scylladb#29397	2026-04-09 21:53:07 +03:00
Gleb Natapov	dbaba7ab8a	storage_service: cleanup unused code Remove unused definition and double includes.	2026-04-09 13:31:41 +03:00
Gleb Natapov	b050b593b3	storage_service: simplify get_peer_info_for_update It does nothing for fields managed in raft, so drop their processing.	2026-04-09 13:31:41 +03:00
Gleb Natapov	67102496c8	gossiper: drop last use of LEFT state and drop the state Decommission sets the LEFT gossiper state only to prevent the node from issuing a shutdown notification during shutdown. Since the notification code now checks the state in raft topology, this is no longer needed.	2026-04-09 13:31:39 +03:00
Gleb Natapov	7c895ced19	gossiper: rename is_dead_state to is_left since this is all that the function checks now.	2026-04-09 13:31:38 +03:00
Gleb Natapov	c17c4806a1	storage_service: drop check_for_endpoint_collision function All the checks that it does are also done by join coordinator and the join coordinator uses more reliable raft state instead of gossiper one.	2026-04-09 13:31:37 +03:00
Gleb Natapov	1ac8edb22b	storage_service: drop is_first_node function It makes no sense now since the first node to bootstrap is determined by the discover_group0 algorithm.	2026-04-09 13:31:37 +03:00
Gleb Natapov	681aa9ebe1	gossiper: remove unused REMOVED_TOKEN state	2026-04-09 13:31:37 +03:00
Dawid Mędrek	f0dfe29d88	service: strong_consistency: Abort state_machine::apply when aborting server The state machine used by strongly consistent tablets may block on a read barrier if the local schema is insufficient to resolve pending mutations [1]. To deal with that, we perform a read barrier that may block for a long time. When a strongly consistent tablet is being removed, we'd like to cancel all ongoing executions of `state_machine::apply`: the shard is no longer responsible for the tablet, so it doesn't matter what the outcome is. --- In the implementation, we abort the operations by simply throwing an exception from `state_machine::apply` and not doing anything. That's a red flag considering that it may lead to the instance being killed on the spot [2]. Fortunately for us, strongly consistent tables use the default Raft server implementation, i.e. `raft::server_impl`, which actually handles one type of an exception thrown by the method: namely, `abort_requested_exception`, which is the default exception thrown by `seastar::abort_source` [3]. We leverage this property. --- Unfortunately, `raft::server_impl::abort` isn't perfectly suited for us. If we look into its code, we'll see that the relevant portion of the procedure boils down to three steps: 1. Prevent scheduling adding new entries. 2. Wait for the applier fiber. 3. Abort the state machine. Since aborting the state machine happens only after the applier fiber has already finished, there will no longer be anything to abort. Either all executions of `state_machine::apply` have already finished, or they are hanging and we cannot do anything. That's a pre-existing problem that we won't be solving here (even though it's possible). We hope the problem will be solved, and it seems likely: the code suggests that the behavior is not intended. For more details, see e.g. [4]. --- We provide two validation tests. 
They simulate the abortion of `state_machine::apply` in two different scenarios: * when the table is dropped (which should also cover the case of tablet migration), * when the node is shutting down. The value of the tests isn't high since they don't ensure that the state of the group is still valid (though it should be), nor do they perform any other check. Instead, we rely on the testing framework to spot any anomalies or errors. That's probably the best we can do at the moment. Unfortunately, both tests are marked as skipped because of the current limitations of `raft::server_impl::abort` described above and in [4]. References: [1] `4c8dba1` [2] See the description of `raft::state_machine` in `raft/raft.hh`. [3] See `server_impl::applier_fiber` in `raft/server.cc`. [4] SCYLLADB-1056	2026-04-09 11:36:51 +02:00
Dawid Mędrek	ad8a263683	service: strong_consistency: Abort ongoing operations when shutting down These changes are complementary to those from a recent commit where we handled aborting ongoing operations during tablet events, such as tablet migration. In this commit, we consider the case of shutting down a node. When a node is shutting down, we eventually close the connections. When the client can no longer get a response from the server, it makes no sense to continue with the queries. We'd like to cancel them at that point. We leverage the abort source passed down via `client_state` down to the strongly consistent coordinator. This way, the transport layer can communicate with it and signal that the queries should be canceled. The abort source is triggered by the CQL server (cf. `generic_server::server::{stop,shutdown}`). --- Note that this is not an optional change. In fact, if we don't abort those requests, we might hang for an indefinite amount of time when executing the following code in `main.cc`: ``` // Register at_exit last, so that storage_service::drain_on_shutdown will be called first auto do_drain = defer_verbose_shutdown("local storage", [&ss] { ss.local().drain_on_shutdown().get(); }); ``` The problem boils down to the fact that `generic_server::server::stop` will wait for all connections to be closed, but that won't happen until all ongoing operations (at least those to strongly consistent tables) are finished. It's important to highlight that even though we hang on this, the client can no longer get any response. Thus, it's crucial that at that point we simply abort ongoing operations to proceed with the rest of shutdown. --- Two tests are added to verify that the implementation is correct: one focusing on local operations, the other -- on a forwarded write. 
Difference in time spent on the whole test file `test_strong_consistency.py` on my local machine, in dev mode: Before: ``` real 0m31.775s user 1m4.475s sys 0m22.615s ``` After: ``` real 0m32.024s user 1m10.751s sys 0m23.871s ``` Individual runs of the added tests: test_queries_when_shutting_down: ``` real 0m12.818s user 0m36.726s sys 0m4.577s ``` test_abort_forwarded_write_upon_shutdown: ``` real 0m12.930s user 0m36.622s sys 0m4.752s ```	2026-04-09 11:36:17 +02:00
Dawid Mędrek	4a87bdc778	service: client_state: Extend with abort_source We make `client_state` store a pointer to an `abort_source`. This will be useful in the following commit that will implement aborting ongoing requests to strongly consistent tables upon connection shutdowns. It might also be useful in some other places in the code in the future. We set the abort source for client states in relevant places.	2026-04-09 11:35:35 +02:00
Dawid Mędrek	89c049b889	service: strong_consistency: Handle abort when removing Raft group When a strongly consistent Raft group is being removed, it means one of the following cases: (A) The node is shutting down and it's simply part of the shutdown procedure. (B) The tablet is somehow leaving the replica. For example, due to: - Tablet migration - Tablet split/merge - Tablet removal (e.g. because the table is dropped) In this commit, we focus on case (B). Case (A) will be handled in the following one. --- The changes in the code are literally none, and there's a reason for it. First, let's note that we've already implemented abortion of timed-out requests. There is a limit to how long a query can run and sooner or later it will finish, regardless of what we do. Second, we need to ask ourselves if the case we're considering in this commit (i.e. case (B)) is a situation where we'd like to speed up the process. The answer is no. Tablet migrations are effectively internal operations that are invisible to the users. User requests are, quite obviously, the opposite of that. Because of that, we want to patiently wait for the queries to finish or time out, even though it's technically possible to trigger an abort earlier. Lastly, the changes in the code that actually appear in this commit are not completely irrelevant either. We consider the important case of the `leader_info_updater` fiber and argue that it's safe to not pass any abort source to the Raft methods used by it. --- Unfortunately, we don't have tablet migrations implemented yet [1], so our testing capabilities are limited. Still, we provide a new test that corresponds to case (B) described above. We simulate a tablet migration by dropping a table and observe how reads and writes behave in such a situation. There's no extremely careful validation involved there, but that's what we can have for the time being. 
Difference in time spent on the whole test file `test_strong_consistency.py` on my local machine, in dev mode: Before: ``` real 0m30.841s user 1m3.294s sys 0m21.091s ``` After: ``` real 0m31.775s user 1m4.475s sys 0m22.615s ``` The time spent on the new test only: ``` real 0m5.264s user 0m34.646s sys 0m3.374s ``` References: [1] SCYLLADB-868	2026-04-09 11:35:31 +02:00
Dawid Mędrek	7dcc3e85b9	service: strong_consistency: Abort Raft operations on timeout If a query, either a write, or a read to a strongly consistent table, times out, we immediately abort the operation and throw an exception. Unfortunately, due to the inconsistency in exception types thrown on timeout by the many methods we use in the code, it results in pretty messy `try-catch` clauses. Perhaps there's a better alternative to this, but it's beyond the scope of this work, so we leave it as-is. We provide a validation test that consists of three cases corresponding to reads, writes, and waiting for the leader. They verify that the code works as expected in all affected places. A comparison of time spent on the whole `test_strong_consistency.py` on my local machine, in dev mode: Before: ``` real 0m32.185s user 0m55.391s sys 0m15.745s ``` After: ``` real 0m30.841s user 1m3.294s sys 0m21.091s ``` The time spent on the new test only: ``` real 0m7.077s user 0m35.359s sys 0m3.717s ```	2026-04-09 11:35:04 +02:00
Dawid Mędrek	2243e0ffea	service: strong_consistency: Use timeout when mutating We remove the inconsistency between reads and writes to strongly consistent tables. Before the commit, only reads used a timeout. Now, writes do as well. Although the parameter isn't used yet, that will change in the following commit. This is a prerequisite for it.	2026-04-09 11:25:57 +02:00
Dawid Mędrek	fd9c907be1	service: strong_consistency: Fix indentation	2026-04-09 11:25:57 +02:00
Dawid Mędrek	ca7f24516e	service: strong_consistency: Enclose coordinator methods with try-catch We enclose `coordinator::{mutate,query}` with `try-catch` clauses. They do nothing at the moment, but we'll use them later. We do this now to avoid noise in the upcoming commits. We'll fix the indentation in the following commit.	2026-04-09 11:25:57 +02:00
Dawid Mędrek	e9ea9e7259	service: strong_consistency: Crash at unexpected exception The loop shouldn't throw any exception other than the ones already covered by the `catch` clauses. Crash, at least when `abort_on_internal_error` is set, if we catch any other type since that may be a sign of a bug.	2026-04-09 11:25:57 +02:00
Botond Dénes	76c8794f4f	Merge 'Strong consistency: allow taking snapshots (but not transfer) and make them less likely' from Piotr Dulikowski While working on benchmarks for strong consistency we noticed that the raft logic attempted to take snapshots during the benchmark. Snapshot transfer is not implemented for strong consistency yet and the methods that take or transfer snapshots throw exceptions. This causes the raft groups to stop working completely. While implementing snapshot transfers is out of scope, we can implement some mitigations now to stop the tests from breaking: - The first commit adjusts the configuration options. First, it disables periodic snapshotting (i.e. creating a snapshot every X log entries). Second, it increases the memory threshold for the raft log before which a snapshot is created from 2MB to 10MB. - The second commit relaxes the take snapshot / drop snapshot methods and makes it possible to actually use them - they are no-ops. It is still forbidden to transfer snapshots. I am including both commits because applying only the first one didn't completely prevent the issue from occurring when testing locally. Refs: SCYLLADB-1115 Strong consistency is experimental, no need for backport. Closes scylladb/scylladb#29189 * github.com:scylladb/scylladb: strong_consistency: fake taking and dropping snapshots strong_consistency: adjust limits for snapshots	2026-04-09 11:44:03 +03:00
Petr Gusev	7750d5737c	strong consistency: replace local consistency with global Currently we don't support 'local' consistency, which would imply maintaining separate raft group for each dc. What we support is actually 'global' consistency -- one raft group per tablet replica set. We don't plan to support local consistency for the first GA. Closes scylladb/scylladb#29221	2026-04-08 12:52:32 +02:00
Botond Dénes	aeefbda304	Merge 'Simplify and improve API describe_ring code flow' from Pavel Emelyanov The endpoint in question has some places worth fixing, in particular - the keyspace parameter is not validated - the validated table name is resolved into table_id, but the id is unused - two ugly static helpers to stream obtained token ranges into json Improving the API code flow, not backporting Closes scylladb/scylladb#29154 * github.com:scylladb/scylladb: api: Inline describe_ring JSON handling storage_service: Make describe_ring_for_table() take table_id	2026-04-08 10:50:07 +03:00
Nadav Har'El	4eeb9f4120	lwt, vector: write to CDC when vector index is enabled. The vector-search feature introduced the somewhat confusing feature of enabling CDC without explicitly enabling CDC: When a vector index is enabled on a table, CDC is "enabled" for it even if the user didn't ask to enable CDC. For this, write-path code began to use a new cdc_enabled() function instead of checking schema.cdc_options.enabled() directly. This cdc_enabled() function checks if either this enabled() is true, or has_vector_index() is true. Unfortunately, LWT writes continued to use cdc_options.enabled() instead of the new cdc_enabled(). This means that if a vector index is used and a vector is written using an LWT write, the new value is not indexed. This patch fixes this bug. It also adds a regression test that fails before this patch and passes afterwards - the new test verifies that when a table has a vector index (but no explicit CDC enabled), the CDC log is updated both after regular writes and after successful LWT writes. This patch was also tested in the context of the upcoming vector-search- for-Alternator pull request, which has a test reproducing this bug (Alternator uses LWT frequently, so this is very important there). It will also be tested by the vector-store test suite ("validator"). Fixes SCYLLADB-1342 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29300	2026-04-08 07:55:05 +03:00
Nadav Har'El	f590ee2b7e	cdc, vector: fix CDC result tracker for vector indexes When a table has a vector index, cdc::cdc_enabled() returns true because vector index writes are implemented via the CDC augmentation path. However, register_cdc_operation_result_tracker() was checking only cdc_options().enabled(), which is false for tables that have a vector index but not traditional CDC. As a result, the operation_result_tracker was never attached to write response handlers for vector-indexed tables. This tracker was added in commit `1b92cbe`, and its job is to update metrics of CDC operations, and since vector search really does use CDC under the hood, these metrics could be useful when diagnosing problems. Fix by using cdc::cdc_enabled() instead of cdc_options().enabled(), which covers both traditional CDC and vector-indexed tables. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29343	2026-04-07 15:54:51 +03:00
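The check described in the two commits above can be sketched as follows. This is a minimal illustration and not the ScyllaDB source: `schema_sketch` and `cdc_options_sketch` are hypothetical stand-ins for the real schema classes, and only the names `cdc_enabled()`, `cdc_options().enabled()` and `has_vector_index()` are taken from the commit messages.

```cpp
#include <cassert>

// Hypothetical stand-in for a table's CDC options.
struct cdc_options_sketch {
    bool explicitly_enabled = false;
    bool enabled() const { return explicitly_enabled; }
};

// Hypothetical stand-in for a table schema.
struct schema_sketch {
    cdc_options_sketch opts;
    bool vector_index = false;
    const cdc_options_sketch& cdc_options() const { return opts; }
    bool has_vector_index() const { return vector_index; }
};

// Combined check: true when CDC was enabled explicitly
// or when the table has a vector index.
inline bool cdc_enabled(const schema_sketch& s) {
    return s.cdc_options().enabled() || s.has_vector_index();
}

// The case both fixes address: a vector index without explicit CDC.
inline void demo_vector_index_without_cdc() {
    schema_sketch s;
    s.vector_index = true;
    assert(!s.cdc_options().enabled()); // the narrower check misses this table
    assert(cdc_enabled(s));             // the combined check covers it
}
```

Under these assumptions, any code path that tests only `cdc_options().enabled()` skips vector-indexed tables, which is the pattern both commits replace.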
Avi Kivity	00409b61f1	Merge 'Add Vnodes to Tablets Migration Procedure' from Nikos Dragazis This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets. The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following: 1. Create tablet maps for all tables in the keyspace. 2. Sequentially upgrade all nodes from vnodes to tablets: 1. Mark a node for upgrade in the topology state. 2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM. 3. Wait for the node to return online before proceeding to the next node. 4. Finalize the migration: 1. Update the keyspace schema to mark it as tablet-based. 2. Clear the group0 state related to the migration. From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded. During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization. The patch series consists of: * Load balancer adjustments to ignore tablets belonging to a migrating keyspace. * A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder. * A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction. 
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request. * Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor. * Cluster tests for the migration process. Fixes SCYLLADB-722. Fixes SCYLLADB-723. Fixes SCYLLADB-725. Fixes SCYLLADB-779. Fixes SCYLLADB-948. New feature, no backport is needed. Closes scylladb/scylladb#29065 * github.com:scylladb/scylladb: docs: Add ops guide for vnodes-to-tablets migration test: cluster: Add test for migration of multiple keyspaces test: cluster: Add test for error conditions test: cluster: Add vnodes->tablets migration test (rollback) test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes) test: cluster: Add vnodes->tablets migration test (1 table, 1 node) scylla-nodetool: Add migrate-to-tablets subcommand api: Add REST endpoint for vnode-to-tablet migration status api: Add REST endpoint for migration finalization topology_coordinator: Add `finalize_migration` request database: Construct migrating tables with tablet ERMs api: Add REST endpoint for upgrading nodes to tablets api: Add REST endpoint for starting vnodes-to-tablets migration topology_state_machine: Add intended_storage_mode to system.topology distributed_loader: Wire vnode-based resharding into table populator replica: Pick any compaction group for resharding compaction: resharding_compaction: add vnodes_resharding option storage_service: Preserve ERM flavor of migrating tables tablet_allocator: Exclude migrating tables from load balancing feature_service: Add vnodes_to_tablets_migrations feature	2026-04-07 14:32:22 +03:00
Michael Litvak	b02349d755	logstor: streaming of logstor segments using stream_blob implement tablet migration for logstor tables by streaming segments using stream_blob, similar to file streaming of sstables. take a snapshot of the logstor segments and create a stream_blob_info vector with entry for each segment with the input stream that reads the segment and an op of type file_ops::stream_logstor_segments. the stream_blob_handler creates a logstor sink that allocates a segment on the target shard and creates an output stream that writes to it. when the sink is closed it loads the segment.	2026-03-31 18:45:08 +02:00
Patryk Jędrzejczak	b9f82f6f23	raft_group0: join_group0: fix join hang when node joins group 0 before post_server_start A joining node hung forever if the topology coordinator added it to the group 0 configuration before the node reached `post_server_start`. In that case, `server->get_configuration().contains(my_id)` returned true and the node broke out of the join loop early, skipping `post_server_start`. `_join_node_group0_started` was therefore never set, so the node's `join_node_response` RPC handler blocked indefinitely. Meanwhile the topology coordinator's `respond_to_joining_node` call (which has no timeout) hung forever waiting for the reply that never came. Fix by only taking the early-break path when not starting as a follower (i.e. when the node is the discovery leader or is restarting). A joining node must always reach `post_server_start`. We also provide a regression test. It takes 6s in dev mode. Fixes SCYLLADB-959 Closes scylladb/scylladb#29266	2026-03-31 12:33:56 +02:00
Nikos Dragazis	2a5e6b832a	api: Add REST endpoint for vnode-to-tablet migration status If the keyspace is migrating, it reports the intended and actual storage mode for each node. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-25 19:11:24 +02:00
Nikos Dragazis	d09196068c	api: Add REST endpoint for migration finalization The endpoint is the following: POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace}/finalization When called, it issues a `finalize_migration` topology request and waits for its completion. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:21:12 +02:00
Nikos Dragazis	c88ddecfca	topology_coordinator: Add `finalize_migration` request Vnodes-to-tablets migration needs a finalization step to finish or roll back the migration. Finishing the migration involves switching the keyspace schema to tablets and clearing the `intended_storage_mode` from system.topology. Rolling back the migration involves deleting the tablet maps and clearing the `intended_storage_mode`. The finalization needs to be done as a topology request so that it is mutually exclusive with other operations such as repair and TRUNCATE. This patch introduces the `finalize_migration` global topology request for this purpose. The request takes a keyspace name as an argument. The direction of the finalization (i.e., forward path vs rollback) is inferred from the `intended_storage_mode` of all nodes (not ideal, should be made explicit). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:20:39 +02:00
Nikos Dragazis	2f93ab281b	api: Add REST endpoint for upgrading nodes to tablets The endpoint is the following: POST /storage_service/vnode_tablet_migrations/node/storage_mode?intended_mode={tablets,vnodes} This endpoint is part of the vnodes-to-tablets migration process and controls a node's intended_storage_mode in system.topology. The storage mode represents the node-local data distribution model, i.e., how data are organized across shards. The node will apply the intended storage mode to migrating tables upon next restart by resharding their SSTables (either on vnode boundaries if intended_mode=tablets, or with the static sharder if intended_mode=vnodes). Note that this endpoint controls the intended_storage_mode of the local node only. This has the nice benefit that once the API call returns, the change has not only been committed to group0 but also applied to the local node's state machine. This guarantees that the change is part of the node's local copy upon next restart; no additional read barrier is needed. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:20:35 +02:00
Nikos Dragazis	c4c3a95863	api: Add REST endpoint for starting vnodes-to-tablets migration The endpoint is the following: POST /storage_service/vnode_tablet_migrations/keyspaces/{keyspace} Its purpose is to start the migration of a whole keyspace from vnodes to tablets. When called, Scylla will synchronously create a tablet map for each table in the specified keyspace. The tablet maps of all tables are identical and they mirror the vnode layout; they contain one tablet per vnode and each tablet uses the same replica hosts and token boundaries as the corresponding vnode. The only difference from vnodes lies in the sharding approach. Tablets are assigned to a single shard - using a round-robin strategy in this patch - whereas vnodes are distributed evenly across all shards. If the tablet count per shard is low and tablet sizes are uneven, or some shards have more tablets than others, performance may degrade during the migration process. For example, a cluster with i8g.48xlarge (192 vCPUs), 256 vnodes per node and RF=3 will have 256 * 3 / 192 vCPUs = 4 tablet replicas per shard during the migration. One additional tablet or a double-sized tablet would cause 25% overcommit. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:19:47 +02:00
Nikos Dragazis	b7f4ae8218	topology_state_machine: Add intended_storage_mode to system.topology Part of the vnodes-to-tablets migration is to reshard the SSTables of each node on vnode boundaries. Resharding is a heavy operation that runs on startup while the node is offline. Since nodes can restart for unexpected reasons, we need a flag to do it in a controllable way. We also need the ability to roll back the migration, which requires resharding in the opposite direction. This means a node must be aware of the intended migration direction. To address both requirements, this patch introduces a new column, intended_storage_mode, in system.topology. A non-null value indicates that a node should perform a migration and specifies the migration direction. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Nikos Dragazis	d153a95943	storage_service: Preserve ERM flavor of migrating tables When a table is migrating from vnodes to tablets, the cluster is in a mixed state where some nodes use vnode ERMs and others use tablet ERMs. The ERM flavor is a node-local property that expresses the node's storage organization. Preserve the flavor across token metadata changes. The flavor needs to be on par with storage, but the storage can change only on startup, as it requires resharding all SSTables to conform with the flavor. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Nikos Dragazis	4a3e26d5e3	tablet_allocator: Exclude migrating tables from load balancing The tablet load balancer operates on all tablet-based tables that appear in the tablet metadata. With the introduction of the vnodes-to-tablets migration procedure later in this series, migrating tables will also appear in the tablet metadata, but they need to be treated as vnode tables until migration is finished. This patch excludes such tables from load balancing. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Pavel Emelyanov	f112e42ddd	raft: Fix split mutations freeze Commit `faa0ee9844` accidentally broke the way the split snapshot mutation was frozen: instead of appending the sub-mutation `m`, the commit kept the old variable name `mut`, which in the new code corresponds to the "old" non-split mutation. Fixes #29051 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29052	2026-03-24 08:53:50 +02:00
Piotr Dulikowski	63067f594d	strong_consistency: fake taking and dropping snapshots Snapshots are not implemented yet for strong consistency - attempting to take, transfer or drop a snapshot results in an exception. However, the logic of our state machine forces snapshot transfer even if there are no lagging replicas - every raft::server::configuration::snapshot_threshold log entries. We have actually encountered an issue in our benchmarks where snapshots were being taken even though the cluster was not under any disruption, and this is one of the possible causes. It turns out that we can safely allow for taking snapshots right now - we can just implement it as a no-op and return a random UUID. Conversely, dropping a snapshot can also be a no-op. This is safe as long as no transfer of the taken/recovered snapshots is ever attempted, because snapshot transfer still throws an exception.	2026-03-23 17:03:36 +01:00
Piotr Dulikowski	dd1d3dd1ee	strong_consistency: adjust limits for snapshots Raft snapshots are not implemented yet for strong consistency. Adjust the current raft group config to make them much less likely to occur: - The snapshot_threshold config option decides how many log entries need to be applied after the last snapshot before a new one is taken. Set it to the maximum value for size_t in order to effectively disable it. - snapshot_threshold_log_size defines the threshold for log memory usage above which a snapshot is created. Increase it from the default 2MB to 10MB. - max_log_size defines the threshold for log memory usage above which new requests are no longer admitted until the log is shrunk back by a snapshot. Set it to 20MB, as this option is recommended to be at least twice snapshot_threshold_log_size. Refs: SCYLLADB-1115	2026-03-23 17:03:36 +01:00
Pavel Emelyanov	9a2e583f29	storage_service: Make describe_ring_for_table() take table_id All callers already have it. The method itself does not care which table identifier it works with, but the change will help simplify the flow in the API handler (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 19:49:24 +03:00
Botond Dénes	5573c3b18e	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and the compaction group merge backlog will run away. Also, graceful shutdown will be blocked on it. Found by the flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. The lock order is as follows. Initial state: cg0: main; cg1: main; cg2: main; cg3: main. After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main]; cg1': main [locked], merging_groups=[cg2.main, cg3.main]. After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]. The merge completion fiber will try to stop cg0'.main, which will block on the compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence the deadlock. The fix is to wait for the background merge to finish before we start the next merge. This is achieved by holding the old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation: it stops compaction groups, so it may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest vulnerable version, due to `f9021777d8`.
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier()	2026-03-20 09:05:52 +02:00
Piotr Dulikowski	171504c84f	Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz This patchset migrates the query_all_directly_granted, query_all, get_attribute and query_attribute_for_all functions to use the cache instead of doing CQL queries. It also includes some preparatory work which fixes cache update order and triggering. The main motivation is to make sure that all calls from service_level_controller::auth_integration are cached, which we achieve here. An alternative implementation could move the whole auth_integration data into the auth cache, but since auth_integration also manages lifetime and contains service-level-specific logic, such a solution would be too complex for little (if any) gain. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159 Backport: no, not a bug Closes scylladb/scylladb#28791 * github.com:scylladb/scylladb: auth: switch query_attribute_for_all to use cache auth: switch get_attribute to use cache auth: cache: add heterogeneous map lookups auth: switch query_all to use cache auth: switch query_all_directly_granted to use cache auth: cache: add ability to go over all roles raft: service: reload auth cache before service levels service: raft: move update_service_levels_effective_cache check	2026-03-19 14:37:22 +01:00
Botond Dénes	2e47fd9f56	Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk During decommission, we first mark a topology request as done, then shut down the node, and in the following steps we remove the node from the topology. Thus, a finished request does not imply that the node has been removed from the topology. Because of that, in node_ops_virtual_task::wait, while gathering children from the whole cluster, we may hit a connection exception, because the node is still in the topology even though it is down. Modify the get_children method to ignore the exception and warn about the failure instead. Keep token_metadata_ptr in get_children to prevent the topology from changing. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867 Needs backports to all versions Closes scylladb/scylladb#29035 * github.com:scylladb/scylladb: tasks: fix indentation tasks: do not fail the wait request if rpc fails tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children	2026-03-19 10:03:18 +02:00
Gleb Natapov	77d3245e02	view: remove upgrade to raft code Since we no longer support upgrading from versions that do not support v2 of the view building code, we can remove the upgrade code and make sure we do not boot with an old builder version.	2026-03-18 17:45:40 +02:00
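
The per-shard arithmetic quoted in commit c4c3a95863 (256 vnodes per node, RF=3, 192 vCPUs) can be checked with a short sketch. The function names below are hypothetical and are not part of ScyllaDB. The model assumes one shard per vCPU and that each node holds vnodes * RF tablet replicas, as the formula in the commit message implies.

```python
def tablet_replicas_per_shard(vnodes_per_node: int, rf: int, shards: int) -> float:
    """Average number of tablet replicas per shard on one node during migration.

    Assumption (taken from the commit message's formula): each node holds
    vnodes_per_node * rf tablet replicas, spread round-robin over its shards.
    """
    return vnodes_per_node * rf / shards


def overcommit(base_load: float, extra_load: float = 1.0) -> float:
    """Relative extra load on a shard that carries `extra_load` more
    tablet-equivalents than the average `base_load`."""
    return extra_load / base_load


# Example from the commit message: i8g.48xlarge (192 vCPUs), 256 vnodes, RF=3.
per_shard = tablet_replicas_per_shard(256, 3, 192)
print(per_shard)                       # 4.0
print(f"{overcommit(per_shard):.0%}")  # 25%
```

With only four replicas per shard, a single extra tablet (or one tablet of double size) is a quarter of the shard's load, which is why the commit message warns about uneven performance when the per-shard tablet count is low.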

6230 Commits