scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 07:42:16 +00:00

Author	SHA1	Message	Date
Aleksandra Martyniuk	d874d355c2	service: skip load_sketch unload for excluded nodes on RF shrink When an RF change shrinks replicas on a DC and the node being shrunk is excluded, refresh_tablet_load_stats() only provides load_stats for that node if it has a cached snapshot from when the node was still up. If the snapshot is missing or predates the tables being shrunk (e.g. they were created after the node went down), stats stay incomplete. In that case load_sketch::unload() called from make_rf_change_plan() throws: Can't provide accurate load computation with incomplete load_stats for host: <uuid> Since an excluded node is not expected to come back, load_stats will never become complete, and the topology coordinator retries the plan infinitely, hanging ALTER KEYSPACE. Add a check for excluded nodes and skip unload() for them: we are removing the replica, so accurate load data for that node is not needed. For all other node states the throw-and-retry behavior is preserved. Modify test_excludenode_shrink_rf to always trigger the bug: a new error injection 'force_down_node_load_stats_invalid' forces the invalid-stats path in refresh_tablet_load_stats() for a down node, so the test does not depend on whether the load-stats refresher happened to cache the excluded node's stats while it was still up. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702. Closes scylladb/scylladb#29622	2026-05-15 17:46:28 +02:00
Tomasz Grabiec	66439bb753	Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili - Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible - Add a regression test that verifies the threshold is respected for intranode balancing The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards). The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path. Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations. The test creates a single node with 2 shards and 512 tablets: 1. Balanced scenario (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted 2. Unbalanced scenario (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted Fixes: SCYLLADB-1775 This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2 Closes scylladb/scylladb#29756 * github.com:scylladb/scylladb: test: add test for intranode balance threshold in size-based mode tablet_allocator: apply balance threshold to intranode shard balancing	2026-05-13 13:09:52 +02:00
Botond Dénes	e95eb21a16	Merge 'Tablet-aware restore' from Pavel Emelyanov The mechanics of the restore is like this - A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet - The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas - Each replica handles the RPC verb by - Reading the snapshot_sstables table - Filtering the read sstable infos against current node and tablet being handled - Downloading and attaching the filtered sstables This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code. This is first step towards SCYLLADB-197 and lacks many things. 
In particular - the API only works for single-DC cluster - the caller needs to "lock" tablet boundaries with min/max tablet count - not abortable - no progress tracking - sub-optimal (re-kicking API on restore will re-download everything again) - not re-attachable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node) - nodes download sstables in maintenance/streaming sched group (should be moved to maintenance/backup) Other follow-up items: - have an actual swagger object specification for `backup_location` Closes #28436 Closes #28657 Closes #28773 Closes scylladb/scylladb#28763 * github.com:scylladb/scylladb: docs: Update topology_over_raft.md with `restore` transition kind test: Add test for backup vs migration race test: Restore resilience test sstables_loader: Fail tablet-restore task if not all sstables were downloaded sstables_loader: mark sstables as downloaded after attaching sstables_loader: return shared_sstable from attach_sstable db: add update_sstable_download_status method db: add downloaded column to snapshot_sstables db: extract snapshot_sstables TTL into class constant test: Add a test for tablet-aware restore tablets: Implement tablet-aware cluster-wide restore messaging: Add RESTORE_TABLET RPC verb sstables_loader: Add method to download and attach sstables for a tablet tablets: Add restore_config to tablet_transition_info sstables_loader: Add restore_tablets task skeleton test: Add rest_client helper to kick newly introduced API endpoint api: Add /storage_service/tablets/restore endpoint skeleton sstables_loader: Add keyspace and table arguments to manfiest loading helper sstables_loader_helpers: just reformat the code sstables_loader_helpers: generalize argument and variable names sstables_loader_helpers: generalize get_sstables_for_tablet sstables_loader_helpers: add token getters for tablet filtering sstables_loader_helpers: remove underscores from struct members sstables_loader: move 
download_sstable and get_sstables_for_tablet sstables_loader: extract single-tablet SST filtering sstables_loader: make download_sstable static sstables_loader: fix formating of the new `download_sstable` function sstables_loader: extract single SST download into a function sstables_loader: add shard_id to minimal_sst_info sstables_loader: add function for parsing backup manifests split utility functions for creating test data from database_test export make_storage_options_config from lib/test_services rjson: Add helpers for conversions to dht::token and sstable_id Add system_distributed_keyspace.snapshot_sstables add get_system_distributed_keyspace to cql_test_env code: Add system_distributed_keyspace dependency to sstables_loader storage_service: Export export handle_raft_rpc() helper storage_service: Export do_tablet_operation() storage_service: Split transit_tablet() into two tablets: Add braces around tablet_transition_kind::repair switch	2026-05-12 16:24:13 +03:00
Avi Kivity	ddb1181103	Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes. The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan. Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced. Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing. This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2 Fixes: SCYLLADB-1803 Closes scylladb/scylladb#29791 * github.com:scylladb/scylladb: test: boost: add drain test for forced capacity-based balancing service: allow draining with forced capacity-based balancing	2026-05-12 12:38:25 +03:00
Ferenc Szili	aaead10e5d	tablet_allocator: apply balance threshold to intranode shard balancing The intranode shard balancing loop only stopped when the most-loaded and least-loaded shard were the same (src == dst), meaning it would keep issuing migrations until the load difference reached exactly 0. This caused unnecessary migrations for negligible imbalances. Apply the same is_balanced() threshold check that is already used for inter-node balancing, so that intranode migrations stop when the relative load difference between shards is within the configured size_based_balance_threshold (default 1%).	2026-05-12 10:34:16 +02:00
Pavel Emelyanov	17384d42e3	tablets: Implement tablet-aware cluster-wide restore This patch adds - Changes in sstables_loader::restore_tablets() method It populates the system_distributed_keyspace.snapshot_sstables table with the information read from the manifest - Implementation of tablet_restore_task_impl::run() method It emplaces a bunch of tablet migrations with "restore" kind - Topology coordinator handling of tablet_transition_stage::restore When seen, the coordinator calls RESTORE_TABLET RPC against all tablet replicas Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Calle Wilund	db1b92c185	service::load_balancer: Add metrics for repair and rebuild count Fixes #21115 Adds cluster counter for repairs, and dc counter for rebuilds Closes scylladb/scylladb#28985	2026-05-11 16:57:46 +03:00
Nadav Har'El	d4aa528834	Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili - Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error` - The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort. `get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables: - `make_sizing_plan`: skips to next table - `make_resize_plan`: skips to next table (merge suppression is moot) - `check_constraints`: returns `skip_info{}` with empty viable targets - `get_rs`: returns `nullptr`, checked by `check_constraints` The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it. Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot. Fixes: SCYLLADB-1664 This fix needs to be backported to versions: 2025.4, 2026.1 Closes scylladb/scylladb#29585 * github.com:scylladb/scylladb: test: verify load balancer handles dropped tables gracefully tablet_allocator: handle dropped tables gracefully in get_schema_and_rs	2026-05-10 22:07:51 +03:00
Ferenc Szili	906d2b817e	service: allow draining with forced capacity-based balancing When force_capacity_based_balancing is enabled, the tablet allocator balances by node and shard capacity rather than by tablet sizes. When the data needed for load balancing is incomplete, the balancer fails and waits until load_stats is available and correct for all the nodes. An exception to this is when a node is being drained and excluded: it is unreachable, and will not return. In this case the balancer has to do its best and ignore the missing data. This patch fixes a bug where forcing capacity based balancing made the balancer not ignore missing data in these cases, and instead abort the balancing.	2026-05-07 13:44:53 +02:00
Ferenc Szili	4987204f71	tablet_allocator: handle dropped tables gracefully in get_schema_and_rs The load balancer's get_schema_and_rs() would trigger on_internal_error when a table present in the token metadata snapshot had been concurrently dropped from the live schema. This race is possible because the balancer coroutine yields between building the candidate list and checking replication constraints, allowing a DROP TABLE schema mutation to be applied by another fiber in the meantime. Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables instead of aborting. Update all callers to skip dropped tables: - make_sizing_plan: continue to next table - make_resize_plan: continue to next table (merge suppression is moot) - check_constraints: return skip_info with empty viable targets - get_rs: return nullptr, checked by check_constraints	2026-04-27 10:33:53 +02:00
Tomasz Grabiec	cddde464ca	Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor. In the existing approach, during RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF. In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by the load balancer, which will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567. 
No backport needed; new feature. Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature	2026-04-22 01:46:11 +02:00
Piotr Szymaniak	4b6937b570	alternator/streams: Block tablet merges when Alternator Streams are enabled DynamoDB Streams API can only convey a single parent per stream shard. Tablet merges produce 2 parents, which is incompatible. When streams are requested on a tablet table, block tablet merges via tablet_merge_blocked (the allocator suppresses new merge decisions and revokes any active merge decision). add_stream_options() sets tablet_merge_blocked=true alongside enabled=true, so CreateTable needs no special handling — the flag is inert on vnode tables and immediately effective on tablet tables. For UpdateTable, CDC enablement is deferred: store the user's intent via enable_requested, and let the topology coordinator finalize enablement once no in-progress merges remain. A new helper, defer_enabling_streams_block_tablet_merges(), amends the CDC options to this deferred state. Disabling streams clears all flags, immediately re-allowing merges. The tablet allocator accesses the merge-blocked flag through a schema::tablet_merges_forbidden() accessor rather than reaching into CDC options directly. Mark test_parent_children_merge as xfail and remove downward (merge) steps from tablet_multipliers in test_parent_filtering and test_get_records_with_alternating_tablets_count.	2026-04-19 03:54:33 +02:00
Aleksandra Martyniuk	2c0de7d9b3	test: test multi RF changes	2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk	1b2b453782	service: tasks: allow aborting ongoing RF changes Allow aborting an ongoing RF change using task manager. RF change can only be aborted if: - it is currently paused (existing); - it is a multi-RF change that still has replicas to be added. In the second case, we set error for the request in system.topology_requests and set next_replication to replication_v2. This makes load balancer roll back the RF change.	2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk	1bafc8394c	service: handle multi_rf_change Extend keyspace_rf_change handler to handle multi_rf_change. multi_rf_change is allowed only if we add or remove DCs and the keyspace uses rack list replication factor. The handler adds the request id to topology::ongoing_rf_changes. The request is further processed by load balancer.	2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk	8fb91e245f	service: implement make_rf_change_plan In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). Node availability is checked at two levels for extending actions: 1) In prepare_per_rack_rf_change_plan: the entire RF change request is aborted if any node in the target dc+rack is down, or if there are no live (non-excluded) nodes at all. Shrinking is never aborted. 2) In make_rf_change_plan: extending is skipped for a given round if any normal, non-excluded node in the target dc+rack is missing from the balanced node set. Shrinking always proceeds regardless. The resulting behavior per node state combination (extending only): - all up -> proceed - some excluded + some up -> proceed (excluded nodes are skipped) - any down node -> abort - all excluded (no live) -> abort When the last step is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved (if request succeeded); - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests.	2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk	89a17491db	service: add keyspace_rf_change_plan to migration_plan Add keyspace_rf_change_plan to migration_plan. The keyspace_rf_change_plan consists of: - completion - info about the request for which all migrations are done. Only one request can be completed at a time, even if more have finished migrations (the rest will be completed later). Based on it: - next_replication is cleared; - new keyspace properties are saved (only if succeeded); - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. - aborts - info about requests that cannot complete because the required rf change is impossible (e.g. no available nodes in a required rack). Multiple requests can be aborted in a single plan. Based on each: - next_replication is set to current_replication (rolling back); - the request is marked as aborted with an error in system.topology_requests. The scheduled rebuilds will be kept in migration_plan::_migrations. Based on that the canonical_mutations are generated. Add update_topology_state_with_mixed_change and use it if any schema changes are required, i.e. if plan contains keyspace_rf_change_plan::completion.	2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk	d41c5a7db4	service: split update_node_load_on_migration Split update_node_load_on_migration into decrease_node_load and increase_node_load - in the following changes for rebuilds we will need only one of those at a time.	2026-04-17 09:58:07 +02:00
Avi Kivity	59ec93b86b	Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any token, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. We can also split and merge incrementally individual tablets. Currently, we do it for the whole table or nothing, which makes splits and merges take longer and cause wide swings of the count. This is not implemented in this PR yet, we still split/merge the whole table. Another reason is vnode to tablets migration. We now could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from CQL-coordinator point of view. Tablet count is still a power-of-two by default for newly created tables. It may be different if tablet map is created by non-standard means, or if per-table tablet option "pow2_count" is set to "false". 
build/release/scylla perf-tablets: Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%) Before: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 57456 KiB Copied in 0.014346 [ms] Cleared in 0.002698 [ms] Saved in 1234.685303 [ms] Read in 445.577881 [ms] Read mutations in 299.596313 [ms] 128 mutations Read required hosts in 247.482742 [ms] Size of canonical mutations: 33.945053 [MiB] Disk space used by system.tablets: 1.456761 [MiB] Tablet metadata reload: full 407.69ms partial 2.65ms ``` After: ``` Generating tablet metadata Total tablet count: 131072 Size of tablet_metadata in memory: 59504 KiB Copied in 0.032475 [ms] Cleared in 0.002965 [ms] Saved in 1093.877441 [ms] Read in 387.027100 [ms] Read mutations in 255.752121 [ms] 128 mutations Read required hosts in 211.202805 [ms] Size of canonical mutations: 33.954453 [MiB] Disk space used by system.tablets: 1.450162 [MiB] Tablet metadata reload: full 354.50ms partial 2.19ms ``` Closes scylladb/scylladb#28459 * github.com:scylladb/scylladb: test: boost: tablets: Add test for merge with arbitrary tablet count tablets, database: Advertise 'arbitrary' layout in snapshot manifest tablets: Introduce pow2_count per-table tablet option tablets: Prepare for non-power-of-two tablet count tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() tablets: Prepare resize_decision to hold data in decisions tablets: table: Make storage_group handle arbitrary merge boundaries tablets: Make stats update post-merge work with arbitrary merge boundaries locator: tablets: Support arbitrary tablet boundaries locator: tablets: Introduce tablet_map::get_split_token() dht: Introduce get_uniform_tokens()	2026-04-15 18:57:22 +03:00
Raphael S. Carvalho	a2eed4bb45	service: Use optimistic replicas in all_sibling_tablet_replicas_colocated all_sibling_tablet_replicas_colocated was using committed ti.replicas to decide whether sibling tablets are co-located and merge can be finalized. This caused a false non-co-located window when a co-located pair was moved by the load balancer: as both tablets migrate together, their del_transition commits may land in different Raft rounds. After the first commit, ti.replicas diverge temporarily (one tablet shows the new position, the other the old), causing all_sibling_tablet_replicas_colocated to return false. This clears finalize_resize, allowing the load balancer to start new cascading migrations that delay merge finalization by tens of seconds. Fix this by using the optimistic replica view (trinfo->next when transitioning, ti.replicas otherwise) — the same view the load balancer uses for load accounting — so finalize_resize stays populated throughout an in-flight migration and no spurious cascades are triggered. Steps that lead to the problem: 1. Merge is triggered. The load balancer generates co-location migrations for all sibling pairs that are not yet on the same shard. Some pairs finish co-location before others. 2. Once all pairs are co-located in committed state, all_sibling_tablet_replicas_colocated returns true and finalize_resize is set. Meanwhile the load balancer may have already started a regular LB migration on one co-located pair (both tablets are stable and the load balancer is free to move them). 3. The LB migration moves both tablets together (colocated_tablets). Their two del_transition commits land in separate Raft rounds. After the first commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position. 4. In this window, all_sibling_tablet_replicas_colocated sees the pair as NOT co-located, clears finalize_resize, and the load balancer generates new migrations for other tablets to rebalance the load that the pair move created. 5. 
Those new migrations can take tens of seconds to stream, keeping the coordinator in handle_tablet_migration mode and preventing maybe_start_tablet_resize_finalization from being called. The merge finalization is delayed until all those cascaded migrations complete. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29465	2026-04-15 14:40:15 +03:00
Tomasz Grabiec	50fbac6ea6	tablets: Introduce pow2_count per-table tablet option By default it's true, in which case tablet count of the table is rounded up to a power of two. This option allows lifting this, in which case the count can be arbitrary. This will allow testing the logic of arbitrary tablet count.	2026-04-15 10:40:56 +02:00
Tomasz Grabiec	b6a7023f68	tablets: Prepare for non-power-of-two tablet count This is a step towards more flexibility in managing tablets. A prerequisite before we can split individual tablets, isolating hot partitions, and evening-out tablet sizes by shifting boundaries. After this patch, the system can handle tables with arbitrary tablet count. Tablet allocator is still rounding up desired tablet count to the nearest power of two when allocating tablets for a new table, so unless the tablet map is allocated in some other way, the counts will be still a power of two. We plan to utilize arbitrary count when migrating from vnodes to tablets, by creating a tablet map which matches vnode boundaries. One of the reasons we don't give up on power-of-two by default yet is that it creates an issue with merges. If tablet count is odd, one of the tablets doesn't have a sibling and will not be merged. That can obviously cause imbalance of token space and tablet sizes between tablets. To limit the impact, this patch dynamically chooses which tablet to isolate when initiating a merge. The largest tablet is chosen, as that will minimize imbalance. Otherwise, if we always chose the last tablet to isolate, its size would remain the same while other tablets double in size with each odd-count merge, leading to imbalance. The imbalance will still be there, but the difference in tablet sizes is limited to 2x. Example (3 tablets): [0] owns 1/3 of tokens [1] owns 1/3 of tokens [2] owns 1/3 of tokens After merge: [0] owns 2/3 of tokens [1] owns 1/3 of tokens What we would like instead: Step 1 (split [1]): [0] owns 1/3 of tokens [1] old 1.left, owns 1/6 of tokens [2] old 1.right, owns 1/6 of tokens [3] owns 1/3 of tokens Step 2 (merge): [0] owns 1/2 of tokens [1] owns 1/2 of tokens To do that, we need to be able to split individual tablets, but we're not there yet.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	f54daef4ec	tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets() This way it doesn't need to know how the scheduler chose to merge tablets. We'll have less duplication of logic.	2026-04-15 10:40:55 +02:00
Tomasz Grabiec	01fb97ee78	locator: tablets: Support arbitrary tablet boundaries There are several reasons we want to do that. One is that it will give us more flexibility in distributing the load. We can subdivide tablets at any points, and achieve more evenly-sized tablets. In particular, we can isolate large partitions into separate tablets. Another reason is vnode-to-tablet migration. We could construct a tablet map which matches exactly the vnode boundaries, so migration can happen transparently from the CQL-coordinator's point of view. Implementation details: We store a vector of tokens which represent tablet boundaries in the tablet_id_map. tablet_id keeps its meaning, it's an index into vector of tablets. To avoid logarithmic lookup of tablet_id from the token, we introduce a lookup structure with power-of-two aligned buckets, and store the tablet_id of the tablet which owns the first token in the bucket. This way, lookup needs to consider tablet id range which overlaps with one bucket. If boundaries are more or less aligned, there are around 1-2 tablets overlapping with a bucket, and the lookup is still O(1). Amount of memory used increased, but not significantly relative to old size (because tablet_info is currently fat): For 131'072 tablets: Before: Size of tablet_metadata in memory: 57456 KiB After: Size of tablet_metadata in memory: 59504 KiB	2026-04-15 01:25:14 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Nikos Dragazis	4a3e26d5e3	tablet_allocator: Exclude migrating tables from load balancing The tablet load balancer operates on all tablet-based tables that appear in the tablet metadata. With the introduction of the vnodes-to-tablets migration procedure later in this series, migrating tables will also appear in the tablet metadata, but they need to be treated as vnode tables until migration is finished. This patch excludes such tables from load balancing. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Dimitrios Symonidis	80b74d7df2	tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows. During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest. During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution. This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary. Closes scylladb/scylladb#28450	2026-03-03 11:19:24 +03:00
Avi Kivity	7ec710c250	Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec Tablet migration keeps sstable snapshot during streaming, which may cause temporary increase in disk utilization if compaction is running concurrently. SSTables compacted away are kept on disk until streaming is done with them. The more tablets we allow to migrate concurrently, the higher disk space can rise. When the target tablet size is configured correctly, every tablet should own about 1% of disk space. So concurrency of 4 shouldn't put us at risk. But target tablet size is not chosen dynamically yet, and it may not be aligned with disk capacity. Also, tablet sizes can temporarily grow above the target, up to 2x before the split starts, and some more because splits take a while to complete. To reduce the impact from this, reduce concurrency of migration. Concurrency of 2 should still be enough to saturate resources on the leaving shard. Also, reducing concurrency means that load balancing is more responsive to preemption. There will be less bandwidth sharing, so scheduled migrations complete faster. This is important for scale-out, where we bootstrap a node and want to start migrations to that new node as soon as possible. Refs scylladb/siren#15317 Closes scylladb/scylladb#28563 * github.com:scylladb/scylladb: tablets, config: Reduce migration concurrency to 2 tablets: load_balancer: Always accept migration if the load is 0 config, tablets: Make tablet migration concurrency configurable	2026-02-19 15:31:43 +02:00
Asias He	1be80c9e86	repair: Skip auto repair for tables using RF one There is no point running repair for tables using RF one. Row level repair will skip it but the auto repair scheduler will keep scheduling such repairs since repair_time could not be updated. Skip such repairs at the scheduler level for auto repair. If the request is issued by user, we will have to schedule such repair otherwise the user request will never be finished. Fixes SCYLLADB-561 Closes scylladb/scylladb#28640	2026-02-18 14:32:50 +02:00
Tomasz Grabiec	56e40e90c9	tablets: load_balancer: Always accept migration if the load is 0 Different transitions have different weights, and limits are configurable. We don't want a situation where a high-cost migration is cut off by limits and the system can make no progress. For example, repair uses weight 2 for read concurrency. Migrating co-located tablets scales the cost by the number of co-located tablets.	2026-02-06 00:42:18 +01:00
Tomasz Grabiec	39492596c2	config, tablets: Make tablet migration concurrency configurable We're about to reduce it. It's better to not have it hard-coded in case we change our minds again.	2026-02-06 00:42:18 +01:00
Botond Dénes	a8767f36da	Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec Contains various improvements to tablet load balancer. Batched together to save on the bill for CI. Most notably: - Make plan summary more concise, and print info only about present elements. - Print rack name in addition to DC name when making a per-rack plan - Print "Not possible to achieve balance" only when this is the final plan with no active migrations - Print per-node stats when "Not possible to achieve balance" is printed - amortize metrics lookup cost - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring" Backport to 2026.1: since the changes enhance debuggability and are relatively low risk Fixes #28423 Fixes #28422 Closes scylladb/scylladb#28337 * github.com:scylladb/scylladb: tablets: tablet_allocator.cc: Convert tabs to spaces tablets: load_balancer: Warn about incomplete stats once for all offending nodes tablets: load_balancer: Improve node stats printout tablets: load_balancer: Warn about imbalance only when there are no more active migrations tablets: load_balancer: Extract print_node_stats() tablet: load_balancer: Use empty() instead of size() where applicable tablets: Fix redundancy in migration_plan::empty() tablets: Cache pointer to stats during plan-making tablets: load_balancer: Print rack in addition to DC when giving context tablets: load_balancer: Make plan summary concise tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()	2026-01-29 08:25:17 +02:00
Tomasz Grabiec	df949dc506	Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`. To make cleanup reliable this PR: Adds an additional “fallback cleanup” stage - `write_both_read_old_fallback_cleanup` that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers. Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write. Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe. No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming. Closes scylladb/scylladb#28169 * github.com:scylladb/scylladb: topology_coordinator: add write_both_read_old_fallback_cleanup state topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old	2026-01-28 13:33:39 +01:00
Pavel Emelyanov	2ffe5b7d80	tablet_allocator: Have its own explicit background scheduling group Currently, tablet_allocator switches to streaming scheduling group that it gets from database. It's not nice to use database as provider of configs/scheduling_groups. This patch adds a background scheduling group for tablet allocator configured via its config and sets it to streaming group in main.cc code. This will help splitting the streaming scheduling group into more elaborated groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28356	2026-01-28 10:34:28 +02:00
Tomasz Grabiec	8e831a7b6d	tablets: tablet_allocator.cc: Convert tabs to spaces	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	9715965d0c	tablets: load_balancer: Warn about incomplete stats once for all offending nodes To reduce log spamming when all nodes are missing stats.	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	ef0e9ad34a	tablets: load_balancer: Improve node stats printout Make it more concise: - reduce precision for load to 6 fractional digits - reduce precision for tablets/shard to 3 fractional digits - print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places - print "rd=0 wr=0" instead of "stream_read=0 stream_write=0" Example: load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	4a161bff2d	tablets: load_balancer: Warn about imbalance only when there are no more active migrations Otherwise, it may be only a temporary situation due to lack of candidates, and may be unnecessarily alerting. Also, print node stats to allow assessing how bad the situation is on the spot. Those stats can hint to a cause of imbalance, if balancing is per-DC and racks have different capacity.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	7228bd1502	tablets: load_balancer: Extract print_node_stats()	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	615b86e88b	tablet: load_balancer: Use empty() instead of size() where applicable	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	0d090aa47b	tablets: Cache pointer to stats during plan-making Saves on lookup cost, esp. for candidate evaluation. This showed up in perf profile in the past. Also, lays the ground for splitting stats per rack.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	f2b0146f0f	tablets: load_balancer: Print rack in addition to DC when giving context Load-balancing can be now per-rack instead of per-DC. So just printing "in DC" is confusing. If we're balancing a rack, we should print which rack is that.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	df32318f66	tablets: load_balancer: Make plan summary concise Before: load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s) After: load_balancer - Prepared plan: migrations: 1 We print only stats about elements which are present.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	32b336e062	tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan() Just a cleanup. After this, we don't have a new scope in the outmost make_plan() just for injection handling.	2026-01-27 16:01:36 +01:00
Łukasz Paszkowski	f06094aa95	topology_coordinator: add write_both_read_old_fallback_cleanup state Yet another barrier-failure scenario exists in the `write_both_read_new` state. When the barrier fails, the tablet is expected to transition to `cleanup_target`, but because barrier execution is asynchronous, the cleanup transition can be skipped entirely and the tablet may continue forward instead. Both `write_both_read_new` and `cleanup_target` modify read and write selectors. In this situation, a barrier is required, and transitioning directly between these states without one is unsafe. Introduce an intermediate `write_both_read_old_fallback_cleanup` state that modifies only a read selector and can be entered without a barrier (there is no need to wait for all nodes to start using the "new" read selector). From there, the tablet can proceed to `cleanup_target`, where the required barriers are enforced. This also avoids changing both selectors in a single step. A direct transition from `write_both_read_new` to `cleanup_target` updates both selectors at once, which can leave coordinators using the old selector for writes and the new selector for reads, causing reads to miss preceding writes. By routing through the fallback state, selectors are updated in order—read first, then write—preserving read-after-write correctness.	2026-01-26 13:14:37 +01:00
Patryk Jędrzejczak	67045b5f17	Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. 
Fixes #21452 Closes scylladb/scylladb#24129 * https://github.com/scylladb/scylladb: docs: Document parallel decommission and removenode and relevant task API test: Add tests for parallel decommission/removenode test: util: Introduce ensure_group0_leader_on() test: tablets: Check that there are no migrations scheduled on draining nodes test: lib: topology_builder: Introduce add_draining_request() topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization tablets: topology_coordinator: Refactor to propagate reason for migration rollback tablet_allocator: Skip co-location on draining nodes node_ops: task_manager_module: Populate entity field also for active requests tasks: node_ops: Put node id in the entity field tasks, node_ops: Unify setting of task_stats in get_status() and get_stats() topology: Protect against empty cancelation reason tasks, topology: Make pending node operations abortable doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged raft_topology, tablets: Drain tablets in parallel with other topology operations virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node locator: topology: Add "draining" flag to a node topology_coordinator: Extract generate_cancel_request_update() storage_service: Drop dependency in topology_state_machine.hh in the header locator: Extract common code in assert_rf_rack_valid_keyspace() topology_coordinator, storage_service: Validate node removal/decommission at request submission time	2026-01-22 13:06:53 +01:00
Aleksandra Martyniuk	761ace4f05	config: add enforce_rack_list option Add enforce_rack_list option. When the option is set to true, all tablet keyspaces have rack list replication factor. When the option is on: - CREATE STATEMENT always auto-extends rf to rack lists; - ALTER STATEMENT fails when there is numeric rf in any DC. The flag is set to false by default and a node needs to be restarted in order to change its value. Starting a node with enforce_rack_list option will fail, if there are any tablet keyspaces with numeric rf in any DC. enforce_rack_list is a per-node option and a user needs to ensure that no tablet keyspace is altered or created while nodes in the cluster don't have the consistent value.	2026-01-20 09:58:51 +01:00
Tomasz Grabiec	2d954f4b19	tablet_allocator: Skip co-location on draining nodes In case of decommission, it's not desirable because it's less urgent. In case of removenode, it leads to failure of removenode operation because scheduled co-locating migration will fail if the destination is on the excluded node, and this failure will be interpreted as drain failure and coordinator will cancel the request. Not a problem before "parallel decommission" because this failure is only a streaming failure, not a barrier failure, so exception doesn't escape into the catch clause in transition stage handler, and the migration is simply rolled back. Once draining happens in the tablet migration track, streaming failure will be interpreted as drain failure and cancel the request.	2026-01-18 15:36:06 +01:00
Tomasz Grabiec	a009644c7d	raft_topology, tablets: Drain tablets in parallel with other topology operations Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. The test case test_explicit_tablet_movement_during_decommission is removed. It verifies that tablet move API works during tablet draining transition. After this PR, we no longer enter this transition, so the test doesn't work. It loses its purpose, because movement during normal tablet balancing is not special and tested elsewhere.	2026-01-18 15:36:05 +01:00
Aleksandra Martyniuk	504290902c	test: add test_numeric_rf_to_rack_list_conversion_abort Add regression test that checks whether aborted rf change leaves the system_schema.keyspaces unchanged.	2026-01-16 11:36:21 +01:00
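
Illustration for commit 01fb97ee78 (arbitrary tablet boundaries): the message describes a lookup structure with power-of-two aligned buckets, each storing the tablet_id that owns the first token in the bucket, so a lookup only scans the tablets overlapping one bucket. The following is a minimal sketch of that idea, not the ScyllaDB implementation. The class name, its members, and the convention that tablet i owns the range (last_tokens[i-1], last_tokens[i]] are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch; names and layout are assumptions, not ScyllaDB code.
class tablet_lookup_sketch {
    // Sorted last token of each tablet; tablet i owns (last[i-1], last[i]].
    std::vector<int64_t> _last_tokens;
    // For each bucket, the id of the tablet owning the bucket's first token.
    std::vector<size_t> _bucket_first;
    unsigned _shift;

    static constexpr uint64_t sign_bit = uint64_t(1) << 63;

    // Map signed tokens to unsigned so that ordering is preserved.
    static uint64_t bias(int64_t t) { return static_cast<uint64_t>(t) ^ sign_bit; }

    // Advance from a starting tablet id to the tablet that owns the token.
    size_t scan(size_t id, int64_t token) const {
        while (_last_tokens[id] < token) {
            ++id;
        }
        return id;
    }
public:
    // Assumes 1 <= log2_buckets <= 32 and that the final boundary is the
    // maximum token, so every token has an owner.
    tablet_lookup_sketch(std::vector<int64_t> last_tokens, unsigned log2_buckets)
        : _last_tokens(std::move(last_tokens)), _shift(64 - log2_buckets) {
        assert(log2_buckets >= 1 && log2_buckets <= 32);
        assert(!_last_tokens.empty()
               && _last_tokens.back() == std::numeric_limits<int64_t>::max());
        size_t id = 0;
        for (uint64_t b = 0; b < (uint64_t(1) << log2_buckets); ++b) {
            // First token covered by bucket b, converted back to signed.
            int64_t first = static_cast<int64_t>((b << _shift) ^ sign_bit);
            id = scan(id, first);
            _bucket_first.push_back(id);
        }
    }

    // Index the bucket in O(1), then scan only the tablets overlapping it.
    size_t get_tablet_id(int64_t token) const {
        return scan(_bucket_first[bias(token) >> _shift], token);
    }
};
```

With boundaries {-100, 0, 50, max} and 4 buckets, the scan after the bucket jump visits at most a few tablets. When boundaries are roughly aligned with the buckets, as the commit message notes, only one or two tablets overlap a bucket and the lookup stays effectively constant time.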


257 Commits