Commit Graph

257 Commits

Author SHA1 Message Date
Aleksandra Martyniuk
d874d355c2 service: skip load_sketch unload for excluded nodes on RF shrink
When an RF change shrinks replicas on a DC and the node being shrunk is
excluded, refresh_tablet_load_stats() only provides load_stats for that
node if it has a cached snapshot from when the node was still up. If the
snapshot is missing or predates the tables being shrunk (e.g. they were
created after the node went down), stats stay incomplete. In that case
load_sketch::unload() called from make_rf_change_plan() throws:

    Can't provide accurate load computation with incomplete load_stats
    for host: <uuid>

Since an excluded node is not expected to come back, load_stats will
never become complete, and the topology coordinator retries the plan
infinitely, hanging ALTER KEYSPACE.

Add a check for excluded nodes and skip unload() for them: we are
removing the replica, so accurate load data for that node is not
needed. For all other node states the throw-and-retry behavior is
preserved.

Modify test_excludenode_shrink_rf to always trigger the bug: a new
error injection 'force_down_node_load_stats_invalid' forces the
invalid-stats path in refresh_tablet_load_stats() for a down node, so
the test does not depend on whether the load-stats refresher happened
to cache the excluded node's stats while it was still up.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1702.

Closes scylladb/scylladb#29622
2026-05-15 17:46:28 +02:00
Tomasz Grabiec
66439bb753 Merge 'load_balancer: apply balance threshold to intranode shard balancing' from Ferenc Szili
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing

The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).

The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.

Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.

The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted

Fixes: SCYLLADB-1775

This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2

Closes scylladb/scylladb#29756

* github.com:scylladb/scylladb:
  test: add test for intranode balance threshold in size-based mode
  tablet_allocator: apply balance threshold to intranode shard balancing
2026-05-13 13:09:52 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore is like this

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is first step towards SCYLLADB-197 and lacks many things. In particular
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking API on restore will re-download everything again)
- not re-attacheable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node)
- nodes download sstables in maintenance/streaming sched gorup (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Avi Kivity
ddb1181103 Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1803

Closes scylladb/scylladb#29791

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-12 12:38:25 +03:00
Ferenc Szili
aaead10e5d tablet_allocator: apply balance threshold to intranode shard balancing
The intranode shard balancing loop only stopped when the most-loaded
and least-loaded shard were the same (src == dst), meaning it would
keep issuing migrations until the load difference reached exactly 0.
This caused unnecessary migrations for negligible imbalances.

Apply the same is_balanced() threshold check that is already used for
inter-node balancing, so that intranode migrations stop when the
relative load difference between shards is within the configured
size_based_balance_threshold (default 1%).
2026-05-12 10:34:16 +02:00
Pavel Emelyanov
17384d42e3 tablets: Implement tablet-aware cluster-wide restore
This patch adds

- Changes in sstables_loader::restore_tablets() method

It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest

- Implementation of tablet_restore_task_impl::run() method

It emplaces a bunch of tablet migrations with "restore" kind

- Topology coordinator handling of tablet_transition_stage::restore

When seen, the coordinator calls RESTORE_TABLET RPC against all tablet
replicas

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Calle Wilund
db1b92c185 service::load_balancer: Add metrics for repair and rebuild count
Fixes #21115

Adds cluster counter for repairs, and dc counter for rebuilds

Closes scylladb/scylladb#28985
2026-05-11 16:57:46 +03:00
Nadav Har'El
d4aa528834 Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.

`get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`

The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.

Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.

Fixes: SCYLLADB-1664

This fix needs to be backported to versions: 2025.4, 2026.1

Closes scylladb/scylladb#29585

* github.com:scylladb/scylladb:
  test: verify load balancer handles dropped tables gracefully
  tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
2026-05-10 22:07:51 +03:00
Ferenc Szili
906d2b817e service: allow draining with forced capacity-based balancing
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.

When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.

This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
2026-05-07 13:44:53 +02:00
Ferenc Szili
4987204f71 tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
The load balancer's get_schema_and_rs() would trigger on_internal_error when
a table present in the token metadata snapshot had been concurrently dropped
from the live schema. This race is possible because the balancer coroutine
yields between building the candidate list and checking replication
constraints, allowing a DROP TABLE schema mutation to be applied by another
fiber in the meantime.

Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables
instead of aborting. Update all callers to skip dropped tables:
- make_sizing_plan: continue to next table
- make_resize_plan: continue to next table (merge suppression is moot)
- check_constraints: return skip_info with empty viable targets
- get_rs: return nullptr, checked by check_constraints
2026-04-27 10:33:53 +02:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove a DC(s) in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor.

In existing approach, during RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to a ongoing_rf_changes - a new column in system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF.

In make_rf_change_plan, load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the Rf change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.

If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This will be interpreted by load balancer, that will start the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
Piotr Szymaniak
4b6937b570 alternator/streams: Block tablet merges when Alternator Streams are enabled
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).

add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.

For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.

Disabling streams clears all flags, immediately re-allowing merges.

The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.

Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
2026-04-19 03:54:33 +02:00
Aleksandra Martyniuk
2c0de7d9b3 test: test multi RF changes 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1b2b453782 service: tasks: allow aborting ongoing RF changes
Allow aborting an ongoing RF change using task manager.

RF change can only be aborted if:
- it is currently paused (existing);
- it is a multi-RF change that still has replicas to be added.

In the second case, we set error for the request in system.topology_requests
and set next_replication to replication_v2. This makes load balancer
roll back the RF change.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1bafc8394c service: handle multi_rf_change
Extend keyspace_rf_change handler to handle multi_rf_change.
multi_rf_change is allowed only if we add or remove DCs and
the keyspace uses rack list replication factor. The handler
adds the request id to topology::ongoing_rf_changes.
The request is further processed by load balancer.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
8fb91e245f service: implement make_rf_change_plan
In make_rf_change_plan, load balancer schedules necessary migrations,
considering the load of nodes and other pending tablet transitions.
Requests from ongoing_rf_changes are processed concurrently, independently
from one another. In each request racks are processed concurrently.
No tablet replica will be removed until all required replicas are added.
While adding replicas to each rack we always start with base tables
and won't proceed with views until they are done (while removing - the other
way around).

Node availability is checked at two levels for extending actions:

1) In prepare_per_rack_rf_change_plan: the entire RF change request is
   aborted if any node in the target dc+rack is down, or if there are
   no live (non-excluded) nodes at all. Shrinking is never aborted.

2) In make_rf_change_plan: extending is skipped for a given round if
   any normal, non-excluded node in the target dc+rack is missing from
   the balanced node set. Shrinking always proceeds regardless.

The resulting behavior per node state combination (extending only):
  - all up                  -> proceed
  - some excluded + some up -> proceed (excluded nodes are skipped)
  - any down node           -> abort
  - all excluded (no live)  -> abort

When the last step is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved (if request succeeded);
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
89a17491db service: add keyspace_rf_change_plan to migration_plan
Add keyspace_rf_change_plan to migration_plan.

The keyspace_rf_change_plan consists of:
- completion - info about the request for which all migrations are done. Only one
  request can be completed at the time, even if more have finished migrations
  (the rest will be completed later). Based on it:
    - next_replication is cleared;
    - new keyspace properties are saved (only if succeeded);
    - request is removed from ongoing_rf_changes;
    - the request is marked as done in system.topology_requests.
- aborts - info about requests that cannot complete because the required
  rf change is impossible (e.g. no available nodes in a required rack).
  Multiple requests can be aborted in a single plan. Based on each:
    - next_replication is set to current_replication (rolling back);
    - the request is marked as aborted with an error in system.topology_requests.

The scheduled rebuilds will be kept in migration_plan::_migrations.

Based on that the canonical_mutations are generated.

Add update_topology_state_with_mixed_change and use it if any schema
changes are required, i.e. if plan contains keyspace_rf_change_plan::completion.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
d41c5a7db4 service: split update_node_load_on_migration
Split update_node_load_on_migration into decrease_node_load and
increase_node_load - in the following changes for rebuilds we will
need only one of those at the time.
2026-04-17 09:58:07 +02:00
Avi Kivity
59ec93b86b Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.

Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.

Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".

build/release/scylla perf-tablets:

Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)

Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full      407.69ms
partial     2.65ms
```

After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full      354.50ms
partial     2.19ms
```

Closes scylladb/scylladb#28459

* github.com:scylladb/scylladb:
  test: boost: tablets: Add test for merge with arbitrary tablet count
  tablets, database: Advertise 'arbitrary' layout in snapshot manifest
  tablets: Introduce pow2_count per-table tablet option
  tablets: Prepare for non-power-of-two tablet count
  tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
  tablets: Prepare resize_decision to hold data in decisions
  tablets: table: Make storage_group handle arbitrary merge boundaries
  tablets: Make stats update post-merge work with arbitrary merge boundaries
  locator: tablets: Support arbitrary tablet boundaries
  locator: tablets: Introduce tablet_map::get_split_token()
  dht: Introduce get_uniform_tokens()
2026-04-15 18:57:22 +03:00
Raphael S. Carvalho
a2eed4bb45 service: Use optimistic replicas in all_sibling_tablet_replicas_colocated
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.

Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.

Steps that lead to the problem:

1. Merge is triggered. The load balancer generates co-location migrations
   for all sibling pairs that are not yet on the same shard. Some pairs
   finish co-location before others.

2. Once all pairs are co-located in committed state,
   all_sibling_tablet_replicas_colocated returns true and finalize_resize
   is set. Meanwhile the load balancer may have already started a regular
   LB migration on one co-located pair (both tablets are stable and the
   load balancer is free to move them).

3. The LB migration moves both tablets together (colocated_tablets). Their
   two del_transition commits land in separate Raft rounds. After the first
   commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.

4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
   NOT co-located, clears finalize_resize, and the load balancer generates
   new migrations for other tablets to rebalance the load that the pair
   move created.

5. Those new migrations can take tens of seconds to stream, keeping the
   coordinator in handle_tablet_migration mode and preventing
   maybe_start_tablet_resize_finalization from being called. The merge
   finalization is delayed until all those cascaded migrations complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29465
2026-04-15 14:40:15 +03:00
Tomasz Grabiec
50fbac6ea6 tablets: Introduce pow2_count per-table tablet option
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
2026-04-15 10:40:56 +02:00
Tomasz Grabiec
b6a7023f68 tablets: Prepare for non-power-of-two tablet count
This is a step towards more flexibility in managing tablets.  A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.

After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.

We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.

One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.

Example (3 tablets):
  [0] owns 1/3 of tokens
  [1] owns 1/3 of tokens
  [2] owns 1/3 of tokens

After merge:
  [0] owns 2/3 of tokens
  [1] owns 1/3 of tokens

What we would like instead:

Step 1 (split [1]):
  [0] owns 1/3 of tokens
  [1] old 1.left, owns 1/6 of tokens
  [2] old 1.right, owns 1/6 of tokens
  [3] owns 1/3 of tokens

Step 2 (merge):
  [0] owns 1/2 of tokens
  [1] owns 1/2 of tokens

To do that, we need to be able to split individual tablets, but we're
not there yet.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
f54daef4ec tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
This way it doesn't need to know how the scheduler chose to merge tablets.
We'll have less duplication of logic.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
01fb97ee78 locator: tablets: Support arbitrary tablet boundaries
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.

Implementation details:

We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).

Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):

For 131'072 tablets:

Before:

  Size of tablet_metadata in memory: 57456 KiB

After:

  Size of tablet_metadata in memory: 59504 KiB
2026-04-15 01:25:14 +02:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Nikos Dragazis
4a3e26d5e3 tablet_allocator: Exclude migrating tables from load balancing
The tablet load balancer operates on all tablet-based tables that appear
in the tablet metadata.

With the introduction of the vnodes-to-tablets migration procedure later
in this series, migrating tables will also appear in the tablet
metadata, but they need to be treated as vnode tables until migration is
finished. This patch excludes such tables from load balancing.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-03-24 11:06:38 +02:00
Dimitrios Symonidis
80b74d7df2 tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds
Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows.
During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest.
During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution.
This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary.

Closes scylladb/scylladb#28450
2026-03-03 11:19:24 +03:00
Avi Kivity
7ec710c250 Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec
Tablet migration keeps sstable snapshot during streaming, which may
cause temporary increase in disk utilization if compaction is running
concurrently. SSTables compacted away are kept on disk until streaming
is done with them. The more tablets we allow to migrate concurrently,
the higher disk space can rise. When the target tablet size is
configured correcly, every tablet should own about 1% of disk
space. So concurrency of 4 shouldn't put us at risk. But target tablet
size is not chosen dynamically yet, and it may not be aligned with
disk capacity.

Also, tablet sizes can temporarily grow above the target, up to 2x
before the split starts, and some more because splits take a while to
complete.

To reduce the impact from this, reduce concurrency of
migration. Concurrency of 2 should still be enough to saturate
resources on the leaving shard.

Also, reducing concurrency means that load balancing is more
responsive to preemption. There will be less bandwidth sharing, so
scheduled migrations complete faster. This is important for scale-out,
where we bootstrap a node and want to start migrations to that new
node as soon as possible.

Refs scylladb/siren#15317

Closes scylladb/scylladb#28563

* github.com:scylladb/scylladb:
  tablets, config: Reduce migration concurrency to 2
  tablets: load_balancer: Always accept migration if the load is 0
  config, tablets: Make tablet migration concurrency configurable
2026-02-19 15:31:43 +02:00
Asias He
1be80c9e86 repair: Skip auto repair for tables using RF one
There is no point running repair for tables using RF one. Row level
repair will skip it but the auto repair scheduler will keep scheduling
such repairs since repair_time could not be updated.

Skip such repairs at the scheduler level for auto repair.

If the request is issued by user, we will have to schedule such
repair otherwise the user request will never be finished.

Fixes SCYLLADB-561

Closes scylladb/scylladb#28640
2026-02-18 14:32:50 +02:00
Tomasz Grabiec
56e40e90c9 tablets: load_balancer: Always accept migration if the load is 0
Different transitions have different weights, and limits are
configurable.  We don't want a situation where a high-cost migration
is cut off by limits and the system can make no progress.

For example, repair uses weight 2 for read concurrency.  Migrating
co-located tablets scales the cost by the number of co-located
tablets.
2026-02-06 00:42:18 +01:00
Tomasz Grabiec
39492596c2 config, tablets: Make tablet migration concurrency configurable
We're about to reduce it. It's better to not have it hard-coded in
case we change our mings again.
2026-02-06 00:42:18 +01:00
Botond Dénes
a8767f36da Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

Closes scylladb/scylladb#28337

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-29 08:25:17 +02:00
Tomasz Grabiec
df949dc506 Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.

To make cleanup reliable this PR:

Adds an additional “fallback cleanup” stage

- `write_both_read_old_fallback_cleanup`

that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.

Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.

Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.

No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.

Closes scylladb/scylladb#28169

* github.com:scylladb/scylladb:
  topology_coordinator: add write_both_read_old_fallback_cleanup state
  topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
  topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
  topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
2026-01-28 13:33:39 +01:00
Pavel Emelyanov
2ffe5b7d80 tablet_allocator: Have its own explicit background scheduling group
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.

This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28356
2026-01-28 10:34:28 +02:00
Tomasz Grabiec
8e831a7b6d tablets: tablet_allocator.cc: Convert tabs to spaces 2026-01-28 01:32:01 +01:00
Tomasz Grabiec
9715965d0c tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
ef0e9ad34a tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
4a161bff2d tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
7228bd1502 tablets: load_balancer: Extract print_node_stats() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
615b86e88b tablet: load_balancer: Use empty() instead of size() where applicable 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
0d090aa47b tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
f2b0146f0f tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
df32318f66 tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
32b336e062 tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.
2026-01-27 16:01:36 +01:00
Łukasz Paszkowski
f06094aa95 topology_coordinator: add write_both_read_old_fallback_cleanup state
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.

Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.

Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.

This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.

By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
2026-01-26 13:14:37 +01:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Tomasz Grabiec
2d954f4b19 tablet_allocator: Skip co-location on draining nodes
In case of decommission, it's not desirable because it's less
urgent.

In case of removenode, it leads to failure of removenode operation
because scheduled co-locating migration will fail if the destination
is on the excluded node, and this failure will be interpreted as drain
failure and coordinator will cancel the request.

Not a problem before "parallel decommission" because this failure is
only a streaming failure, not a barrier failure, so exception doesn't
escape into the catch clause in transition stage handler, and the
migration is simply rolled back. Once draining happens in the tablet
migration track, streaming failure will be interpreted as drain
failure and cancel the request.
2026-01-18 15:36:06 +01:00
Tomasz Grabiec
a009644c7d raft_topology, tablets: Drain tablets in parallel with other topology operations
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

The test case test_explicit_tablet_movement_during_decommission is
removed. It verifies that tablet move API works during tablet draining
transition. After this PR, we no longer enter this transition, so the
test doesn't work. It loses its purpose, because movement during
normal tablet balancing is not special and tested elsewhere.
2026-01-18 15:36:05 +01:00
Aleksandra Martyniuk
504290902c test: add test_numeric_rf_to_rack_list_conversion_abort
Add regression test that checks whether aborted rf change leaves
the system_schema.keyspaces unchanged.
2026-01-16 11:36:21 +01:00