service: Use optimistic replicas in all_sibling_tablet_replicas_colocated

all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.

Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.

Steps that lead to the problem:

1. Merge is triggered. The load balancer generates co-location migrations
   for all sibling pairs that are not yet on the same shard. Some pairs
   finish co-location before others.

2. Once all pairs are co-located in committed state,
   all_sibling_tablet_replicas_colocated returns true and finalize_resize
   is set. Meanwhile the load balancer may have already started a regular
   LB migration on one co-located pair (both tablets are stable and the
   load balancer is free to move them).

3. The LB migration moves both tablets together (colocated_tablets). Their
   two del_transition commits land in separate Raft rounds. After the first
   commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.

4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
   NOT co-located, clears finalize_resize, and the load balancer generates
   new migrations for other tablets to rebalance the load that the pair
   move created.

5. Those new migrations can take tens of seconds to stream, keeping the
   coordinator in handle_tablet_migration mode and preventing
   maybe_start_tablet_resize_finalization from being called. The merge
   finalization is delayed until all those cascaded migrations complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29465
Author: Raphael S. Carvalho
Date: 2026-04-13 11:54:03 -03:00
Committer: Botond Dénes
Parent: 53b6e9fda5
Commit: a2eed4bb45

@@ -1455,9 +1455,22 @@ public:
         }
         auto t2 = *t2_opt;
+        // Use the optimistic replica view (same as the load balancer uses for load accounting)
+        // instead of committed ti.replicas. This avoids a false non-co-located window that
+        // occurs when the two del_transition commits of a colocated LB migration land in
+        // different Raft rounds: after the first commit, ti.replicas diverge temporarily even
+        // though both tablets are still heading to the same shard. Using the optimistic view
+        // keeps finalize_resize populated throughout the migration and prevents the load
+        // balancer from starting cascading migrations in that window.
+        auto with_optimistic_replicas = [this](const tablet_desc& t) {
+            auto info = *t.info;
+            info.replicas = get_replicas_for_tablet_load(*t.info, t.transition);
+            return info;
+        };
         // Sibling tablets cannot be considered co-located if their tablet info is temporarily unmergeable.
-        // It can happen either has active repair task for example.
-        all_colocated &= bool(merge_tablet_info(*t1.info, *t2.info));
+        // It can happen if either has an active repair task, for example.
+        all_colocated &= bool(merge_tablet_info(with_optimistic_replicas(t1), with_optimistic_replicas(t2)));
         return make_ready_future<>();
     });
     if (all_colocated) {
@@ -3828,8 +3841,11 @@ public:
         total_tablet_count += tids.size();
         total_tablet_sizes += tablet_sizes_sum;
         if (tmap.needs_merge() && tids.size() == 2) {
-            // Exclude both sibling tablets if either haven't finished migration yet. That's to prevent balancer from
-            // un-doing the colocation.
+            // Exclude both sibling tablets if either hasn't finished migration yet. That's to prevent
+            // the load balancer from un-doing co-location while it's still in progress.
+            // Once both are stable, allow load balancing to act on the co-located pair.
+            // During drain, we similarly wait for both to finish migrating and then add them as
+            // a candidate so drain can migrate the co-located pair together off the draining node.
             if (!migrating(t1) && !migrating(t2)) {
                 auto candidate = colocated_tablets{global_tablet_id{table, t1.tid}, global_tablet_id{table, t2->tid}};
                 add_candidate(shard_load_info, migration_tablet_set{std::move(candidate), tablet_sizes_sum});