service: Use optimistic replicas in all_sibling_tablet_replicas_colocated

all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.

Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.

Steps that lead to the problem:

1. Merge is triggered. The load balancer generates co-location migrations
   for all sibling pairs that are not yet on the same shard. Some pairs
   finish co-location before others.

2. Once all pairs are co-located in committed state,
   all_sibling_tablet_replicas_colocated returns true and finalize_resize
   is set. Meanwhile the load balancer may have already started a regular
   LB migration on one co-located pair (both tablets are stable and the
   load balancer is free to move them).

3. The LB migration moves both tablets together (colocated_tablets). Their
   two del_transition commits land in separate Raft rounds. After the first
   commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.

4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
   NOT co-located, clears finalize_resize, and the load balancer generates
   new migrations for other tablets to rebalance the load that the pair
   move created.

5. Those new migrations can take tens of seconds to stream, keeping the
   coordinator in handle_tablet_migration mode and preventing
   maybe_start_tablet_resize_finalization from being called. The merge
   finalization is delayed until all those cascaded migrations complete.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29465
Author: Raphael S. Carvalho
Date: 2026-04-13 11:54:03 -03:00
Committer: Botond Dénes
Parent: 53b6e9fda5
Commit: a2eed4bb45

@@ -1455,9 +1455,22 @@ public:
         }
         auto t2 = *t2_opt;
+        // Use the optimistic replica view (same as the load balancer uses for load accounting)
+        // instead of committed ti.replicas. This avoids a false non-co-located window that
+        // occurs when the two del_transition commits of a colocated LB migration land in
+        // different Raft rounds: after the first commit, ti.replicas diverge temporarily even
+        // though both tablets are still heading to the same shard. Using the optimistic view
+        // keeps finalize_resize populated throughout the migration and prevents the load
+        // balancer from starting cascading migrations in that window.
+        auto with_optimistic_replicas = [this](const tablet_desc& t) {
+            auto info = *t.info;
+            info.replicas = get_replicas_for_tablet_load(*t.info, t.transition);
+            return info;
+        };
         // Sibling tablets cannot be considered co-located if their tablet info is temporarily unmergeable.
-        // It can happen either has active repair task for example.
-        all_colocated &= bool(merge_tablet_info(*t1.info, *t2.info));
+        // It can happen if either has an active repair task, for example.
+        all_colocated &= bool(merge_tablet_info(with_optimistic_replicas(t1), with_optimistic_replicas(t2)));
         return make_ready_future<>();
     });
     if (all_colocated) {
@@ -3828,8 +3841,11 @@ public:
         total_tablet_count += tids.size();
         total_tablet_sizes += tablet_sizes_sum;
         if (tmap.needs_merge() && tids.size() == 2) {
-            // Exclude both sibling tablets if either haven't finished migration yet. That's to prevent balancer from
-            // un-doing the colocation.
+            // Exclude both sibling tablets if either hasn't finished migration yet. That's to prevent
+            // the load balancer from un-doing co-location while it's still in progress.
+            // Once both are stable, allow load balancing to act on the co-located pair.
+            // During drain, we similarly wait for both to finish migrating and then add them as
+            // a candidate so drain can migrate the co-located pair together off the draining node.
             if (!migrating(t1) && !migrating(t2)) {
                 auto candidate = colocated_tablets{global_tablet_id{table, t1.tid}, global_tablet_id{table, t2->tid}};
                 add_candidate(shard_load_info, migration_tablet_set{std::move(candidate), tablet_sizes_sum});