Chris Lu 2fd60cfbc3 fix(balance): guard against destination overshoot and oscillation (#9090)
* fix(balance): guard against destination overshoot and oscillation

Plugin-worker volume_balance detection re-selects maxServer/minServer
each iteration based on utilization ratio. With heterogeneous
MaxVolumeCount values, a single greedy move can flip which server is
most-utilized, causing A->B, B->A oscillation within one detection
cycle and pushing destinations past the cluster ideal.

Mirror the shell balancer's per-move guard
(weed/shell/command_volume_balance.go:440): before scheduling a move,
verify that the destination's post-move utilization would not strictly
exceed the source's post-move utilization. If it would, no single move
can improve balance, so stop.
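The guard itself is a small predicate. A minimal sketch (with a hypothetical helper name; the actual guard lives in weed/shell/command_volume_balance.go), using cross-multiplication so the utilization comparison stays in integer arithmetic:

```go
package main

import "fmt"

// moveWouldConverge reports whether moving one volume from source to
// destination leaves the destination no more utilized than the source
// after the move. Utilization is volumeCount / maxVolumeCount; the
// comparison cross-multiplies to avoid floating-point rounding.
// Hypothetical helper name, not the SeaweedFS function.
func moveWouldConverge(srcCount, srcMax, dstCount, dstMax int64) bool {
	// Reject if (dstCount+1)/dstMax > (srcCount-1)/srcMax.
	return (dstCount+1)*srcMax <= (srcCount-1)*dstMax
}

func main() {
	// Heterogeneous caps: source at 9/10 (90%), destination at 4/5 (80%).
	// The move would push the destination to 100% while the source drops
	// to 80%, so it is rejected.
	fmt.Println(moveWouldConverge(9, 10, 4, 5)) // false

	// Destination at 1/10 (10%): the move still converges, so it is allowed.
	fmt.Println(moveWouldConverge(9, 10, 1, 10)) // true
}
```

When this predicate fails for the current maxServer/minServer pair, no single move can improve balance and detection stops for the cycle.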

Add regression tests that cover:
- TestDetection_HeterogeneousMax_NoOvershootNoOscillation: 2 servers
  with different caps just above threshold; detection must not
  oscillate or make the imbalance worse.
- TestDetection_RespectsClusterIdealUtilization: 3-server heterogeneous
  layout; destinations must not overshoot cluster ideal.

* fix(balance): use effective capacity when resolving destination disk

resolveBalanceDestination read VolumeCount directly from the topology
snapshot, which is not updated when AddPendingTask registers a move
within the current detection cycle. This meant multiple moves planned
in a single cycle all saw the same static count and could target the
same disk past its effective capacity.

Switch to ActiveTopology.GetNodeDisks + GetEffectiveAvailableCapacity
so that destination planning accounts for all pending and assigned
tasks affecting the disk — consistent with how the detection loop
already tracks effectiveCounts at the server level.
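The effective-capacity idea reduces to subtracting in-flight work from the snapshot. A simplified sketch (hypothetical function; the real accounting is ActiveTopology.GetEffectiveAvailableCapacity over pending and assigned tasks):

```go
package main

import "fmt"

// effectiveAvailable returns the free slots on a disk after subtracting
// moves already planned against it in the current detection cycle.
// snapshotCount is the static VolumeCount from the topology snapshot;
// pendingIn counts moves targeting this disk that have been registered
// via AddPendingTask but are not yet reflected in the snapshot.
// Hypothetical sketch, not the SeaweedFS API.
func effectiveAvailable(maxSlots, snapshotCount, pendingIn int) int {
	return maxSlots - snapshotCount - pendingIn
}

func main() {
	// Disk with 2 free slots in the snapshot and 2 moves already pending:
	// a third planned move must be rejected.
	free := effectiveAvailable(10, 8, 2)
	fmt.Println(free > 0) // false

	// With only one pending move, one slot genuinely remains.
	fmt.Println(effectiveAvailable(10, 8, 1)) // 1
}
```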

Add a unit test that seeds two pending balance tasks against a
destination disk with 2 free slots and asserts resolveBalanceDestination
rejects a third planned move.

* fix(ec_balance): capacity-weighted guard in Phase 4 global rebalance

detectGlobalImbalance picked min/max nodes by raw shard count and
compared them against a simple (unweighted) rack-wide average. With
heterogeneous MaxVolumeCount across nodes in the same rack, this lets
the greedy algorithm move shards from a large, barely-used node to a
small, nearly-full node just because the small node has fewer shards
in absolute terms — strictly worsening imbalance by utilization and
potentially overfilling the small node.

Snapshot each node's total shard capacity (current shards plus free
slots) at loop start and add a per-move convergence guard: reject any
move where the destination's post-move utilization would strictly
exceed the source's post-move utilization. Mirrors the fix in
weed/worker/tasks/balance/detection.go.
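Applied to EC shards, the same guard compares shard counts against a capacity snapshot taken once per rack. A sketch under assumed type and function names (the real logic is in the ec_balance Phase 4 detection):

```go
package main

import "fmt"

// nodeCap is a per-node snapshot taken at loop start:
// capacity = current shard count + remaining free slots, held constant
// for the duration of the rebalance loop. Hypothetical names.
type nodeCap struct {
	shards, capacity int
}

// guardAllowsMove rejects any move whose destination post-move
// utilization would strictly exceed the source post-move utilization,
// using cross-multiplication to stay in integer arithmetic.
func guardAllowsMove(src, dst nodeCap) bool {
	return (dst.shards+1)*src.capacity <= (src.shards-1)*dst.capacity
}

func main() {
	// The scenario from the regression test: node1 at 10/100 (10% util),
	// node2 at 3/5 (60% util). Raw shard counts would move node1 -> node2;
	// the guard blocks that, since node2 would end up strictly fuller.
	node1 := nodeCap{shards: 10, capacity: 100}
	node2 := nodeCap{shards: 3, capacity: 5}
	fmt.Println(guardAllowsMove(node1, node2)) // false
}
```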

Regression test TestDetectGlobalImbalance_HeterogeneousCapacity covers
a rack with node1 (cap 100, 10 shards → 10% util) and node2 (cap 5,
3 shards → 60% util). Before the fix, Phase 4 moves 2 shards from
node1 to node2, filling node2 to 100% util. After the fix, the guard
blocks both moves.

* fix(ec_balance): utilization-based max/min in Phase 4 rebalance

Phase 4's global rebalancer picked source and destination nodes by raw
shard count, and compared against a simple raw-count average. With
heterogeneous MaxVolumeCount across nodes in a rack, this got the
direction wrong: a large-capacity node holding many shards in absolute
terms but only a small fraction of its capacity would be picked as the
"overloaded" source, while a small-capacity node nearly at its slot
limit (but holding fewer absolute shards) would be picked as the
"underloaded" destination. The previous fix added a strict-improvement
guard that prevented the bad move but left balance untouched — the
rack stayed in an uneven state.

Switch to utilization-based selection and a utilization-based pre-check:
- Pick max/min by (count / capacity), where capacity is the node's
  current allowed shards plus remaining free slots (snapshotted once
  per rack and held constant for the duration of the loop).
- Replace the raw-count imbalance gate (exceedsImbalanceThreshold) with
  a new exceedsUtilImbalanceThreshold helper that compares fractional
  fullness. The raw-count gate is still used by Phase 2 and Phase 3,
  where the per-rack / per-volume semantics differ.
- Drop the raw-count guards (the maxCount <= avgShards ||
  minCount+1 > avgShards early exit, and the maxCount-minCount <= 1
  check) now that the per-move strict-improvement check handles
  termination correctly for both homogeneous and heterogeneous capacity.
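Utilization-based selection can be sketched as follows (hypothetical types and helper name; the real selection is in the Phase 4 rebalancer). Cross-multiplication again keeps the count/capacity comparison exact:

```go
package main

import "fmt"

// node pairs a shard count with its snapshotted capacity
// (current shards + free slots). Hypothetical sketch types.
type node struct {
	name             string
	shards, capacity int
}

// pickMaxMinByUtilization selects the source (most utilized) and
// destination (least utilized) by fractional fullness rather than raw
// shard count, comparing a/b vs c/d as a*d vs c*b.
func pickMaxMinByUtilization(nodes []node) (maxN, minN node) {
	maxN, minN = nodes[0], nodes[0]
	for _, n := range nodes[1:] {
		if n.shards*maxN.capacity > maxN.shards*n.capacity {
			maxN = n
		}
		if n.shards*minN.capacity < minN.shards*n.capacity {
			minN = n
		}
	}
	return maxN, minN
}

func main() {
	nodes := []node{
		{"node1", 10, 100}, // many shards in absolute terms, 10% util
		{"node2", 3, 5},    // few shards, 60% util
	}
	maxN, minN := pickMaxMinByUtilization(nodes)
	// Raw counts would call node1 the source; utilization reverses it.
	fmt.Println(maxN.name, minN.name) // node2 node1
}
```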

Also fix a latent bug in the inner shard-selection loop: it was not
updating shardBits between iterations, so every iteration picked the
same lowest-set bit and emitted duplicate move requests for the same
physical shard. Update maxNode and minNode's shardBits immediately
after appending a move, mirroring what applyMovesToTopology does
between phases.
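The duplicate-shard bug and its fix come down to clearing the chosen bit in the shard bitmask between iterations. A minimal sketch with hypothetical names (not the SeaweedFS ShardBits type):

```go
package main

import (
	"fmt"
	"math/bits"
)

// planMoves picks up to n shards to move from srcBits, a bitmask with
// one bit set per shard ID held by the source. Clearing the chosen bit
// after each pick is the fix; without it, every iteration re-selects
// the same lowest-set bit and emits duplicate moves for one shard.
// Hypothetical sketch, not the SeaweedFS types.
func planMoves(srcBits uint32, n int) []int {
	var moves []int
	for i := 0; i < n && srcBits != 0; i++ {
		shardID := bits.TrailingZeros32(srcBits) // lowest-set bit = next shard
		moves = append(moves, shardID)
		srcBits &^= uint32(1) << shardID // clear it so the next pick differs
	}
	return moves
}

func main() {
	// Shards 0, 1, and 3 are present; each is picked exactly once.
	fmt.Println(planMoves(0b1011, 3)) // [0 1 3]
}
```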

Update TestDetectGlobalImbalance_HeterogeneousCapacity to assert:
- Moves flow from the higher-util node2 to the lower-util node1
  (direction check), and
- Each (volumeID, shardID) pair appears at most once in the move list
  (duplicate-shard guard).

* fix(ec_balance): keep source freeSlots in sync after planned shard moves

All three phase loops that plan EC shard moves (detectCrossRackImbalance,
detectWithinRackImbalance, detectGlobalImbalance) decrement the
destination node's freeSlots but leave the source node's freeSlots
stale. Over the course of a detection run that processes many volumes
or iterates within a rack, the source's reported freeSlots drifts
below its actual value.

In Phase 4 specifically, the per-move strict-improvement guard prevents
the source from becoming a destination candidate, so the stale value
never affects decisions. In Phases 2 and 3 it can: a node that sheds
shards for one volume's rebalance is eligible as a destination for
another volume in the same run, and the destination selection uses
node.freeSlots <= 0 as a hard skip (findDestNodeInUnderloadedRack /
findLeastLoadedNodeInRack). A tightly-provisioned node could be
skipped as a destination even after it has freed slots.

Increment maxNode.freeSlots / node.freeSlots symmetrically at each
scheduled move so freeSlots remains an accurate running view of
available slot capacity throughout a detection run.
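The symmetric bookkeeping is a two-line change per scheduled move. A sketch with hypothetical types (the real code adjusts the phase loops' node structs):

```go
package main

import "fmt"

// ecNode tracks a node's running free-slot count during a detection run.
// Hypothetical sketch type.
type ecNode struct {
	name      string
	freeSlots int
}

// scheduleMove records one planned shard move and keeps both sides'
// freeSlots accurate: the destination loses a slot and the source gains
// one back. Previously only the destination was decremented, so the
// source's freeSlots drifted below its actual value.
func scheduleMove(src, dst *ecNode) {
	dst.freeSlots--
	src.freeSlots++
}

func main() {
	src := &ecNode{name: "tight", freeSlots: 0} // tightly provisioned, currently full
	dst := &ecNode{name: "roomy", freeSlots: 5}
	scheduleMove(src, dst)
	// The source has freed a slot, so a later volume's rebalance in the
	// same run can now select it as a destination instead of skipping it
	// on the freeSlots <= 0 check.
	fmt.Println(src.freeSlots, dst.freeSlots) // 1 4
}
```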
2026-04-15 12:47:59 -07:00