Files
seaweedfs/weed/shell
Chris Lu 391f543ff2 fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes

The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.

Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.

* fix(ec): attribute EC balance source disk per shard and reject same-node moves

On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.

Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.

* fix(ec): spread EC balance moves across destination disks

Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
  - fewer total EC shards on the disk,
  - far fewer shards of the same volume on the disk (spread a volume's shards
    across disks for fault tolerance), and
  - data/parity anti-affinity (a data shard avoids disks holding the volume's
    parity shards and vice versa).

Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.

* fix(ec): bring EC balance worker to parity with shell ec.balance

The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:

- Cross-rack and within-rack now run a two-pass balance: data shards spread
  first, then parity shards spread while avoiding racks/nodes that already hold
  the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
  and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
  constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
  keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
  10+4 here), replacing the previously hardcoded constant at the call sites.

Selection is deterministic (sorted keys) to keep behavior reproducible.

* refactor(ec): extract shared ecbalancer package for shell and worker

The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.

- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
  within-rack data/parity two-pass with anti-affinity, global per-rack balance,
  and diversity-aware disk selection) over a caller-built Topology snapshot and
  returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
  task proposals; the shell builds it from its EcNode model and executes Moves
  via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
  each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
  (worker balances by fractional fullness; shell by raw count) and
  GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
  in one pass).

The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.

* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor

Address regressions and gaps from the ecbalancer extraction:

- Shell ec.balance honors -maxParallelization again: planned moves run phase by
  phase (preserving cross-phase dependencies) with bounded concurrency within a
  phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
  updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
  instead of combined totals, so a data/parity skew is corrected even when the
  per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
  custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
  move, so limited-capacity targets are no longer over-planned.

* fix(ec): make EC dedup keeper deterministic and capacity-aware

When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.

* fix(ec): restore global volume-diversity and per-volume move serialization

Two more behaviors lost in the ecbalancer refactor:

- Global rack balancing again prefers moving a shard of a volume the destination
  does not hold at all before adding another shard of an already-present volume
  (two-pass, mirroring the old balanceEcRack), keeping each volume's shards
  spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
  while still running different volumes in parallel, so concurrent moves of the
  same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.

* fix(ec): key EC balance shards by (collection, volume id)

A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.

* fix(ec): credit freed capacity from dedup before later balance phases

Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).

* test(ec): add multi-disk EC balance integration test

Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.

* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes

collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.

Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.

* test(ec): end-to-end multi-disk EC balance shard-loss regression

Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.

* fix(ec): bring EC balance worker/plugin path to parity with shell

- Per-volume serialization and phase order: key the plugin proposal dedupe by
  (collection, volume) instead of (volume, shard, source), so the scheduler runs
  only one of a volume's moves at a time (within a run and against in-flight jobs).
  Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
  because the planner emits a volume's moves in phase order, they now execute in
  order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
  a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
  instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
  read it into the worker config, mirroring ec.balance -shardReplicaPlacement.

* test(ec): rename worker in-process test (not a real integration test)

The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.

* ci(ec): remove redundant ec-integration workflow

ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.

* fix(ec): worker falls back to master default replication for EC balance

For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.

The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.

* fix(ec): plugin path populates master default replication too

The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.

* docs(ec): empty shard replica placement uses master default, not even spread

The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
2026-05-20 23:31:21 -07:00
..
2026-02-09 01:37:56 -08:00
2024-09-29 10:38:22 -07:00
2026-02-25 10:25:23 -08:00
2024-09-29 10:38:22 -07:00
2024-09-29 10:38:22 -07:00