mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-21 09:11:29 +00:00
* fix(shell): count physical disks in cluster.status on multi-disk nodes
The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.
Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.
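
As a rough illustration of the counting change, the sketch below counts distinct physical disk ids inside each disk-type entry instead of counting map entries. The shardRecord and diskInfo types and the countPhysicalDisks helper are hypothetical stand-ins, not the real protobuf types or DiskInfo.SplitByPhysicalDisk, and counting an empty entry as one disk is an assumption.

```go
package main

import "fmt"

// shardRecord and diskInfo are hypothetical stand-ins for the master's
// per-record data and DiskInfo; they are not the real protobuf types.
type shardRecord struct {
	DiskID uint32 // physical disk the record lives on
}

type diskInfo struct {
	Records []shardRecord
}

// countPhysicalDisks counts distinct physical disk ids across all disk-type
// entries of one node, instead of counting the map entries themselves.
func countPhysicalDisks(diskInfos map[string]*diskInfo) int {
	total := 0
	for _, di := range diskInfos {
		seen := make(map[uint32]struct{})
		for _, r := range di.Records {
			seen[r.DiskID] = struct{}{}
		}
		if len(seen) == 0 {
			// Assumption: an entry without records still counts as one disk.
			total++
			continue
		}
		total += len(seen)
	}
	return total
}

func main() {
	// One disk-type entry that actually spans three physical disks.
	node := map[string]*diskInfo{
		"": {Records: []shardRecord{{DiskID: 0}, {DiskID: 1}, {DiskID: 1}, {DiskID: 2}}},
	}
	fmt.Println(len(node), countPhysicalDisks(node)) // prints "1 3"
}
```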
* fix(ec): attribute EC balance source disk per shard and reject same-node moves
On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.
Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.
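
The guard described above can be sketched roughly as follows. The move struct, its field names, and the validate helper are hypothetical; only the rule itself (dedup requires the same node and the same disk, and any other same-node move is rejected) comes from the text.

```go
package main

import (
	"errors"
	"fmt"
)

// move is a hypothetical, simplified shard move; the field names are
// illustrative and not taken from the worker's task parameters.
type move struct {
	SourceNode, TargetNode string
	SourceDisk, TargetDisk uint32
}

var errSameNodeCrossDisk = errors.New("same-node cross-disk move rejected")

// isDedup reports whether the move only removes a duplicate in place:
// same node AND same disk.
func isDedup(m move) bool {
	return m.SourceNode == m.TargetNode && m.SourceDisk == m.TargetDisk
}

// validate rejects moves where the copy would be skipped (same address)
// while the node-wide delete would still run, erasing the shard.
func validate(m move) error {
	if m.SourceNode == m.TargetNode && !isDedup(m) {
		return errSameNodeCrossDisk
	}
	return nil
}

func main() {
	moves := []move{
		{"dn1", "dn2", 0, 1}, // cross-node: allowed
		{"dn1", "dn1", 0, 0}, // dedup in place: allowed
		{"dn1", "dn1", 0, 1}, // same node, different disk: rejected
	}
	for _, m := range moves {
		fmt.Println(m, validate(m))
	}
}
```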
* fix(ec): spread EC balance moves across destination disks
Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
- fewer total EC shards on the disk,
- far fewer shards of the same volume on the disk (spread a volume's shards
across disks for fault tolerance), and
- data/parity anti-affinity (a data shard avoids disks holding the volume's
parity shards and vice versa).
Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.
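
A minimal sketch of lowest-score disk selection, assuming made-up weights (1 per shard, 100 per same-volume shard, 1000 for an anti-affinity violation). The disk struct, the weights, and the tie-break on disk id are illustrative assumptions and are not the values used by pickBestDiskOnNode.

```go
package main

import "fmt"

// disk is a hypothetical per-physical-disk view for one volume being moved.
type disk struct {
	ID               uint32
	TotalShards      int  // all EC shards on the disk
	SameVolumeShards int  // shards of the volume being moved
	HoldsData        bool // disk already holds data shards of the volume
	HoldsParity      bool // disk already holds parity shards of the volume
}

// score returns a penalty; lower is better. The weights are assumptions
// chosen only so that each criterion dominates the previous one.
func score(d disk, movingData bool) int {
	s := d.TotalShards + 100*d.SameVolumeShards
	if (movingData && d.HoldsParity) || (!movingData && d.HoldsData) {
		s += 1000 // data/parity anti-affinity
	}
	return s
}

// pickDisk returns the id of the lowest-scoring disk, breaking ties on the
// lower disk id so the choice is deterministic.
func pickDisk(disks []disk, movingData bool) (uint32, bool) {
	best := -1
	bestScore := 0
	for i, d := range disks {
		s := score(d, movingData)
		if best == -1 || s < bestScore || (s == bestScore && d.ID < disks[best].ID) {
			best, bestScore = i, s
		}
	}
	if best == -1 {
		return 0, false
	}
	return disks[best].ID, true
}

func main() {
	disks := []disk{
		{ID: 3, TotalShards: 1, HoldsParity: true},
		{ID: 4, TotalShards: 9},
	}
	id, ok := pickDisk(disks, true) // moving a data shard
	fmt.Println(id, ok)             // prints "4 true"
}
```

Reserving a planned placement would then amount to incrementing TotalShards and SameVolumeShards on the chosen disk before picking the next destination.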
* fix(ec): bring EC balance worker to parity with shell ec.balance
The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:
- Cross-rack and within-rack now run a two-pass balance: data shards spread
first, then parity shards spread while avoiding racks/nodes that already hold
the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
10+4 here), replacing the previously hardcoded constant at the call sites.
Selection is deterministic (sorted keys) to keep behavior reproducible.
* refactor(ec): extract shared ecbalancer package for shell and worker
The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.
- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
within-rack data/parity two-pass with anti-affinity, global per-rack balance,
and diversity-aware disk selection) over a caller-built Topology snapshot and
returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
task proposals; the shell builds it from its EcNode model and executes Moves
via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
(worker balances by fractional fullness; shell by raw count) and
GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
in one pass).
The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.
* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor
Address regressions and gaps from the ecbalancer extraction:
- Shell ec.balance honors -maxParallelization again: planned moves run phase by
phase (preserving cross-phase dependencies) with bounded concurrency within a
phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
instead of combined totals, so a data/parity skew is corrected even when the
per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
move, so limited-capacity targets are no longer over-planned.
* fix(ec): make EC dedup keeper deterministic and capacity-aware
When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
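
The keeper choice can be sketched as a sort over the nodes holding the duplicated shard. The holder type and pickKeeper helper are hypothetical, and breaking ties toward the lexicographically smaller node id is an assumption (the text only says the tie-break is on node id).

```go
package main

import (
	"fmt"
	"sort"
)

// holder is a hypothetical record of one node holding a copy of the shard.
type holder struct {
	NodeID    string
	FreeSlots int
}

// pickKeeper keeps the copy on the node with the most free slots and returns
// the remaining holders as deletion candidates. Ties fall back to node id.
func pickKeeper(holders []holder) (keep holder, remove []holder, ok bool) {
	if len(holders) == 0 {
		return holder{}, nil, false
	}
	sorted := append([]holder(nil), holders...) // do not reorder the input
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].FreeSlots != sorted[j].FreeSlots {
			return sorted[i].FreeSlots > sorted[j].FreeSlots
		}
		return sorted[i].NodeID < sorted[j].NodeID
	})
	return sorted[0], sorted[1:], true
}

func main() {
	keep, remove, _ := pickKeeper([]holder{
		{"dn3", 2}, {"dn1", 10}, {"dn2", 10},
	})
	fmt.Println(keep.NodeID, len(remove)) // prints "dn1 2"
}
```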
* fix(ec): restore global volume-diversity and per-volume move serialization
Two more behaviors lost in the ecbalancer refactor:
- Global rack balancing again prefers moving a shard of a volume the destination
does not hold at all before adding another shard of an already-present volume
(two-pass, mirroring the old balanceEcRack), keeping each volume's shards
spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
while still running different volumes in parallel, so concurrent moves of the
same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.
* fix(ec): key EC balance shards by (collection, volume id)
A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
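
To illustrate why the composite key matters, the sketch below groups shard records by a (collection, volume id) struct key. The volumeKey and shardRecord types and the groupByVolume helper are hypothetical and do not mirror the ecbalancer's Node.shards layout.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id.
type volumeKey struct {
	Collection string
	VolumeID   uint32
}

// shardRecord is a hypothetical flat record of one shard on some node.
type shardRecord struct {
	Collection string
	VolumeID   uint32
	ShardID    uint8
}

// groupByVolume groups shard ids per (collection, volume id), so volumes
// that reuse a numeric id in different collections stay separate.
func groupByVolume(records []shardRecord) map[volumeKey][]uint8 {
	out := make(map[volumeKey][]uint8)
	for _, r := range records {
		k := volumeKey{Collection: r.Collection, VolumeID: r.VolumeID}
		out[k] = append(out[k], r.ShardID)
	}
	return out
}

func main() {
	groups := groupByVolume([]shardRecord{
		{"c1", 7, 0},
		{"c2", 7, 0}, // same numeric id and shard id, different collection
	})
	fmt.Println(len(groups)) // prints "2"
}
```

With a key of the volume id alone, both records would land in one entry and shard 0 would look duplicated.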
* fix(ec): credit freed capacity from dedup before later balance phases
Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).
* test(ec): add multi-disk EC balance integration test
Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.
* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes
collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
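
The overwrite-versus-union difference can be sketched with shard bitmasks. The ecShardInfo type and unionShardBits helper are hypothetical simplifications, assuming each per-disk record carries a bitmask of the shard ids it holds.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ecShardInfo is a hypothetical per-disk record: one entry per physical disk
// that holds shards of a volume, with bit i set when shard i is present.
type ecShardInfo struct {
	VolumeID  uint32
	ShardBits uint32
}

// unionShardBits ORs together every per-disk record of the volume on one
// server, rather than letting the last record overwrite the earlier ones.
func unionShardBits(infos []ecShardInfo, vid uint32) uint32 {
	var merged uint32
	for _, info := range infos {
		if info.VolumeID == vid {
			merged |= info.ShardBits
		}
	}
	return merged
}

func main() {
	server := []ecShardInfo{
		{VolumeID: 1, ShardBits: 0x007F}, // disk A: shards 0-6
		{VolumeID: 1, ShardBits: 0x3F80}, // disk B: shards 7-13
	}
	merged := unionShardBits(server, 1)
	fmt.Println(bits.OnesCount32(merged)) // prints "14"
}
```

Overwriting instead of ORing would leave only disk B's seven shards visible for this server.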
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.
* test(ec): end-to-end multi-disk EC balance shard-loss regression
Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.
* fix(ec): bring EC balance worker/plugin path to parity with shell
- Per-volume serialization and phase order: key the plugin proposal dedupe by
(collection, volume) instead of (volume, shard, source), so the scheduler runs
only one of a volume's moves at a time (within a run and against in-flight jobs).
Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
because the planner emits a volume's moves in phase order, they now execute in
order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
read it into the worker config, mirroring ec.balance -shardReplicaPlacement.
* test(ec): rename worker in-process test (not a real integration test)
The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.
* ci(ec): remove redundant ec-integration workflow
ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.
* fix(ec): worker falls back to master default replication for EC balance
For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.
The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.
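
The precedence can be sketched as below. This is not the real resolveReplicaPlacement: the function name, the string-based signature, and the convention that an empty result means even spread are assumptions, as is treating an explicit all-zero value the same way as a zero master default.

```go
package main

import (
	"fmt"
	"strings"
)

// resolvePlacement picks the replica placement string to enforce:
// explicit config wins, else the master default, else none.
// An empty result means "no constraint, spread evenly".
func resolvePlacement(explicit, masterDefault string) string {
	chosen := explicit
	if chosen == "" {
		chosen = masterDefault
	}
	// A value made only of zeros (for example "000") imposes no constraint.
	if strings.Trim(chosen, "0") == "" {
		return ""
	}
	return chosen
}

func main() {
	fmt.Printf("%q\n", resolvePlacement("020", "001")) // prints "020"
	fmt.Printf("%q\n", resolvePlacement("", "001"))    // prints "001"
	fmt.Printf("%q\n", resolvePlacement("", "000"))    // prints ""
}
```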
* fix(ec): plugin path populates master default replication too
The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.
* docs(ec): empty shard replica placement uses master default, not even spread
The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
422 lines
18 KiB
Go
package shell

import (
	"testing"

	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)

func TestCommandEcBalanceSmall(t *testing.T) {
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn1", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
			newEcNode("dc1", "rack2", "dn2", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})
}

func TestCommandEcBalanceNothingToMove(t *testing.T) {
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn1", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}),
			newEcNode("dc1", "rack1", "dn2", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})
}

func TestCommandEcBalanceAddNewServers(t *testing.T) {
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn1", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}),
			newEcNode("dc1", "rack1", "dn2", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}),
			newEcNode("dc1", "rack1", "dn3", 100),
			newEcNode("dc1", "rack1", "dn4", 100),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})
}

func TestCommandEcBalanceAddNewRacks(t *testing.T) {
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn1", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}),
			newEcNode("dc1", "rack1", "dn2", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{7, 8, 9, 10, 11, 12, 13}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6}),
			newEcNode("dc1", "rack2", "dn3", 100),
			newEcNode("dc1", "rack2", "dn4", 100),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})
}

func TestCommandEcBalanceVolumeEvenButRackUneven(t *testing.T) {
	ecb := ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn_shared", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0}),

			newEcNode("dc1", "rack1", "dn_a1", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{1}),
			newEcNode("dc1", "rack1", "dn_a2", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{2}),
			newEcNode("dc1", "rack1", "dn_a3", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{3}),
			newEcNode("dc1", "rack1", "dn_a4", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{4}),
			newEcNode("dc1", "rack1", "dn_a5", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{5}),
			newEcNode("dc1", "rack1", "dn_a6", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{6}),
			newEcNode("dc1", "rack1", "dn_a7", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{7}),
			newEcNode("dc1", "rack1", "dn_a8", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{8}),
			newEcNode("dc1", "rack1", "dn_a9", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{9}),
			newEcNode("dc1", "rack1", "dn_a10", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{10}),
			newEcNode("dc1", "rack1", "dn_a11", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{11}),
			newEcNode("dc1", "rack1", "dn_a12", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{12}),
			newEcNode("dc1", "rack1", "dn_a13", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{13}),

			newEcNode("dc1", "rack1", "dn_b1", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{1}),
			newEcNode("dc1", "rack1", "dn_b2", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{2}),
			newEcNode("dc1", "rack1", "dn_b3", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{3}),
			newEcNode("dc1", "rack1", "dn_b4", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{4}),
			newEcNode("dc1", "rack1", "dn_b5", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{5}),
			newEcNode("dc1", "rack1", "dn_b6", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{6}),
			newEcNode("dc1", "rack1", "dn_b7", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{7}),
			newEcNode("dc1", "rack1", "dn_b8", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{8}),
			newEcNode("dc1", "rack1", "dn_b9", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{9}),
			newEcNode("dc1", "rack1", "dn_b10", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{10}),
			newEcNode("dc1", "rack1", "dn_b11", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{11}),
			newEcNode("dc1", "rack1", "dn_b12", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{12}),
			newEcNode("dc1", "rack1", "dn_b13", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{13}),

			newEcNode("dc1", "rack1", "dn3", 100),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})
}

func newEcNode(dc string, rack string, dataNodeId string, freeEcSlot int) *EcNode {
	return &EcNode{
		info: &master_pb.DataNodeInfo{
			Id:        dataNodeId,
			DiskInfos: make(map[string]*master_pb.DiskInfo),
		},
		dc:         DataCenterId(dc),
		rack:       RackId(rack),
		freeEcSlot: freeEcSlot,
	}
}

func (ecNode *EcNode) addEcVolumeAndShardsForTest(vid uint32, collection string, shardIds []erasure_coding.ShardId) *EcNode {
	return ecNode.addEcVolumeShards(needle.VolumeId(vid), collection, shardIds, types.HardDriveType)
}

// TestCommandEcBalanceEvenDataAndParityDistribution verifies that after balancing:
// 1. Data shards (0-9) are evenly distributed across racks (max 2 per rack for 6 racks)
// 2. Parity shards (10-13) are evenly distributed across racks (max 1 per rack for 6 racks)
func TestCommandEcBalanceEvenDataAndParityDistribution(t *testing.T) {
	// Setup: All 14 shards start on rack1 (simulating fresh EC encode)
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			// All shards initially on rack1/dn1
			newEcNode("dc1", "rack1", "dn1", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
			// Empty nodes on other racks
			newEcNode("dc1", "rack2", "dn2", 100),
			newEcNode("dc1", "rack3", "dn3", 100),
			newEcNode("dc1", "rack4", "dn4", 100),
			newEcNode("dc1", "rack5", "dn5", 100),
			newEcNode("dc1", "rack6", "dn6", 100),
		},
		applyBalancing: false, // Dry-run mode (simulates moves by updating internal state)
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})

	// After balancing (dry-run), verify the PLANNED distribution by checking what moves were proposed
	// The ecb.ecNodes state is updated during dry-run to track planned moves
	vid := needle.VolumeId(1)
	dataShardCount := erasure_coding.DataShardsCount     // 10
	parityShardCount := erasure_coding.ParityShardsCount // 4

	// Count data and parity shards per rack based on current (updated) state
	dataPerRack, parityPerRack := countDataAndParityShardsPerRack(ecb.ecNodes, vid, dataShardCount)

	// With 6 racks:
	// - Data shards (10): max 2 per rack (ceil(10/6) = 2)
	// - Parity shards (4): max 1 per rack (ceil(4/6) = 1)
	maxDataPerRack := ceilDivide(dataShardCount, 6)     // 2
	maxParityPerRack := ceilDivide(parityShardCount, 6) // 1

	// Verify no rack has more than max data shards
	for rackId, count := range dataPerRack {
		if count > maxDataPerRack {
			t.Errorf("rack %s has %d data shards, expected max %d", rackId, count, maxDataPerRack)
		}
	}

	// Verify no rack has more than max parity shards
	for rackId, count := range parityPerRack {
		if count > maxParityPerRack {
			t.Errorf("rack %s has %d parity shards, expected max %d", rackId, count, maxParityPerRack)
		}
	}

	// Verify all shards are distributed (total counts)
	totalData := 0
	totalParity := 0
	for _, count := range dataPerRack {
		totalData += count
	}
	for _, count := range parityPerRack {
		totalParity += count
	}
	if totalData != dataShardCount {
		t.Errorf("total data shards = %d, expected %d", totalData, dataShardCount)
	}
	if totalParity != parityShardCount {
		t.Errorf("total parity shards = %d, expected %d", totalParity, parityShardCount)
	}

	// Verify data shards are spread across at least 5 racks (10 shards / 2 max per rack)
	racksWithData := len(dataPerRack)
	minRacksForData := dataShardCount / maxDataPerRack // At least 5 racks needed for 10 data shards
	if racksWithData < minRacksForData {
		t.Errorf("data shards spread across only %d racks, expected at least %d", racksWithData, minRacksForData)
	}

	// Verify parity shards are spread across at least 4 racks (4 shards / 1 max per rack)
	racksWithParity := len(parityPerRack)
	if racksWithParity < parityShardCount {
		t.Errorf("parity shards spread across only %d racks, expected at least %d", racksWithParity, parityShardCount)
	}

	t.Logf("Distribution after balancing:")
	t.Logf(" Data shards per rack: %v (max allowed: %d)", dataPerRack, maxDataPerRack)
	t.Logf(" Parity shards per rack: %v (max allowed: %d)", parityPerRack, maxParityPerRack)
}

// countDataAndParityShardsPerRack counts data and parity shards per rack
func countDataAndParityShardsPerRack(ecNodes []*EcNode, vid needle.VolumeId, dataShardCount int) (dataPerRack, parityPerRack map[string]int) {
	dataPerRack = make(map[string]int)
	parityPerRack = make(map[string]int)

	for _, ecNode := range ecNodes {
		si := findEcVolumeShardsInfo(ecNode, vid, types.HardDriveType)
		for _, shardId := range si.Ids() {
			rackId := string(ecNode.rack)
			if int(shardId) < dataShardCount {
				dataPerRack[rackId]++
			} else {
				parityPerRack[rackId]++
			}
		}
	}
	return
}

// TestCommandEcBalanceMultipleVolumesEvenDistribution tests that multiple volumes
// each get their data and parity shards evenly distributed
func TestCommandEcBalanceMultipleVolumesEvenDistribution(t *testing.T) {
	// Setup: Two volumes, each with all 14 shards on different starting racks
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			// Volume 1: all shards on rack1
			newEcNode("dc1", "rack1", "dn1", 100).addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
			// Volume 2: all shards on rack2
			newEcNode("dc1", "rack2", "dn2", 100).addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13}),
			// Empty nodes on other racks
			newEcNode("dc1", "rack3", "dn3", 100),
			newEcNode("dc1", "rack4", "dn4", 100),
			newEcNode("dc1", "rack5", "dn5", 100),
			newEcNode("dc1", "rack6", "dn6", 100),
		},
		applyBalancing: false, // Dry-run mode
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})

	// Check both volumes
	for _, vid := range []needle.VolumeId{1, 2} {
		dataPerRack, parityPerRack := countDataAndParityShardsPerRack(ecb.ecNodes, vid, erasure_coding.DataShardsCount)

		maxDataPerRack := ceilDivide(erasure_coding.DataShardsCount, 6)
		maxParityPerRack := ceilDivide(erasure_coding.ParityShardsCount, 6)

		for rackId, count := range dataPerRack {
			if count > maxDataPerRack {
				t.Errorf("volume %d: rack %s has %d data shards, expected max %d", vid, rackId, count, maxDataPerRack)
			}
		}
		for rackId, count := range parityPerRack {
			if count > maxParityPerRack {
				t.Errorf("volume %d: rack %s has %d parity shards, expected max %d", vid, rackId, count, maxParityPerRack)
			}
		}

		t.Logf("Volume %d - Data: %v, Parity: %v", vid, dataPerRack, parityPerRack)
	}
}

// TestCommandEcBalanceAllNodesShareAllVolumes reproduces the scenario from issue #8793:
// When every node has a shard of every volume, ec.balance was unable to move any shards
// because it skipped volumes that already existed on the target node at the volume level.
func TestCommandEcBalanceAllNodesShareAllVolumes(t *testing.T) {
	// 4 nodes, all in same rack, 2 volumes with 14 shards each.
	// Distribute shards so every node has shards of both volumes, but unevenly:
	// dn1: vol1 shards 0-4, vol2 shards 0-4 => 10 shards
	// dn2: vol1 shards 5-9, vol2 shards 5-9 => 10 shards
	// dn3: vol1 shards 10-12, vol2 shards 10-12 => 6 shards
	// dn4: vol1 shard 13, vol2 shard 13 => 2 shards
	// Total: 28 shards, average = 7 per node
	ecb := &ecBalancer{
		ecNodes: []*EcNode{
			newEcNode("dc1", "rack1", "dn1", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{0, 1, 2, 3, 4}),
			newEcNode("dc1", "rack1", "dn2", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{5, 6, 7, 8, 9}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{5, 6, 7, 8, 9}),
			newEcNode("dc1", "rack1", "dn3", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{10, 11, 12}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{10, 11, 12}),
			newEcNode("dc1", "rack1", "dn4", 100).
				addEcVolumeAndShardsForTest(1, "c1", []erasure_coding.ShardId{13}).
				addEcVolumeAndShardsForTest(2, "c1", []erasure_coding.ShardId{13}),
		},
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	ecb.balance([]string{"c1"})

	// Count total shards per node after balancing
	for _, node := range ecb.ecNodes {
		count := 0
		if diskInfo, found := node.info.DiskInfos[string(types.HardDriveType)]; found {
			for _, ecsi := range diskInfo.EcShardInfos {
				count += erasure_coding.GetShardCount(ecsi)
			}
		}
		// Average is 7, so all nodes should be at 7 (ceil(28/4) = 7)
		if count > 7 {
			t.Errorf("node %s has %d shards after balancing, expected at most 7", node.info.Id, count)
		}
		t.Logf("node %s: %d shards", node.info.Id, count)
	}
}

// TestCommandEcBalanceIssue8793Topology simulates the real cluster from issue #8793:
// 14 nodes (9 with max=80, 5 with max=33), all in one rack, with mixed capacities.
// Each EC volume has 1 shard per node. Nodes have uneven totals (some have extra volumes).
func TestCommandEcBalanceIssue8793Topology(t *testing.T) {
	// Simulate 22 EC volumes across 14 nodes (each volume has 14 shards, 1 per node).
	// Give nodes 0-3 an extra volume each (vol 23-26, all 14 shards) to create imbalance.
	// Before balancing: nodes 0-3 have 22+14=36 shards each, nodes 4-13 have 22 shards each.
	// Total = 4*36 + 10*22 = 144+220 = 364. Average = ceil(364/14) = 26.

	type nodeSpec struct {
		id      string
		maxSlot int
	}
	nodes := []nodeSpec{
		{"192.168.0.12:8332", 80}, {"192.168.0.12:8333", 80}, {"192.168.0.12:8334", 80},
		{"192.168.0.12:8335", 80}, {"192.168.0.12:8336", 80}, {"192.168.0.12:8337", 80},
		{"192.168.0.12:8338", 80}, {"192.168.0.12:8339", 80}, {"192.168.0.12:8340", 80},
		{"192.168.0.12:8341", 33}, {"192.168.0.12:8342", 33}, {"192.168.0.12:8343", 33},
		{"192.168.0.25:8350", 33}, {"192.168.0.25:8351", 33},
	}

	ecNodes := make([]*EcNode, len(nodes))
	for i, ns := range nodes {
		ecNodes[i] = newEcNode("home", "center", ns.id, ns.maxSlot)
	}

	// 22 shared volumes: each node gets exactly 1 shard (shard i for node i)
	for vid := uint32(1); vid <= 22; vid++ {
		for i := range ecNodes {
			ecNodes[i].addEcVolumeAndShardsForTest(vid, "cldata", []erasure_coding.ShardId{erasure_coding.ShardId(i)})
		}
	}

	// 4 extra volumes only on first 4 nodes (all 14 shards each) to create imbalance
	for extra := uint32(0); extra < 4; extra++ {
		vid := 23 + extra
		nodeIdx := int(extra)
		allShards := make([]erasure_coding.ShardId, 14)
		for s := 0; s < 14; s++ {
			allShards[s] = erasure_coding.ShardId(s)
		}
		ecNodes[nodeIdx].addEcVolumeAndShardsForTest(vid, "cldata", allShards)
	}

	ecb := &ecBalancer{
		ecNodes:        ecNodes,
		applyBalancing: false,
		diskType:       types.HardDriveType,
	}

	// Log initial state
	for _, node := range ecb.ecNodes {
		count := 0
		if diskInfo, found := node.info.DiskInfos[string(types.HardDriveType)]; found {
			for _, ecsi := range diskInfo.EcShardInfos {
				count += erasure_coding.GetShardCount(ecsi)
			}
		}
		t.Logf("BEFORE node %s (max %d): %d shards", node.info.Id, node.freeEcSlot+count, count)
	}

	ecb.balance([]string{"cldata"})

	// Verify: no node should exceed the average
	totalShards := 0
	shardCounts := make(map[string]int)
	for _, node := range ecb.ecNodes {
		count := 0
		if diskInfo, found := node.info.DiskInfos[string(types.HardDriveType)]; found {
			for _, ecsi := range diskInfo.EcShardInfos {
				count += erasure_coding.GetShardCount(ecsi)
			}
		}
		shardCounts[node.info.Id] = count
		totalShards += count
	}
	avg := ceilDivide(totalShards, len(ecNodes))

	for _, node := range ecb.ecNodes {
		count := shardCounts[node.info.Id]
		t.Logf("AFTER node %s: %d shards (avg %d)", node.info.Id, count, avg)
		if count > avg {
			t.Errorf("node %s has %d shards, expected at most %d (avg)", node.info.Id, count, avg)
		}
	}
}