Files
seaweedfs/weed/shell/command_ec_common.go
Chris Lu 391f543ff2 fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes

The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.

Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.

* fix(ec): attribute EC balance source disk per shard and reject same-node moves

On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.

Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.

* fix(ec): spread EC balance moves across destination disks

Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
  - fewer total EC shards on the disk,
  - far fewer shards of the same volume on the disk (spread a volume's shards
    across disks for fault tolerance), and
  - data/parity anti-affinity (a data shard avoids disks holding the volume's
    parity shards and vice versa).

Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.

* fix(ec): bring EC balance worker to parity with shell ec.balance

The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:

- Cross-rack and within-rack now run a two-pass balance: data shards spread
  first, then parity shards spread while avoiding racks/nodes that already hold
  the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
  and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
  constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
  keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
  10+4 here), replacing the previously hardcoded constant at the call sites.

Selection is deterministic (sorted keys) to keep behavior reproducible.

* refactor(ec): extract shared ecbalancer package for shell and worker

The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.

- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
  within-rack data/parity two-pass with anti-affinity, global per-rack balance,
  and diversity-aware disk selection) over a caller-built Topology snapshot and
  returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
  task proposals; the shell builds it from its EcNode model and executes Moves
  via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
  each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
  (worker balances by fractional fullness; shell by raw count) and
  GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
  in one pass).

The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.

* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor

Address regressions and gaps from the ecbalancer extraction:

- Shell ec.balance honors -maxParallelization again: planned moves run phase by
  phase (preserving cross-phase dependencies) with bounded concurrency within a
  phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
  updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
  instead of combined totals, so a data/parity skew is corrected even when the
  per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
  custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
  move, so limited-capacity targets are no longer over-planned.

* fix(ec): make EC dedup keeper deterministic and capacity-aware

When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.

* fix(ec): restore global volume-diversity and per-volume move serialization

Two more behaviors lost in the ecbalancer refactor:

- Global rack balancing again prefers moving a shard of a volume the destination
  does not hold at all before adding another shard of an already-present volume
  (two-pass, mirroring the old balanceEcRack), keeping each volume's shards
  spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
  while still running different volumes in parallel, so concurrent moves of the
  same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.

* fix(ec): key EC balance shards by (collection, volume id)

A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.

* fix(ec): credit freed capacity from dedup before later balance phases

Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).

* test(ec): add multi-disk EC balance integration test

Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.

* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes

collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.

Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.

* test(ec): end-to-end multi-disk EC balance shard-loss regression

Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.

* fix(ec): bring EC balance worker/plugin path to parity with shell

- Per-volume serialization and phase order: key the plugin proposal dedupe by
  (collection, volume) instead of (volume, shard, source), so the scheduler runs
  only one of a volume's moves at a time (within a run and against in-flight jobs).
  Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
  because the planner emits a volume's moves in phase order, they now execute in
  order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
  a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
  instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
  read it into the worker config, mirroring ec.balance -shardReplicaPlacement.

* test(ec): rename worker in-process test (not a real integration test)

The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.

* ci(ec): remove redundant ec-integration workflow

ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.

* fix(ec): worker falls back to master default replication for EC balance

For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.

The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.

* fix(ec): plugin path populates master default replication too

The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.

* docs(ec): empty shard replica placement uses master default, not even spread

The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
2026-05-20 23:31:21 -07:00

952 lines
30 KiB
Go

package shell
import (
"context"
"fmt"
"regexp"
"slices"
"sort"
"time"
"github.com/seaweedfs/seaweedfs/weed/glog"
"github.com/seaweedfs/seaweedfs/weed/operation"
"github.com/seaweedfs/seaweedfs/weed/pb"
"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding/ecbalancer"
"github.com/seaweedfs/seaweedfs/weed/storage/needle"
"github.com/seaweedfs/seaweedfs/weed/storage/super_block"
"github.com/seaweedfs/seaweedfs/weed/storage/types"
"google.golang.org/grpc"
)
type DataCenterId string
type EcNodeId string
type RackId string
// EcDisk represents a single disk on a volume server
type EcDisk struct {
diskId uint32
diskType string
freeEcSlots int
ecShardCount int // Total EC shards on this disk
// Map of volumeId -> ShardsInfo for shards on this disk
ecShards map[needle.VolumeId]*erasure_coding.ShardsInfo
}
type EcNode struct {
info *master_pb.DataNodeInfo
dc DataCenterId
rack RackId
freeEcSlot int
// disks maps diskId -> EcDisk for disk-level balancing
disks map[uint32]*EcDisk
}
type CandidateEcNode struct {
ecNode *EcNode
shardCount int
}
type EcRack struct {
ecNodes map[EcNodeId]*EcNode
freeEcSlot int
}
var (
ecBalanceAlgorithmDescription = `
func EcBalance() {
for each collection:
balanceEcVolumes(collectionName)
for each rack:
balanceEcRack(rack)
}
func balanceEcVolumes(collectionName){
for each volume:
doDeduplicateEcShards(volumeId)
tracks rack~shardCount mapping
for each volume:
doBalanceEcShardsAcrossRacks(volumeId)
for each volume:
doBalanceEcShardsWithinRacks(volumeId)
}
// spread ec shards into more racks
func doBalanceEcShardsAcrossRacks(volumeId){
tracks rack~volumeIdShardCount mapping
averageShardsPerEcRack = totalShardNumber / numRacks // totalShardNumber is 14 for now, later could varies for each dc
ecShardsToMove = select overflown ec shards from racks with ec shard counts > averageShardsPerEcRack
for each ecShardsToMove {
destRack = pickOneRack(rack~shardCount, rack~volumeIdShardCount, ecShardReplicaPlacement)
destVolumeServers = volume servers on the destRack
pickOneEcNodeAndMoveOneShard(destVolumeServers)
}
}
func doBalanceEcShardsWithinRacks(volumeId){
racks = collect all racks that the volume id is on
for rack, shards := range racks
doBalanceEcShardsWithinOneRack(volumeId, shards, rack)
}
// move ec shards
func doBalanceEcShardsWithinOneRack(volumeId, shards, rackId){
tracks volumeServer~volumeIdShardCount mapping
averageShardCount = len(shards) / numVolumeServers
volumeServersOverAverage = volume servers with volumeId's ec shard counts > averageShardsPerEcRack
ecShardsToMove = select overflown ec shards from volumeServersOverAverage
for each ecShardsToMove {
destVolumeServer = pickOneVolumeServer(volumeServer~shardCount, volumeServer~volumeIdShardCount, ecShardReplicaPlacement)
pickOneEcNodeAndMoveOneShard(destVolumeServers)
}
}
// move ec shards while keeping shard distribution for the same volume unchanged or more even
func balanceEcRack(rack){
averageShardCount = total shards / numVolumeServers
for hasMovedOneEcShard {
sort all volume servers ordered by the number of local ec shards
pick the volume server A with the lowest number of ec shards x
pick the volume server B with the highest number of ec shards y
if y > averageShardCount and x +1 <= averageShardCount {
if B has a ec shard with volume id v that A does not have {
move one ec shard v from B to A
hasMovedOneEcShard = true
}
}
}
}
`
// Overridable functions for testing.
getDefaultReplicaPlacement = _getDefaultReplicaPlacement
)
func _getDefaultReplicaPlacement(commandEnv *CommandEnv) (*super_block.ReplicaPlacement, error) {
var resp *master_pb.GetMasterConfigurationResponse
var err error
err = commandEnv.MasterClient.WithClient(false, func(client master_pb.SeaweedClient) error {
resp, err = client.GetMasterConfiguration(context.Background(), &master_pb.GetMasterConfigurationRequest{})
return err
})
if err != nil {
return nil, err
}
return super_block.NewReplicaPlacementFromString(resp.DefaultReplication)
}
func parseReplicaPlacementArg(commandEnv *CommandEnv, replicaStr string) (*super_block.ReplicaPlacement, error) {
var rp *super_block.ReplicaPlacement
var err error
if replicaStr != "" {
rp, err = super_block.NewReplicaPlacementFromString(replicaStr)
if err != nil {
return rp, err
}
glog.V(1).Infof("using replica placement %q for EC volumes\n", rp.String())
} else {
// No replica placement argument provided, resolve from master default settings.
rp, err = getDefaultReplicaPlacement(commandEnv)
if err != nil {
return rp, err
}
glog.V(1).Infof("using master default replica placement %q for EC volumes\n", rp.String())
}
return rp, nil
}
func collectTopologyInfo(commandEnv *CommandEnv, delayBeforeCollecting time.Duration) (topoInfo *master_pb.TopologyInfo, volumeSizeLimitMb uint64, err error) {
if delayBeforeCollecting > 0 {
time.Sleep(delayBeforeCollecting)
}
var resp *master_pb.VolumeListResponse
err = commandEnv.MasterClient.WithClient(false, func(client master_pb.SeaweedClient) error {
resp, err = client.VolumeList(context.Background(), &master_pb.VolumeListRequest{})
return err
})
if err != nil {
return
}
return resp.TopologyInfo, resp.VolumeSizeLimitMb, nil
}
func collectDataNodes(commandEnv *CommandEnv, delayBeforeCollecting time.Duration) ([]*master_pb.DataNodeInfo, error) {
dataNodes := []*master_pb.DataNodeInfo{}
topo, _, err := collectTopologyInfo(commandEnv, delayBeforeCollecting)
if err != nil {
return nil, err
}
for _, dci := range topo.GetDataCenterInfos() {
for _, r := range dci.GetRackInfos() {
for _, dn := range r.GetDataNodeInfos() {
dataNodes = append(dataNodes, dn)
}
}
}
return dataNodes, nil
}
func collectEcNodesForDC(commandEnv *CommandEnv, selectedDataCenter string, diskType types.DiskType) (ecNodes []*EcNode, totalFreeEcSlots int, err error) {
// list all possible locations
// collect topology information
topologyInfo, _, err := collectTopologyInfo(commandEnv, 0)
if err != nil {
return
}
// find out all volume servers with one slot left.
ecNodes, totalFreeEcSlots = collectEcVolumeServersByDc(topologyInfo, selectedDataCenter, diskType)
sortEcNodesByFreeslotsDescending(ecNodes)
return
}
func collectEcNodes(commandEnv *CommandEnv, diskType types.DiskType) (ecNodes []*EcNode, totalFreeEcSlots int, err error) {
return collectEcNodesForDC(commandEnv, "", diskType)
}
// collectVolumeIdToCollection returns a map from volume ID to its collection name
func collectVolumeIdToCollection(t *master_pb.TopologyInfo, vids []needle.VolumeId) map[needle.VolumeId]string {
result := make(map[needle.VolumeId]string)
if len(vids) == 0 {
return result
}
vidSet := make(map[needle.VolumeId]bool)
for _, vid := range vids {
vidSet[vid] = true
}
for _, dc := range t.DataCenterInfos {
for _, r := range dc.RackInfos {
for _, dn := range r.DataNodeInfos {
for _, diskInfo := range dn.DiskInfos {
for _, vi := range diskInfo.VolumeInfos {
vid := needle.VolumeId(vi.Id)
if vidSet[vid] {
result[vid] = vi.Collection
}
}
}
}
}
}
return result
}
func collectCollectionsForVolumeIds(t *master_pb.TopologyInfo, vids []needle.VolumeId) []string {
if len(vids) == 0 {
return nil
}
found := map[string]bool{}
for _, dc := range t.DataCenterInfos {
for _, r := range dc.RackInfos {
for _, dn := range r.DataNodeInfos {
for _, diskInfo := range dn.DiskInfos {
for _, vi := range diskInfo.VolumeInfos {
for _, vid := range vids {
if needle.VolumeId(vi.Id) == vid {
found[vi.Collection] = true
}
}
}
for _, ecs := range diskInfo.EcShardInfos {
for _, vid := range vids {
if needle.VolumeId(ecs.Id) == vid {
found[ecs.Collection] = true
}
}
}
}
}
}
}
if len(found) == 0 {
return nil
}
collections := []string{}
for k, _ := range found {
collections = append(collections, k)
}
sort.Strings(collections)
return collections
}
func moveMountedShardToEcNode(commandEnv *CommandEnv, existingLocation *EcNode, collection string, vid needle.VolumeId, shardId erasure_coding.ShardId, destinationEcNode *EcNode, destDiskId uint32, applyBalancing bool, diskType types.DiskType) (err error) {
if !commandEnv.isLocked() {
return fmt.Errorf("lock is lost")
}
copiedShardIds := []erasure_coding.ShardId{shardId}
if applyBalancing {
existingServerAddress := pb.NewServerAddressFromDataNode(existingLocation.info)
// ask destination node to copy shard and the ecx file from source node, and mount it
copiedShardIds, err = oneServerCopyAndMountEcShardsFromSource(commandEnv.option.GrpcDialOption, destinationEcNode, []erasure_coding.ShardId{shardId}, vid, collection, existingServerAddress, destDiskId)
if err != nil {
return err
}
// unmount the to be deleted shards
err = unmountEcShards(commandEnv.option.GrpcDialOption, vid, existingServerAddress, copiedShardIds)
if err != nil {
return err
}
// ask source node to delete the shard, and maybe the ecx file
err = sourceServerDeleteEcShards(commandEnv.option.GrpcDialOption, collection, vid, existingServerAddress, copiedShardIds)
if err != nil {
return err
}
if destDiskId > 0 {
fmt.Printf("moved ec shard %d.%d %s => %s (disk %d)\n", vid, shardId, existingLocation.info.Id, destinationEcNode.info.Id, destDiskId)
} else {
fmt.Printf("moved ec shard %d.%d %s => %s\n", vid, shardId, existingLocation.info.Id, destinationEcNode.info.Id)
}
}
destinationEcNode.addEcVolumeShards(vid, collection, copiedShardIds, diskType)
existingLocation.deleteEcVolumeShards(vid, copiedShardIds, diskType)
return nil
}
func oneServerCopyAndMountEcShardsFromSource(grpcDialOption grpc.DialOption,
targetServer *EcNode, shardIdsToCopy []erasure_coding.ShardId,
volumeId needle.VolumeId, collection string, existingLocation pb.ServerAddress, destDiskId uint32) (copiedShardIds []erasure_coding.ShardId, err error) {
fmt.Printf("allocate %d.%v %s => %s\n", volumeId, shardIdsToCopy, existingLocation, targetServer.info.Id)
targetAddress := pb.NewServerAddressFromDataNode(targetServer.info)
err = operation.WithVolumeServerClient(false, targetAddress, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
if targetAddress != existingLocation {
fmt.Printf("copy %d.%v %s => %s\n", volumeId, shardIdsToCopy, existingLocation, targetServer.info.Id)
_, copyErr := volumeServerClient.VolumeEcShardsCopy(context.Background(), &volume_server_pb.VolumeEcShardsCopyRequest{
VolumeId: uint32(volumeId),
Collection: collection,
ShardIds: erasure_coding.ShardIdsToUint32(shardIdsToCopy),
CopyEcxFile: true,
CopyEcjFile: true,
CopyVifFile: true,
SourceDataNode: string(existingLocation),
DiskId: destDiskId,
})
if copyErr != nil {
return fmt.Errorf("copy %d.%v %s => %s : %v\n", volumeId, shardIdsToCopy, existingLocation, targetServer.info.Id, copyErr)
}
}
fmt.Printf("mount %d.%v on %s\n", volumeId, shardIdsToCopy, targetServer.info.Id)
_, mountErr := volumeServerClient.VolumeEcShardsMount(context.Background(), &volume_server_pb.VolumeEcShardsMountRequest{
VolumeId: uint32(volumeId),
Collection: collection,
ShardIds: erasure_coding.ShardIdsToUint32(shardIdsToCopy),
})
if mountErr != nil {
return fmt.Errorf("mount %d.%v on %s : %v\n", volumeId, shardIdsToCopy, targetServer.info.Id, mountErr)
}
if targetAddress != existingLocation {
copiedShardIds = shardIdsToCopy
glog.V(0).Infof("%s ec volume %d deletes shards %+v", existingLocation, volumeId, copiedShardIds)
}
return nil
})
if err != nil {
return
}
return
}
func eachDataNode(topo *master_pb.TopologyInfo, fn func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo)) {
for _, dc := range topo.DataCenterInfos {
for _, rack := range dc.RackInfos {
for _, dn := range rack.DataNodeInfos {
fn(DataCenterId(dc.Id), RackId(rack.Id), dn)
}
}
}
}
func sortEcNodesByFreeslotsDescending(ecNodes []*EcNode) {
slices.SortFunc(ecNodes, func(a, b *EcNode) int {
return b.freeEcSlot - a.freeEcSlot
})
}
func sortEcNodesByFreeslotsAscending(ecNodes []*EcNode) {
slices.SortFunc(ecNodes, func(a, b *EcNode) int {
return a.freeEcSlot - b.freeEcSlot
})
}
func countShards(ecShardInfos []*master_pb.VolumeEcShardInformationMessage) (count int) {
for _, eci := range ecShardInfos {
count += erasure_coding.GetShardCount(eci)
}
return
}
func countFreeShardSlots(dn *master_pb.DataNodeInfo, diskType types.DiskType) (count int) {
if dn.DiskInfos == nil {
return 0
}
diskInfo := dn.DiskInfos[string(diskType)]
if diskInfo == nil {
return 0
}
slots := int(diskInfo.MaxVolumeCount-diskInfo.VolumeCount)*erasure_coding.DataShardsCount - countShards(diskInfo.EcShardInfos)
if slots < 0 {
return 0
}
return slots
}
func (ecNode *EcNode) localShardIdCount(vid uint32) int {
for _, diskInfo := range ecNode.info.DiskInfos {
for _, eci := range diskInfo.EcShardInfos {
if vid == eci.Id {
return erasure_coding.GetShardCount(eci)
}
}
}
return 0
}
func collectEcVolumeServersByDc(topo *master_pb.TopologyInfo, selectedDataCenter string, diskType types.DiskType) (ecNodes []*EcNode, totalFreeEcSlots int) {
eachDataNode(topo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
if selectedDataCenter != "" && selectedDataCenter != string(dc) {
return
}
freeEcSlots := countFreeShardSlots(dn, diskType)
ecNode := &EcNode{
info: dn,
dc: dc,
rack: rack,
freeEcSlot: int(freeEcSlots),
disks: make(map[uint32]*EcDisk),
}
// Build disk-level information from volumes and EC shards
// First, discover all unique disk IDs from VolumeInfos (includes empty disks)
allDiskIds := make(map[uint32]string) // diskId -> diskType
for diskTypeKey, diskInfo := range dn.DiskInfos {
if diskInfo == nil {
continue
}
// Get all disk IDs from volumes
for _, vi := range diskInfo.VolumeInfos {
allDiskIds[vi.DiskId] = diskTypeKey
}
// Also get disk IDs from EC shards
for _, ecShardInfo := range diskInfo.EcShardInfos {
allDiskIds[ecShardInfo.DiskId] = diskTypeKey
}
}
// Group EC shards by disk_id
diskShards := make(map[uint32]map[needle.VolumeId]*erasure_coding.ShardsInfo)
for _, diskInfo := range dn.DiskInfos {
if diskInfo == nil {
continue
}
for _, eci := range diskInfo.EcShardInfos {
diskId := eci.DiskId
if diskShards[diskId] == nil {
diskShards[diskId] = make(map[needle.VolumeId]*erasure_coding.ShardsInfo)
}
vid := needle.VolumeId(eci.Id)
diskShards[diskId][vid] = erasure_coding.ShardsInfoFromVolumeEcShardInformationMessage(eci)
}
}
// Create EcDisk for each discovered disk
diskCount := len(allDiskIds)
if diskCount == 0 {
diskCount = 1
}
freePerDisk := int(freeEcSlots) / diskCount
for diskId, diskTypeStr := range allDiskIds {
shards := diskShards[diskId]
if shards == nil {
shards = make(map[needle.VolumeId]*erasure_coding.ShardsInfo)
}
totalShardCount := 0
for _, shardsInfo := range shards {
totalShardCount += shardsInfo.Count()
}
ecNode.disks[diskId] = &EcDisk{
diskId: diskId,
diskType: diskTypeStr,
freeEcSlots: freePerDisk,
ecShardCount: totalShardCount,
ecShards: shards,
}
}
ecNodes = append(ecNodes, ecNode)
totalFreeEcSlots += freeEcSlots
})
return
}
func sourceServerDeleteEcShards(grpcDialOption grpc.DialOption, collection string, volumeId needle.VolumeId, sourceLocation pb.ServerAddress, toBeDeletedShardIds []erasure_coding.ShardId) error {
fmt.Printf("delete %d.%v from %s\n", volumeId, toBeDeletedShardIds, sourceLocation)
return operation.WithVolumeServerClient(false, sourceLocation, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
_, deleteErr := volumeServerClient.VolumeEcShardsDelete(context.Background(), &volume_server_pb.VolumeEcShardsDeleteRequest{
VolumeId: uint32(volumeId),
Collection: collection,
ShardIds: erasure_coding.ShardIdsToUint32(toBeDeletedShardIds),
})
return deleteErr
})
}
func unmountEcShards(grpcDialOption grpc.DialOption, volumeId needle.VolumeId, sourceLocation pb.ServerAddress, toBeUnmountedShardIds []erasure_coding.ShardId) error {
fmt.Printf("unmount %d.%v from %s\n", volumeId, toBeUnmountedShardIds, sourceLocation)
return operation.WithVolumeServerClient(false, sourceLocation, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
_, deleteErr := volumeServerClient.VolumeEcShardsUnmount(context.Background(), &volume_server_pb.VolumeEcShardsUnmountRequest{
VolumeId: uint32(volumeId),
ShardIds: erasure_coding.ShardIdsToUint32(toBeUnmountedShardIds),
})
return deleteErr
})
}
func mountEcShards(grpcDialOption grpc.DialOption, collection string, volumeId needle.VolumeId, sourceLocation pb.ServerAddress, toBeMountedShardIds []erasure_coding.ShardId) error {
fmt.Printf("mount %d.%v on %s\n", volumeId, toBeMountedShardIds, sourceLocation)
return operation.WithVolumeServerClient(false, sourceLocation, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
_, mountErr := volumeServerClient.VolumeEcShardsMount(context.Background(), &volume_server_pb.VolumeEcShardsMountRequest{
VolumeId: uint32(volumeId),
Collection: collection,
ShardIds: erasure_coding.ShardIdsToUint32(toBeMountedShardIds),
})
return mountErr
})
}
func ceilDivide(a, b int) int {
var r int
if (a % b) != 0 {
r = 1
}
return (a / b) + r
}
func findEcVolumeShardsInfo(ecNode *EcNode, vid needle.VolumeId, diskType types.DiskType) *erasure_coding.ShardsInfo {
if diskInfo, found := ecNode.info.DiskInfos[string(diskType)]; found {
for _, shardInfo := range diskInfo.EcShardInfos {
if needle.VolumeId(shardInfo.Id) == vid {
return erasure_coding.ShardsInfoFromVolumeEcShardInformationMessage(shardInfo)
}
}
}
// Returns an empty ShardsInfo struct on failure, to avoid potential nil dereferences.
return erasure_coding.NewShardsInfo()
}
// TODO: simplify me
func (ecNode *EcNode) addEcVolumeShards(vid needle.VolumeId, collection string, shardIds []erasure_coding.ShardId, diskType types.DiskType) *EcNode {
foundVolume := false
diskInfo, found := ecNode.info.DiskInfos[string(diskType)]
if found {
for _, ecsi := range diskInfo.EcShardInfos {
if needle.VolumeId(ecsi.Id) == vid {
si := erasure_coding.ShardsInfoFromVolumeEcShardInformationMessage(ecsi)
oldShardCount := si.Count()
for _, shardId := range shardIds {
si.Set(erasure_coding.NewShardInfo(shardId, 0))
}
ecsi.EcIndexBits = si.Bitmap()
ecsi.ShardSizes = si.SizesInt64()
ecNode.freeEcSlot -= si.Count() - oldShardCount
foundVolume = true
break
}
}
} else {
diskInfo = &master_pb.DiskInfo{
Type: string(diskType),
}
ecNode.info.DiskInfos[string(diskType)] = diskInfo
}
if !foundVolume {
si := erasure_coding.NewShardsInfo()
for _, id := range shardIds {
si.Set(erasure_coding.NewShardInfo(id, 0))
}
diskInfo.EcShardInfos = append(diskInfo.EcShardInfos, &master_pb.VolumeEcShardInformationMessage{
Id: uint32(vid),
Collection: collection,
EcIndexBits: si.Bitmap(),
ShardSizes: si.SizesInt64(),
DiskType: string(diskType),
})
ecNode.freeEcSlot -= si.Count()
}
return ecNode
}
func (ecNode *EcNode) deleteEcVolumeShards(vid needle.VolumeId, shardIds []erasure_coding.ShardId, diskType types.DiskType) *EcNode {
if diskInfo, found := ecNode.info.DiskInfos[string(diskType)]; found {
for _, eci := range diskInfo.EcShardInfos {
if needle.VolumeId(eci.Id) == vid {
si := erasure_coding.ShardsInfoFromVolumeEcShardInformationMessage(eci)
oldCount := si.Count()
for _, shardId := range shardIds {
si.Delete(shardId)
}
eci.EcIndexBits = si.Bitmap()
eci.ShardSizes = si.SizesInt64()
ecNode.freeEcSlot -= si.Count() - oldCount
}
}
}
return ecNode
}
// pickBestDiskOnNode selects the best disk on a node for placing a new EC shard
// It prefers disks of the specified type with fewer shards and more free slots
// When shardId is provided and dataShardCount > 0, it applies anti-affinity:
// - For data shards (shardId < dataShardCount): prefer disks without parity shards
// - For parity shards (shardId >= dataShardCount): prefer disks without data shards
// If strictDiskType is false, it will fall back to other disk types if no matching disk is found
func pickBestDiskOnNode(ecNode *EcNode, vid needle.VolumeId, diskType types.DiskType, strictDiskType bool, shardId erasure_coding.ShardId, dataShardCount int) uint32 {
if len(ecNode.disks) == 0 {
return 0 // No disk info available, let the server decide
}
var bestDiskId uint32
bestScore := -1
var fallbackDiskId uint32
fallbackScore := -1
// Determine if we're placing a data or parity shard
isDataShard := dataShardCount > 0 && int(shardId) < dataShardCount
for diskId, disk := range ecNode.disks {
if disk.freeEcSlots <= 0 {
continue
}
// Check existing shards on this disk for this volume
existingShards := 0
hasDataShards := false
hasParityShards := false
if si, ok := disk.ecShards[vid]; ok {
existingShards = si.Count()
// Check what type of shards are on this disk
if dataShardCount > 0 {
for _, existingShardId := range si.Ids() {
if int(existingShardId) < dataShardCount {
hasDataShards = true
} else {
hasParityShards = true
}
}
}
}
// Score: prefer disks with fewer total shards and fewer shards of this volume
// Lower score is better
score := disk.ecShardCount*10 + existingShards*100
// Apply anti-affinity penalty if applicable
if dataShardCount > 0 {
if isDataShard && hasParityShards {
// Penalize placing data shard on disk with parity shards
score += 1000
} else if !isDataShard && hasDataShards {
// Penalize placing parity shard on disk with data shards
score += 1000
}
}
if disk.diskType == string(diskType) {
// Matching disk type - this is preferred
if bestScore == -1 || score < bestScore {
bestScore = score
bestDiskId = diskId
}
} else if !strictDiskType {
// Non-matching disk type - use as fallback if allowed
if fallbackScore == -1 || score < fallbackScore {
fallbackScore = score
fallbackDiskId = diskId
}
}
}
// Return matching disk type if found, otherwise fallback
if bestDiskId != 0 {
return bestDiskId
}
return fallbackDiskId
}
// ecBalancer drives an EC balance run: it collects the cluster's EC nodes, hands
// them to the shared ecbalancer planner, and executes the planned shard moves.
// The balancing policy lives in weed/storage/erasure_coding/ecbalancer, shared
// with the EC balance worker so the two cannot drift.
type ecBalancer struct {
commandEnv *CommandEnv
ecNodes []*EcNode
replicaPlacement *super_block.ReplicaPlacement
applyBalancing bool
maxParallelization int
diskType types.DiskType
}
func EcBalance(commandEnv *CommandEnv, collections []string, dc string, ecReplicaPlacement *super_block.ReplicaPlacement, diskType types.DiskType, maxParallelization int, applyBalancing bool) (err error) {
// collect all ec nodes
allEcNodes, totalFreeEcSlots, err := collectEcNodesForDC(commandEnv, dc, diskType)
if err != nil {
return err
}
if totalFreeEcSlots < 1 {
return fmt.Errorf("no free ec shard slots. only %d left", totalFreeEcSlots)
}
ecb := &ecBalancer{
commandEnv: commandEnv,
ecNodes: allEcNodes,
replicaPlacement: ecReplicaPlacement,
applyBalancing: applyBalancing,
maxParallelization: maxParallelization,
diskType: diskType,
}
if len(collections) == 0 {
glog.V(1).Infof("WARNING: No collections to balance EC volumes across.\n")
}
return ecb.balance(collections)
}
// shellECRatio resolves a collection's EC data/parity counts, defaulting to the
// standard scheme. This is the shell's plug-in point for custom ratios.
func shellECRatio(_ string) (int, int) {
// Custom EC ratios are an enterprise feature; OSS uses the standard scheme.
return erasure_coding.DataShardsCount, erasure_coding.ParityShardsCount
}
// balance plans EC shard moves with the shared planner and executes them. When
// collections is empty all collections present are balanced.
func (ecb *ecBalancer) balance(collections []string) error {
topo := toBalancerTopology(ecb.ecNodes, collections, ecb.diskType)
moves := ecbalancer.Plan(topo, ecbalancer.Options{
DiskType: string(ecb.diskType),
ImbalanceThreshold: 0, // the shell balances to an even distribution
ReplicaPlacement: ecb.replicaPlacement,
Ratio: shellECRatio,
})
return ecb.executeMoves(moves)
}
// toBalancerTopology builds an ecbalancer.Topology from the shell's EcNode model,
// including the shards of the requested collections (all collections when empty).
func toBalancerTopology(ecNodes []*EcNode, collections []string, diskType types.DiskType) *ecbalancer.Topology {
allowed := make(map[string]bool, len(collections))
for _, c := range collections {
allowed[c] = true
}
topo := ecbalancer.NewTopology()
for _, en := range ecNodes {
rackKey := string(en.dc) + ":" + string(en.rack)
node := topo.AddNode(en.info.Id, string(en.dc), rackKey, en.freeEcSlot)
for diskId, d := range en.disks {
node.AddDisk(diskId, d.diskType, d.freeEcSlots, d.ecShardCount)
}
diskInfo, found := en.info.DiskInfos[string(diskType)]
if !found {
continue
}
for _, eci := range diskInfo.EcShardInfos {
if len(allowed) > 0 && !allowed[eci.Collection] {
continue
}
node.AddShards(eci.Id, eci.Collection, eci.DiskId, erasure_coding.ShardBits(eci.EcIndexBits))
}
}
return topo
}
// executeMoves carries out the planned moves. Phases run in order (a within-rack
// move can depend on a cross-rack move's result), and the independent moves
// within a phase run with up to maxParallelization concurrency. Apply mode does
// only the RPCs; dry-run mode runs sequentially and mutates the in-memory EcNode
// model so callers/tests can inspect the planned end state.
func (ecb *ecBalancer) executeMoves(moves []ecbalancer.Move) error {
byID := make(map[string]*EcNode, len(ecb.ecNodes))
for _, en := range ecb.ecNodes {
byID[en.info.Id] = en
}
// Plan emits moves grouped by phase; run each contiguous same-phase group
// together, waiting before the next so cross-phase dependencies hold.
for i := 0; i < len(moves); {
j := i
for j < len(moves) && moves[j].Phase == moves[i].Phase {
j++
}
if err := ecb.executePhase(byID, moves[i:j]); err != nil {
return err
}
i = j
}
return nil
}
func (ecb *ecBalancer) executePhase(byID map[string]*EcNode, moves []ecbalancer.Move) error {
if !ecb.applyBalancing {
// Dry-run: sequential so the in-memory model updates are race-free and
// reflect the full plan for inspection.
for _, m := range moves {
if err := ecb.executeMove(byID, m); err != nil {
return err
}
}
return nil
}
// Apply mode: parallelize across volumes, but run one volume's moves within a
// phase sequentially. Concurrent moves of the same volume to a node can race
// on its shared .ecx/.ecj/.vif sidecar files.
var order []uint32
byVol := make(map[uint32][]ecbalancer.Move)
for _, m := range moves {
if _, ok := byVol[m.VolumeID]; !ok {
order = append(order, m.VolumeID)
}
byVol[m.VolumeID] = append(byVol[m.VolumeID], m)
}
ewg := NewErrorWaitGroup(ecb.maxParallelization)
for _, vid := range order {
group := byVol[vid]
ewg.Add(func() error {
for _, m := range group {
if err := ecb.executeMove(byID, m); err != nil {
return err
}
}
return nil
})
}
return ewg.Wait()
}
func (ecb *ecBalancer) executeMove(byID map[string]*EcNode, m ecbalancer.Move) error {
src := byID[m.SourceNode]
if src == nil {
return nil
}
vid := needle.VolumeId(m.VolumeID)
shardId := erasure_coding.ShardId(m.ShardID)
shardIds := []erasure_coding.ShardId{shardId}
if m.Phase == "dedup" {
fmt.Printf("dedup: delete ec shard %d.%d on %s\n", vid, shardId, m.SourceNode)
if !ecb.applyBalancing {
src.deleteEcVolumeShards(vid, shardIds, ecb.diskType)
return nil
}
grpcDialOption := ecb.commandEnv.option.GrpcDialOption
addr := pb.NewServerAddressFromDataNode(src.info)
if err := unmountEcShards(grpcDialOption, vid, addr, shardIds); err != nil {
return err
}
return sourceServerDeleteEcShards(grpcDialOption, m.Collection, vid, addr, shardIds)
}
dst := byID[m.TargetNode]
if dst == nil {
return nil
}
if m.TargetDisk > 0 {
fmt.Printf("%s moves ec shard %d.%d to %s (disk %d)\n", m.SourceNode, vid, shardId, m.TargetNode, m.TargetDisk)
} else {
fmt.Printf("%s moves ec shard %d.%d to %s\n", m.SourceNode, vid, shardId, m.TargetNode)
}
if !ecb.applyBalancing {
// Dry-run: update the in-memory model only.
return moveMountedShardToEcNode(ecb.commandEnv, src, m.Collection, vid, shardId, dst, m.TargetDisk, false, ecb.diskType)
}
return ecb.applyShardMoveRPC(src, dst, m.Collection, vid, shardId, m.TargetDisk)
}
// applyShardMoveRPC copies a shard to the destination disk, then unmounts and
// deletes it on the source. It does not touch the in-memory model, so it is safe
// to run concurrently across the moves of a phase.
func (ecb *ecBalancer) applyShardMoveRPC(src, dst *EcNode, collection string, vid needle.VolumeId, shardId erasure_coding.ShardId, destDiskId uint32) error {
grpcDialOption := ecb.commandEnv.option.GrpcDialOption
srcAddr := pb.NewServerAddressFromDataNode(src.info)
copiedShardIds, err := oneServerCopyAndMountEcShardsFromSource(grpcDialOption, dst, []erasure_coding.ShardId{shardId}, vid, collection, srcAddr, destDiskId)
if err != nil {
return err
}
if len(copiedShardIds) == 0 {
return nil
}
if err := unmountEcShards(grpcDialOption, vid, srcAddr, copiedShardIds); err != nil {
return err
}
return sourceServerDeleteEcShards(grpcDialOption, collection, vid, srcAddr, copiedShardIds)
}
// compileCollectionPattern compiles a regex pattern for collection matching.
// Empty patterns match empty collections only.
// The special keyword CollectionDefault ("_default") matches empty collections.
func compileCollectionPattern(pattern string) (*regexp.Regexp, error) {
if pattern == "" {
// empty pattern matches empty collection
return regexp.Compile("^$")
}
if pattern == CollectionDefault {
// CollectionDefault keyword matches empty collection
return regexp.Compile("^$")
}
return regexp.Compile(pattern)
}