mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-21 09:11:29 +00:00
* fix(shell): count physical disks in cluster.status on multi-disk nodes
The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.
Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.
* fix(ec): attribute EC balance source disk per shard and reject same-node moves
On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.
Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.
* fix(ec): spread EC balance moves across destination disks
Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
- fewer total EC shards on the disk,
- far fewer shards of the same volume on the disk (spread a volume's shards
across disks for fault tolerance), and
- data/parity anti-affinity (a data shard avoids disks holding the volume's
parity shards and vice versa).
Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.
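
A minimal sketch of the ascending-score choice described above. The diskView type, its field names, and the weights are assumptions made for illustration; the actual pickBestDiskOnNode scoring is not shown in this message and may differ.

```go
package main

import "fmt"

// diskView is a hypothetical per-physical-disk summary; the field names are
// illustrative and not taken from the ecbalancer package.
type diskView struct {
	ID               uint32
	FreeSlots        int
	TotalShards      int  // all EC shards currently on the disk
	SameVolumeShards int  // shards of the volume being moved
	HoldsOtherKind   bool // holds parity when moving data, or data when moving parity
}

// diskScore returns a lower-is-better score. The weights are assumptions,
// chosen so same-volume concentration outweighs the raw shard count.
func diskScore(d diskView) int {
	score := d.TotalShards + 100*d.SameVolumeShards
	if d.HoldsOtherKind {
		score += 10
	}
	return score
}

// pickDisk returns the lowest-scoring disk that still has a free slot,
// breaking ties on the smaller disk id so the choice is deterministic.
func pickDisk(disks []diskView) (uint32, bool) {
	best := -1
	for i, d := range disks {
		if d.FreeSlots <= 0 {
			continue
		}
		if best < 0 {
			best = i
			continue
		}
		bs, ds := diskScore(disks[best]), diskScore(d)
		if ds < bs || (ds == bs && d.ID < disks[best].ID) {
			best = i
		}
	}
	if best < 0 {
		return 0, false
	}
	return disks[best].ID, true
}

func main() {
	disks := []diskView{
		{ID: 0, FreeSlots: 4, TotalShards: 3},
		{ID: 1, FreeSlots: 4, TotalShards: 1, SameVolumeShards: 1},
		{ID: 2, FreeSlots: 4, TotalShards: 2, HoldsOtherKind: true},
	}
	id, ok := pickDisk(disks)
	fmt.Println(id, ok)
}
```

In this sketch a caller that reserves a planned placement would bump TotalShards and SameVolumeShards on the chosen disk before picking the next destination, which is what keeps several moves to one node from landing on the same disk.
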
* fix(ec): bring EC balance worker to parity with shell ec.balance
The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:
- Cross-rack and within-rack now run a two-pass balance: data shards spread
first, then parity shards spread while avoiding racks/nodes that already hold
the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
10+4 here), replacing the previously hardcoded constant at the call sites.
Selection is deterministic (sorted keys) to keep behavior reproducible.
* refactor(ec): extract shared ecbalancer package for shell and worker
The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.
- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
within-rack data/parity two-pass with anti-affinity, global per-rack balance,
and diversity-aware disk selection) over a caller-built Topology snapshot and
returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
task proposals; the shell builds it from its EcNode model and executes Moves
via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
(worker balances by fractional fullness; shell by raw count) and
GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
in one pass).
The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.
* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor
Address regressions and gaps from the ecbalancer extraction:
- Shell ec.balance honors -maxParallelization again: planned moves run phase by
phase (preserving cross-phase dependencies) with bounded concurrency within a
phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
instead of combined totals, so a data/parity skew is corrected even when the
per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
move, so limited-capacity targets are no longer over-planned.
* fix(ec): make EC dedup keeper deterministic and capacity-aware
When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
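
The keeper rule can be sketched as a sort. The shardCopy type is hypothetical, and the tie-break direction (smaller node id wins) is an assumption; the message only states that ties are broken on node id.

```go
package main

import (
	"fmt"
	"sort"
)

// shardCopy is a hypothetical record of one copy of a duplicated shard.
type shardCopy struct {
	NodeID    string
	FreeSlots int
}

// splitKeeper returns the copy to keep (most free slots, then smaller node id)
// and the copies to delete. ok is false when there is nothing to choose from.
func splitKeeper(copies []shardCopy) (keep shardCopy, drop []shardCopy, ok bool) {
	if len(copies) == 0 {
		return shardCopy{}, nil, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]shardCopy(nil), copies...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].FreeSlots != sorted[j].FreeSlots {
			return sorted[i].FreeSlots > sorted[j].FreeSlots
		}
		return sorted[i].NodeID < sorted[j].NodeID
	})
	return sorted[0], sorted[1:], true
}

func main() {
	keep, drop, _ := splitKeeper([]shardCopy{
		{NodeID: "node-b", FreeSlots: 3},
		{NodeID: "node-a", FreeSlots: 3},
		{NodeID: "node-c", FreeSlots: 1},
	})
	fmt.Println("keep:", keep.NodeID, "delete from:", drop)
}
```
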
* fix(ec): restore global volume-diversity and per-volume move serialization
Two more behaviors lost in the ecbalancer refactor:
- Global rack balancing again prefers moving a shard of a volume the destination
does not hold at all before adding another shard of an already-present volume
(two-pass, mirroring the old balanceEcRack), keeping each volume's shards
spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
while still running different volumes in parallel, so concurrent moves of the
same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.
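
A rough sketch of per-volume serialization with bounded concurrency across volumes. The shardMove type and runPhase are hypothetical names, and keying by volume id alone is a simplification (a later commit in this series keys by collection and volume id).

```go
package main

import (
	"fmt"
	"sync"
)

// shardMove is a hypothetical planned move within one phase.
type shardMove struct {
	VolumeID uint32
	ShardID  int
}

// runPhase executes the moves of one phase. Moves of the same volume run
// sequentially in plan order; different volumes run concurrently, with at
// most maxParallel volumes in flight at a time.
func runPhase(moves []shardMove, maxParallel int, exec func(shardMove)) {
	if maxParallel < 1 {
		maxParallel = 1
	}
	byVolume := make(map[uint32][]shardMove)
	var order []uint32 // first-seen order keeps scheduling deterministic
	for _, m := range moves {
		if _, seen := byVolume[m.VolumeID]; !seen {
			order = append(order, m.VolumeID)
		}
		byVolume[m.VolumeID] = append(byVolume[m.VolumeID], m)
	}
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	for _, vid := range order {
		sem <- struct{}{} // blocks while maxParallel volumes are running
		wg.Add(1)
		go func(ms []shardMove) {
			defer wg.Done()
			defer func() { <-sem }()
			for _, m := range ms {
				exec(m)
			}
		}(byVolume[vid])
	}
	wg.Wait()
}

// recordOrder runs the phase and records, per volume, the order in which
// its shard moves were executed.
func recordOrder(moves []shardMove, maxParallel int) map[uint32][]int {
	var mu sync.Mutex
	seen := make(map[uint32][]int)
	runPhase(moves, maxParallel, func(m shardMove) {
		mu.Lock()
		defer mu.Unlock()
		seen[m.VolumeID] = append(seen[m.VolumeID], m.ShardID)
	})
	return seen
}

func main() {
	moves := []shardMove{{1, 0}, {2, 0}, {1, 1}, {2, 1}, {1, 2}}
	fmt.Println(recordOrder(moves, 2))
}
```
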
* fix(ec): key EC balance shards by (collection, volume id)
A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
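
To illustrate the keying change, a small sketch with a comparable struct key. The volumeKey and shardRecord types and the bitmask representation are assumptions for this example, not the ecbalancer's actual types.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id (hypothetical type).
type volumeKey struct {
	Collection string
	VolumeID   uint32
}

// shardRecord is a hypothetical per-node report: which shards of which volume.
type shardRecord struct {
	Collection string
	VolumeID   uint32
	ShardBits  uint32 // bit i set means shard i is present
}

// indexByVolume merges records into one bitmask per (collection, volume id).
func indexByVolume(records []shardRecord) map[volumeKey]uint32 {
	idx := make(map[volumeKey]uint32)
	for _, r := range records {
		idx[volumeKey{r.Collection, r.VolumeID}] |= r.ShardBits
	}
	return idx
}

func main() {
	idx := indexByVolume([]shardRecord{
		{Collection: "photos", VolumeID: 7, ShardBits: 0b0011},
		{Collection: "logs", VolumeID: 7, ShardBits: 0b0001},
	})
	// Two entries: shard 0 of logs/7 is not a duplicate of shard 0 of photos/7.
	fmt.Println(len(idx))
}
```
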
* fix(ec): credit freed capacity from dedup before later balance phases
Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).
* test(ec): add multi-disk EC balance integration test
Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.
* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes
collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.
* test(ec): end-to-end multi-disk EC balance shard-loss regression
Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.
* fix(ec): bring EC balance worker/plugin path to parity with shell
- Per-volume serialization and phase order: key the plugin proposal dedupe by
(collection, volume) instead of (volume, shard, source), so the scheduler runs
only one of a volume's moves at a time (within a run and against in-flight jobs).
Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
because the planner emits a volume's moves in phase order, they now execute in
order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
read it into the worker config, mirroring ec.balance -shardReplicaPlacement.
* test(ec): rename worker in-process test (not a real integration test)
The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.
* ci(ec): remove redundant ec-integration workflow
ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.
* fix(ec): worker falls back to master default replication for EC balance
For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.
The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.
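
The precedence can be sketched with plain strings. This is not the real resolveReplicaPlacement (which is not shown here); the function name, the string representation, and treating "000" as the zero value are assumptions for illustration.

```go
package main

import "fmt"

// resolvePlacementSketch returns the replica placement to enforce, or ""
// for even spread. Explicit config wins, then a non-zero master default.
func resolvePlacementSketch(explicit, masterDefault string) string {
	if explicit != "" {
		return explicit
	}
	if masterDefault != "" && masterDefault != "000" {
		return masterDefault
	}
	return "" // no constraint: even spread
}

func main() {
	fmt.Printf("%q\n", resolvePlacementSketch("020", "001")) // explicit wins
	fmt.Printf("%q\n", resolvePlacementSketch("", "001"))    // master default
	fmt.Printf("%q\n", resolvePlacementSketch("", "000"))    // even spread
}
```
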
* fix(ec): plugin path populates master default replication too
The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.
* docs(ec): empty shard replica placement uses master default, not even spread
The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
483 lines
17 KiB
Go
package shell

import (
	"context"
	"flag"
	"fmt"
	"io"
	"strings"

	"github.com/seaweedfs/seaweedfs/weed/glog"
	"github.com/seaweedfs/seaweedfs/weed/pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"

	"github.com/seaweedfs/seaweedfs/weed/operation"
	"github.com/seaweedfs/seaweedfs/weed/pb/master_pb"
	"github.com/seaweedfs/seaweedfs/weed/pb/volume_server_pb"
	"github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
)

func init() {
	Commands = append(Commands, &commandEcDecode{})
}

type commandEcDecode struct {
}

func (c *commandEcDecode) Name() string {
	return "ec.decode"
}

func (c *commandEcDecode) Help() string {
	return `decode an erasure coded volume into a normal volume

	ec.decode [-collection=""] [-volumeId=<volume_id>] [-diskType=<disk_type>] [-checkMinFreeSpace]

	The -collection parameter supports regular expressions for pattern matching:
	  - Use exact match: ec.decode -collection="^mybucket$"
	  - Match multiple buckets: ec.decode -collection="bucket.*"
	  - Match all collections: ec.decode -collection=".*"

	Options:
	  -diskType: source disk type where EC shards are stored (hdd, ssd, or empty for default hdd)
	  -checkMinFreeSpace: check min free space when selecting the decode target (default true)

	Examples:
	  # Decode EC shards from HDD (default)
	  ec.decode -collection=mybucket

	  # Decode EC shards from SSD
	  ec.decode -collection=mybucket -diskType=ssd

`
}

func (c *commandEcDecode) HasTag(CommandTag) bool {
	return false
}

func (c *commandEcDecode) Do(args []string, commandEnv *CommandEnv, writer io.Writer) (err error) {
	decodeCommand := flag.NewFlagSet(c.Name(), flag.ContinueOnError)
	volumeId := decodeCommand.Int("volumeId", 0, "the volume id")
	collection := decodeCommand.String("collection", "", "the collection name")
	diskTypeStr := decodeCommand.String("diskType", "", "source disk type where EC shards are stored (hdd, ssd, or empty for default hdd)")
	checkMinFreeSpace := decodeCommand.Bool("checkMinFreeSpace", true, "check min free space when selecting the decode target")
	if err = decodeCommand.Parse(args); err != nil {
		return nil
	}

	if err = commandEnv.confirmIsLocked(args); err != nil {
		return
	}

	vid := needle.VolumeId(*volumeId)
	diskType := types.ToDiskType(*diskTypeStr)

	// collect topology information
	topologyInfo, _, err := collectTopologyInfo(commandEnv, 0)
	if err != nil {
		return err
	}
	var diskUsageState *decodeDiskUsageState
	if *checkMinFreeSpace {
		diskUsageState = newDecodeDiskUsageState(topologyInfo, diskType)
	}

	// volumeId is provided
	if vid != 0 {
		return doEcDecode(commandEnv, topologyInfo, *collection, vid, diskType, *checkMinFreeSpace, diskUsageState)
	}

	// apply to all volumes in the collection
	volumeIds, err := collectEcShardIds(topologyInfo, *collection, diskType)
	if err != nil {
		return err
	}
	fmt.Printf("ec decode volumes: %v\n", volumeIds)
	for _, vid := range volumeIds {
		if err = doEcDecode(commandEnv, topologyInfo, *collection, vid, diskType, *checkMinFreeSpace, diskUsageState); err != nil {
			return err
		}
	}

	return nil
}

func doEcDecode(commandEnv *CommandEnv, topoInfo *master_pb.TopologyInfo, collection string, vid needle.VolumeId, diskType types.DiskType, checkMinFreeSpace bool, diskUsageState *decodeDiskUsageState) (err error) {
	if !commandEnv.isLocked() {
		return fmt.Errorf("lock is lost")
	}

	// find volume location
	nodeToEcShardsInfo := collectEcNodeShardsInfo(topoInfo, vid, diskType)

	fmt.Printf("ec volume %d shard locations: %+v\n", vid, nodeToEcShardsInfo)

	if len(nodeToEcShardsInfo) == 0 {
		return fmt.Errorf("no EC shards found for volume %d (diskType %s)", vid, diskType.ReadableString())
	}

	var originalShardCounts map[pb.ServerAddress]int
	if diskUsageState != nil {
		originalShardCounts = make(map[pb.ServerAddress]int, len(nodeToEcShardsInfo))
		for location, si := range nodeToEcShardsInfo {
			originalShardCounts[location] = si.Count()
		}
	}

	var eligibleTargets map[pb.ServerAddress]struct{}
	if checkMinFreeSpace {
		if diskUsageState == nil {
			return fmt.Errorf("min free space checking requires disk usage state")
		}
		eligibleTargets = make(map[pb.ServerAddress]struct{})
		for location := range nodeToEcShardsInfo {
			if freeCount, found := diskUsageState.freeVolumeCount(location); found && freeCount > 0 {
				eligibleTargets[location] = struct{}{}
			}
		}
		if len(eligibleTargets) == 0 {
			return fmt.Errorf("no eligible target datanodes with free volume slots for volume %d (diskType %s); use -checkMinFreeSpace=false to override", vid, diskType.ReadableString())
		}
	}

	// collect ec shards to the server with most space
	targetNodeLocation, err := collectEcShards(commandEnv, nodeToEcShardsInfo, collection, vid, eligibleTargets)
	if err != nil {
		return fmt.Errorf("collectEcShards for volume %d: %v", vid, err)
	}

	// generate a normal volume
	err = generateNormalVolume(commandEnv.option.GrpcDialOption, vid, collection, targetNodeLocation)
	if err != nil {
		// Special case: if the EC index has no live entries, decoding is a no-op.
		// Just purge EC shards and return success without generating/mounting an empty volume.
		if isEcDecodeEmptyVolumeErr(err) {
			if err := unmountAndDeleteEcShards(commandEnv.option.GrpcDialOption, collection, nodeToEcShardsInfo, vid); err != nil {
				return err
			}
			if diskUsageState != nil {
				diskUsageState.applyDecode(targetNodeLocation, originalShardCounts, false)
			}
			return nil
		}
		return fmt.Errorf("generate normal volume %d on %s: %v", vid, targetNodeLocation, err)
	}

	// mount the decoded volume after server-side offline compaction succeeded
	err = mountDecodedVolume(commandEnv.option.GrpcDialOption, targetNodeLocation, vid)
	if err != nil {
		return fmt.Errorf("mount decoded volume %d on %s: %v", vid, targetNodeLocation, err)
	}

	// Confirm the regenerated .dat is present and non-empty before destroying
	// the shards. Without this gate, a silent failure in generate/mount could
	// leave the cluster with neither shards nor volume.
	if err := verifyDecodedVolumeBeforeDelete(commandEnv.option.GrpcDialOption, targetNodeLocation, vid); err != nil {
		return fmt.Errorf("verify decoded volume %d on %s before deleting shards: %w", vid, targetNodeLocation, err)
	}

	// delete the previous ec shards
	err = unmountAndDeleteEcShardsWithPrefix("deleteDecodedEcShards", commandEnv.option.GrpcDialOption, collection, nodeToEcShardsInfo, vid)
	if err != nil {
		return fmt.Errorf("delete ec shards for volume %d: %v", vid, err)
	}
	if diskUsageState != nil {
		diskUsageState.applyDecode(targetNodeLocation, originalShardCounts, true)
	}

	return nil
}

func isEcDecodeEmptyVolumeErr(err error) bool {
	st, ok := status.FromError(err)
	if !ok {
		return false
	}
	if st.Code() != codes.FailedPrecondition {
		return false
	}
	// Keep this robust against wording tweaks while still being specific.
	return strings.Contains(st.Message(), erasure_coding.EcNoLiveEntriesSubstring)
}

func unmountAndDeleteEcShards(grpcDialOption grpc.DialOption, collection string, nodeToShardsInfo map[pb.ServerAddress]*erasure_coding.ShardsInfo, vid needle.VolumeId) error {
	return unmountAndDeleteEcShardsWithPrefix("unmountAndDeleteEcShards", grpcDialOption, collection, nodeToShardsInfo, vid)
}

func unmountAndDeleteEcShardsWithPrefix(prefix string, grpcDialOption grpc.DialOption, collection string, nodeToShardsInfo map[pb.ServerAddress]*erasure_coding.ShardsInfo, vid needle.VolumeId) error {
	ewg := NewErrorWaitGroup(len(nodeToShardsInfo))

	// unmount and delete ec shards in parallel (one goroutine per location)
	for location, si := range nodeToShardsInfo {
		location, si := location, si // capture loop variables for goroutine
		ewg.Add(func() error {
			fmt.Printf("unmount ec volume %d on %s has shards: %+v\n", vid, location, si.Ids())
			if err := unmountEcShards(grpcDialOption, vid, location, si.Ids()); err != nil {
				return fmt.Errorf("%s unmount ec volume %d on %s: %w", prefix, vid, location, err)
			}

			fmt.Printf("delete ec volume %d on %s has shards: %+v\n", vid, location, si.Ids())
			if err := sourceServerDeleteEcShards(grpcDialOption, collection, vid, location, si.Ids()); err != nil {
				return fmt.Errorf("%s delete ec volume %d on %s: %w", prefix, vid, location, err)
			}
			return nil
		})
	}
	return ewg.Wait()
}

func verifyDecodedVolumeBeforeDelete(grpcDialOption grpc.DialOption, target pb.ServerAddress, vid needle.VolumeId) error {
	var resp *volume_server_pb.ReadVolumeFileStatusResponse
	if err := operation.WithVolumeServerClient(false, target, grpcDialOption, func(client volume_server_pb.VolumeServerClient) error {
		r, e := client.ReadVolumeFileStatus(context.Background(), &volume_server_pb.ReadVolumeFileStatusRequest{
			VolumeId: uint32(vid),
		})
		if e != nil {
			return e
		}
		resp = r
		return nil
	}); err != nil {
		return fmt.Errorf("read volume file status: %w", err)
	}
	if resp.DatFileSize == 0 {
		return fmt.Errorf("decoded .dat is 0 bytes")
	}
	if resp.IdxFileSize == 0 {
		return fmt.Errorf("decoded .idx is 0 bytes")
	}
	glog.V(0).Infof("ec decode verification ok for volume %d on %s: dat=%d idx=%d", vid, target, resp.DatFileSize, resp.IdxFileSize)
	return nil
}

func mountDecodedVolume(grpcDialOption grpc.DialOption, targetNodeLocation pb.ServerAddress, vid needle.VolumeId) error {
	return operation.WithVolumeServerClient(false, targetNodeLocation, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		_, mountErr := volumeServerClient.VolumeMount(context.Background(), &volume_server_pb.VolumeMountRequest{
			VolumeId: uint32(vid),
		})
		return mountErr
	})
}

func generateNormalVolume(grpcDialOption grpc.DialOption, vid needle.VolumeId, collection string, sourceVolumeServer pb.ServerAddress) error {
	fmt.Printf("generateNormalVolume from ec volume %d on %s\n", vid, sourceVolumeServer)

	err := operation.WithVolumeServerClient(false, sourceVolumeServer, grpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
		_, genErr := volumeServerClient.VolumeEcShardsToVolume(context.Background(), &volume_server_pb.VolumeEcShardsToVolumeRequest{
			VolumeId:   uint32(vid),
			Collection: collection,
		})
		return genErr
	})

	return err
}

func collectEcShards(commandEnv *CommandEnv, nodeToShardsInfo map[pb.ServerAddress]*erasure_coding.ShardsInfo, collection string, vid needle.VolumeId, eligibleTargets map[pb.ServerAddress]struct{}) (targetNodeLocation pb.ServerAddress, err error) {
	maxShardCount := -1
	existingShardsInfo := erasure_coding.NewShardsInfo()
	for loc, si := range nodeToShardsInfo {
		if eligibleTargets != nil {
			if _, ok := eligibleTargets[loc]; !ok {
				continue
			}
		}
		toBeCopiedShardCount := si.MinusParityShards().Count()
		if toBeCopiedShardCount > maxShardCount {
			maxShardCount = toBeCopiedShardCount
			targetNodeLocation = loc
			existingShardsInfo = si
		}
	}
	if targetNodeLocation == "" {
		return "", fmt.Errorf("no eligible target datanodes available to decode volume %d", vid)
	}

	fmt.Printf("collectEcShards: ec volume %d collect shards to %s from: %+v\n", vid, targetNodeLocation, nodeToShardsInfo)

	copiedShardsInfo := erasure_coding.NewShardsInfo()
	for loc, si := range nodeToShardsInfo {
		if loc == targetNodeLocation {
			continue
		}

		needToCopyShardsInfo := si.Minus(existingShardsInfo).MinusParityShards()

		err = operation.WithVolumeServerClient(false, targetNodeLocation, commandEnv.option.GrpcDialOption, func(volumeServerClient volume_server_pb.VolumeServerClient) error {
			// Always collect .ecj from every shard location. Each server's .ecj
			// only contains deletions for needles whose data resides in shards
			// held by that server. Without merging all .ecj files, deletions
			// recorded on other servers would be lost during decode.
			if needToCopyShardsInfo.Count() > 0 {
				fmt.Printf("copy %d.%v %s => %s\n", vid, needToCopyShardsInfo.Ids(), loc, targetNodeLocation)
			} else {
				fmt.Printf("collect ecj %d %s => %s\n", vid, loc, targetNodeLocation)
			}

			_, copyErr := volumeServerClient.VolumeEcShardsCopy(context.Background(), &volume_server_pb.VolumeEcShardsCopyRequest{
				VolumeId:       uint32(vid),
				Collection:     collection,
				ShardIds:       needToCopyShardsInfo.IdsUint32(),
				CopyEcxFile:    false,
				CopyEcjFile:    true,
				CopyVifFile:    needToCopyShardsInfo.Count() > 0,
				SourceDataNode: string(loc),
			})
			if copyErr != nil {
				return fmt.Errorf("copy %d.%v %s => %s : %v\n", vid, needToCopyShardsInfo.Ids(), loc, targetNodeLocation, copyErr)
			}

			if needToCopyShardsInfo.Count() > 0 {
				fmt.Printf("mount %d.%v on %s\n", vid, needToCopyShardsInfo.Ids(), targetNodeLocation)
				_, mountErr := volumeServerClient.VolumeEcShardsMount(context.Background(), &volume_server_pb.VolumeEcShardsMountRequest{
					VolumeId:   uint32(vid),
					Collection: collection,
					ShardIds:   needToCopyShardsInfo.IdsUint32(),
				})
				if mountErr != nil {
					return fmt.Errorf("mount %d.%v on %s : %v\n", vid, needToCopyShardsInfo.Ids(), targetNodeLocation, mountErr)
				}
			}

			return nil
		})

		if err != nil {
			break
		}

		copiedShardsInfo.Add(needToCopyShardsInfo)
	}

	nodeToShardsInfo[targetNodeLocation] = existingShardsInfo.Plus(copiedShardsInfo)

	return targetNodeLocation, err
}

func lookupVolumeIds(commandEnv *CommandEnv, volumeIds []string) (volumeIdLocations []*master_pb.LookupVolumeResponse_VolumeIdLocation, err error) {
	var resp *master_pb.LookupVolumeResponse
	err = commandEnv.MasterClient.WithClient(false, func(client master_pb.SeaweedClient) error {
		resp, err = client.LookupVolume(context.Background(), &master_pb.LookupVolumeRequest{VolumeOrFileIds: volumeIds})
		return err
	})
	if err != nil {
		return nil, err
	}
	return resp.VolumeIdLocations, nil
}

func collectEcShardIds(topoInfo *master_pb.TopologyInfo, collectionPattern string, diskType types.DiskType) (vids []needle.VolumeId, err error) {
	// compile regex pattern for collection matching
	collectionRegex, err := compileCollectionPattern(collectionPattern)
	if err != nil {
		return nil, fmt.Errorf("invalid collection pattern '%s': %v", collectionPattern, err)
	}

	vidMap := make(map[uint32]bool)
	eachDataNode(topoInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		if diskInfo, found := dn.DiskInfos[string(diskType)]; found {
			for _, v := range diskInfo.EcShardInfos {
				if collectionRegex.MatchString(v.Collection) {
					vidMap[v.Id] = true
				}
			}
		}
	})

	for vid := range vidMap {
		vids = append(vids, needle.VolumeId(vid))
	}

	return
}

func collectEcNodeShardsInfo(topoInfo *master_pb.TopologyInfo, vid needle.VolumeId, diskType types.DiskType) map[pb.ServerAddress]*erasure_coding.ShardsInfo {
	res := make(map[pb.ServerAddress]*erasure_coding.ShardsInfo)
	eachDataNode(topoInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		if diskInfo, found := dn.DiskInfos[string(diskType)]; found {
			// A node may report several EcShardInfos for one volume, one per
			// physical disk holding shards of it (multi-disk nodes). Union them
			// rather than overwriting, or only the last disk's shards survive and
			// the node looks like it is missing shards it actually has.
			for _, v := range diskInfo.EcShardInfos {
				if v.Id == uint32(vid) {
					addr := pb.NewServerAddressFromDataNode(dn)
					si := erasure_coding.ShardsInfoFromVolumeEcShardInformationMessage(v)
					if existing, ok := res[addr]; ok {
						existing.Add(si)
					} else {
						res[addr] = si
					}
				}
			}
		}
	})

	return res
}

type decodeDiskUsageState struct {
	byNode map[pb.ServerAddress]*decodeDiskUsageCounts
}

type decodeDiskUsageCounts struct {
	maxVolumeCount    int64
	volumeCount       int64
	remoteVolumeCount int64
	ecShardCount      int64
}

func newDecodeDiskUsageState(topoInfo *master_pb.TopologyInfo, diskType types.DiskType) *decodeDiskUsageState {
	state := &decodeDiskUsageState{byNode: make(map[pb.ServerAddress]*decodeDiskUsageCounts)}
	eachDataNode(topoInfo, func(dc DataCenterId, rack RackId, dn *master_pb.DataNodeInfo) {
		if diskInfo, found := dn.DiskInfos[string(diskType)]; found {
			state.byNode[pb.NewServerAddressFromDataNode(dn)] = &decodeDiskUsageCounts{
				maxVolumeCount:    diskInfo.MaxVolumeCount,
				volumeCount:       diskInfo.VolumeCount,
				remoteVolumeCount: diskInfo.RemoteVolumeCount,
				ecShardCount:      int64(countShards(diskInfo.EcShardInfos)),
			}
		}
	})
	return state
}

func (state *decodeDiskUsageState) freeVolumeCount(location pb.ServerAddress) (int64, bool) {
	if state == nil {
		return 0, false
	}
	usage, found := state.byNode[location]
	if !found {
		return 0, false
	}
	free := usage.maxVolumeCount - (usage.volumeCount - usage.remoteVolumeCount)
	free -= (usage.ecShardCount + int64(erasure_coding.DataShardsCount) - 1) / int64(erasure_coding.DataShardsCount)
	return free, true
}

func (state *decodeDiskUsageState) applyDecode(targetNodeLocation pb.ServerAddress, shardCounts map[pb.ServerAddress]int, createdVolume bool) {
	if state == nil {
		return
	}
	for location, shardCount := range shardCounts {
		if usage, found := state.byNode[location]; found {
			usage.ecShardCount -= int64(shardCount)
		}
	}
	if createdVolume {
		if usage, found := state.byNode[targetNodeLocation]; found {
			usage.volumeCount++
		}
	}
}