seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-21 17:21:34 +00:00

Author	SHA1	Message	Date
Chris Lu	391f543ff2	fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594 ) * fix(shell): count physical disks in cluster.status on multi-disk nodes The master keys DataNodeInfo.DiskInfos by disk type, so several same-type physical disks on one node collapse into a single DiskInfo entry. cluster.status (printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting one disk per node instead of the real physical disk count, while volume.list and the admin ActiveTopology already split per physical disk. Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N same-type disks reports N. Cosmetic/diagnostic only; placement already uses the per-disk activeDisk map. * fix(ec): attribute EC balance source disk per shard and reject same-node moves On multi-disk nodes the EC balance worker built a node-level view that kept only the first physical disk id per (node, volume), so a move of a shard living on a different disk reported the wrong source disk. That source disk drives the per-disk capacity reservation, so the wrong disk drifts the capacity model the EC placement planner relies on. Track shards per physical disk and resolve the actual source disk for every emitted move (dedup, cross-rack, within-rack, global), keeping the per-disk view consistent as simulated moves are applied. Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the shard from every disk on the node) and copyAndMountShard skips the copy when source and target addresses match, so a same-node move would erase a shard it never copied. isDedupPhase now requires the same node AND disk, and Validate / Execute reject same-node cross-disk moves outright. * fix(ec): spread EC balance moves across destination disks Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance worker so a moved shard is placed on a good physical disk instead of always deferring to the volume server (target disk 0). 
The detection now builds a per-physical-disk view of each node (free slots split from the node total, exact EC shard count, disk type, discovered from both regular volumes and EC shards) and, for each cross-rack, within-rack, and global move, chooses the destination disk by ascending score: - fewer total EC shards on the disk, - far fewer shards of the same volume on the disk (spread a volume's shards across disks for fault tolerance), and - data/parity anti-affinity (a data shard avoids disks holding the volume's parity shards and vice versa). Planned placements are reserved on the in-memory model during a run so multiple shards moved to the same node spread across its disks rather than piling on one. * fix(ec): bring EC balance worker to parity with shell ec.balance The worker's cross-rack and within-rack balancing balanced shards by total count; the shell balances data and parity shards separately with anti-affinity and honors replica placement. Port that logic so the automatic balancer makes the same fault-tolerance-aware decisions as the manual command: - Cross-rack and within-rack now run a two-pass balance: data shards spread first, then parity shards spread while avoiding racks/nodes that already hold the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks and doBalanceEcShardsWithinOneRack. - Optional replica placement: a new replica_placement config (e.g. "020") constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty keeps the previous even-spread behavior. - The data/parity boundary is resolved from a per-collection EC ratio (standard 10+4 here), replacing the previously hardcoded constant at the call sites. Selection is deterministic (sorted keys) to keep behavior reproducible. 
* refactor(ec): extract shared ecbalancer package for shell and worker The EC shard balancing policy was duplicated between the shell ec.balance command and the admin EC balance worker, and the two had drifted (multi-disk handling, data/parity anti-affinity, replica placement). Extract the policy into a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers share so it cannot drift again. - ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and within-rack data/parity two-pass with anti-affinity, global per-rack balance, and diversity-aware disk selection) over a caller-built Topology snapshot and returns the shard Moves. It depends only on erasure_coding and super_block. - The worker builds the Topology from the master topology and turns Moves into task proposals; the shell builds it from its EcNode model and executes Moves via the existing move/delete RPCs. Per-collection EC ratio resolution stays in each caller (passed as Options.Ratio). - Options expose the two genuine policy differences: GlobalUtilizationBased (worker balances by fractional fullness; shell by raw count) and GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains in one pass). The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to the ecbalancer package; the shell and worker keep their adapter/execution tests. * fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor Address regressions and gaps from the ecbalancer extraction: - Shell ec.balance honors -maxParallelization again: planned moves run phase by phase (preserving cross-phase dependencies) with bounded concurrency within a phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and updates the in-memory model for inspection. 
- Rack and node balancing gate on per-type spread (data and parity separately) instead of combined totals, so a data/parity skew is corrected even when the per-rack/node totals are even. - Global rack balancing iterates the full shard-id space (MaxShardCount) so custom EC ratios with more than the standard total are candidates. - Cross-rack planning decrements the destination node's free slots per planned move, so limited-capacity targets are no longer over-planned. * fix(ec): make EC dedup keeper deterministic and capacity-aware When a shard is duplicated across nodes, keep the copy on the node with the most free slots and delete the duplicates from the more-constrained nodes, relieving capacity pressure where it is tightest. Tie-break on node id so the choice is deterministic. This unifies the shell and worker (the shell previously kept the least-free node, an incidental default) on the more sensible behavior. * fix(ec): restore global volume-diversity and per-volume move serialization Two more behaviors lost in the ecbalancer refactor: - Global rack balancing again prefers moving a shard of a volume the destination does not hold at all before adding another shard of an already-present volume (two-pass, mirroring the old balanceEcRack), keeping each volume's shards spread across nodes. - Shell apply-mode execution serializes a single volume's moves within a phase while still running different volumes in parallel, so concurrent moves of the same volume cannot race on its shared .ecx/.ecj/.vif sidecar files. * fix(ec): key EC balance shards by (collection, volume id) A numeric volume id can be reused across collections, and EC identity is (collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed Node.shards by vid alone, so volumes sharing an id across collections merged into one entry — letting dedup delete a "duplicate" that is actually a different collection's shard, and letting moves act across collections. 
Key shards by (collection, vid) throughout so each volume stays distinct. * fix(ec): credit freed capacity from dedup before later balance phases Dedup deletions are simulated only by applyMovesToTopology, which cleared shard bits but did not return the freed disk/node/rack slots. Later phases reject destinations with no free slots, so a slot opened by dedup could not be reused in the same Plan/ec.balance run. applyMovesToTopology now credits the freed disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline accounting their phase already did). * test(ec): add multi-disk EC balance integration test Cover issue 9593 end-to-end at the unit level the old tests missed: build the master's actual multi-disk wire format (same-type disks collapsed into one DiskInfo, real DiskId only in per-shard records), run it through a real ActiveTopology and the Detection entry point, then replay the planned moves with the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume, and a multi-rack spread, and asserts moves are safe (no same-node cross-disk), correctly attributed to the source disk, and redistribute concentrated volumes across both other racks and multiple destination disks. * fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a volume. A multi-disk node reports one EcShardInfo per physical disk holding shards of the volume, so only the last disk's shards survived — the node looked like it was missing shards it actually had. This made ec.encode's pre-delete verification (and ec.decode) under-count volumes whose shards are spread across disks on one server, falsely aborting the encode on multi-disk clusters. Union the per-disk shard sets per server instead. 
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the master via volume-server heartbeats, so a freshly distributed shard set may not be fully visible the instant the balance returns. Retry before concluding the set is incomplete; genuine loss still fails after the retries are exhausted. * test(ec): end-to-end multi-disk EC balance shard-loss regression Start a real cluster of multi-disk volume servers (3 servers x 4 disks), EC-encode a volume, run ec.balance, and assert hard invariants the prior integration tests only logged: after encode all 14 shards exist, ec.balance loses no shard, shards span more than one disk per node, and cluster.status counts physical disks (not one per node). This reproduces issue 9593 end to end and would have caught the multi-disk shard-aggregation bug fixed alongside it. * fix(ec): bring EC balance worker/plugin path to parity with shell - Per-volume serialization and phase order: key the plugin proposal dedupe by (collection, volume) instead of (volume, shard, source), so the scheduler runs only one of a volume's moves at a time (within a run and against in-flight jobs). Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and because the planner emits a volume's moves in phase order, they now execute in order across detection cycles, matching the shell. - disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks instead of nothing; apply the canonical type to planner options and move params. - Replica placement: expose shard_replica_placement in the admin config form and read it into the worker config, mirroring ec.balance -shardReplicaPlacement. * test(ec): rename worker in-process test (not a real integration test) The worker-package multi-disk tests build a fake master topology and simulate move execution; they are not real-cluster integration tests. 
Rename integration_test.go -> multidisk_detection_test.go and drop the Integration prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding. * ci(ec): remove redundant ec-integration workflow ec-integration.yml duplicated EC Integration Tests under the same workflow name but ran only 'go test ec_integration_test.go' (one file), so it never ran new test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole test/erasure_coding package on every push/PR. * fix(ec): worker falls back to master default replication for EC balance For strict parity with the shell, the EC balance worker now uses the master's configured default replication as the replica-placement fallback when no explicit shard_replica_placement is set, instead of always defaulting to even spread. The maintenance scanner reads it via GetMasterConfiguration each cycle and passes it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint (explicit config wins, else master default, else none) in resolveReplicaPlacement. A zero-replication default (the common 000 case) still means even spread, so the common configuration is unchanged. * fix(ec): plugin path populates master default replication too The plugin worker built ClusterInfo with only ActiveTopology, so the master default replication fallback added for the maintenance path never reached plugin-driven EC balance detection — empty shard_replica_placement still meant even spread there. Fetch the master default via GetMasterConfiguration (new pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement so both detection paths resolve replica placement identically to the shell. 
* docs(ec): empty shard replica placement uses master default, not even spread The EC balance config text (admin plugin form, legacy form help text, and the struct/proto field comments) still said an empty shard_replica_placement spreads evenly. The runtime resolves empty to the master default replication (resolveReplicaPlacement), matching shell ec.balance, with even spread only when that default is empty or zero. Update the text to match and regenerate worker_pb for the proto comment change.	2026-05-20 23:31:21 -07:00
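The dedup keeper rule from the commit above (keep the copy on the node with the most free slots, break ties on node id) can be sketched in a few lines of Go. This is only an illustration, not the ecbalancer code: the `holder` type and `pickKeeper` function are hypothetical names, and breaking ties in ascending node-id order is an assumption, since the message only says the tie-break is on node id.

```go
package main

import (
	"fmt"
	"sort"
)

// holder is a hypothetical view of one node holding a copy of a duplicated shard.
type holder struct {
	NodeID    string
	FreeSlots int
}

// pickKeeper returns the node that keeps the shard and the nodes whose
// duplicate copies would be deleted. ok is false when there are no holders.
func pickKeeper(holders []holder) (keeper holder, drop []holder, ok bool) {
	if len(holders) == 0 {
		return holder{}, nil, false
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]holder(nil), holders...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].FreeSlots != sorted[j].FreeSlots {
			return sorted[i].FreeSlots > sorted[j].FreeSlots // most free slots first
		}
		return sorted[i].NodeID < sorted[j].NodeID // assumed: ascending node id on ties
	})
	return sorted[0], sorted[1:], true
}

func main() {
	keeper, drop, _ := pickKeeper([]holder{
		{NodeID: "node-b", FreeSlots: 3},
		{NodeID: "node-a", FreeSlots: 3},
		{NodeID: "node-c", FreeSlots: 1},
	})
	fmt.Println("keep on:", keeper.NodeID)
	for _, d := range drop {
		fmt.Println("delete duplicate on:", d.NodeID)
	}
}
```

Sorting with a total order (free slots, then node id) is what makes the choice reproducible regardless of the order in which the holders are discovered.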
Chris Lu	3a8389cd68	fix(ec): verify full shard set before deleting source volume (#9490 ) (#9493 ) * fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. 
* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process.	2026-05-13 19:29:24 -07:00
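The pre-delete gate described in the commit above (union the per-server shard bitmaps and refuse to delete the source unless every shard id is observed) might look roughly like the following. It is a minimal sketch under stated assumptions: `shardBits`, `missingShards`, and the bound of 32 for `maxShardCount` are illustrative stand-ins, not the repository's `ShardBits`, `RequireFullShardSet`, or `MaxShardCount` definitions.

```go
package main

import "fmt"

// maxShardCount is an assumed upper bound on shard ids that fit in the bitmap.
const maxShardCount = 32

// shardBits is a hypothetical bitmap with one bit per shard id.
type shardBits uint32

// set marks a shard id as present; ids outside the bitmap are ignored.
func (b shardBits) set(id uint32) shardBits {
	if id >= maxShardCount {
		return b
	}
	return b | 1<<id
}

// missingShards unions the shard ids reported by every server and returns
// the ids in [0, totalShards) that no server reported.
func missingShards(perServer map[string][]uint32, totalShards int) []int {
	var seen shardBits
	for _, ids := range perServer {
		for _, id := range ids {
			seen = seen.set(id)
		}
	}
	var missing []int
	for id := 0; id < totalShards; id++ {
		if seen&(1<<uint(id)) == 0 {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	reported := map[string][]uint32{
		"server-1": {0, 1, 2, 3, 4},
		"server-2": {5, 6, 7, 8, 9},
		"server-3": {10, 11, 12}, // shard 13 never arrived
	}
	if missing := missingShards(reported, 14); len(missing) > 0 {
		fmt.Println("refusing to delete source volume; missing shards:", missing)
		return
	}
	fmt.Println("full shard set observed; source volume may be deleted")
}
```

Passing `totalShards` as an argument rather than reading a constant mirrors the parametrization the commit describes, so the same check works for ratios other than 10+4.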
Chris Lu	af68449a26	Process .ecj deletions during EC decode and vacuum decoded volume (#8863 ) * Process .ecj deletions during EC decode and vacuum decoded volume (#8798) When decoding EC volumes back to normal volumes, deletions recorded in the .ecj journal were not being applied before computing the dat file size or checking for live needles. This caused the decoded volume to include data for deleted files and could produce false positives in the all-deleted check. - Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first - Vacuum the decoded volume after mounting in ec.decode to compact out deleted needle data from the .dat file - Add integration tests for decoding with non-empty .ecj files * storage: add offline volume compaction helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: compact decoded volumes before deleting shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: address PR review comments - Fall back to data directory for .ecx when idx directory lacks it - Make compaction failure non-fatal during EC decode - Remove misleading "buffer: 10%" from space check error message * ec: collect .ecj from all shard locations during decode Each server's .ecj only contains deletions for needles whose data resides in shards held by that server. Previously, sources with no new data shards to contribute were skipped entirely, losing their .ecj deletion entries. Now .ecj is always appended from every shard location so RebuildEcxFile sees the full set of deletions. * ec: add integration tests for .ecj collection during decode TestEcDecodePreservesDeletedNeedles: verifies that needles deleted via VolumeEcBlobDelete are excluded from the decoded volume. TestEcDecodeCollectsEcjFromPeer: regression test for the fix in collectEcShards. 
Deletes a needle only on a peer server that holds no new data shards, then verifies the deletion survives decode via .ecj collection. * ec: address review nits in decode and tests - Remove double error wrapping in mountDecodedVolume - Check VolumeUnmount error in peer ecj test - Assert 404 specifically for deleted needles, fail on 5xx --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-01 01:15:26 -07:00
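To illustrate why the decode path in the commit above must read the deletion journal from every shard location, here is a small Go sketch. The `shardLocation` type and `collectDeletions` function are hypothetical, and modeling a `.ecj` file as a slice of needle ids (with de-duplication on merge) is an assumption made for the example; the real code appends journal files.

```go
package main

import (
	"fmt"
	"sort"
)

// shardLocation is a hypothetical summary of one server holding shards of a volume.
type shardLocation struct {
	Server         string
	NewDataShards  []int    // shards this server would contribute to the decode target
	DeletedNeedles []uint64 // needle ids recorded in this server's deletion journal
}

// collectDeletions merges deletion entries from every location, including
// locations that contribute no new data shards.
func collectDeletions(locations []shardLocation) []uint64 {
	seen := make(map[uint64]struct{})
	for _, loc := range locations {
		// Deliberately no "skip if len(loc.NewDataShards) == 0" shortcut:
		// that shortcut is what would lose a peer's deletions.
		for _, id := range loc.DeletedNeedles {
			seen[id] = struct{}{}
		}
	}
	merged := make([]uint64, 0, len(seen))
	for id := range seen {
		merged = append(merged, id)
	}
	sort.Slice(merged, func(i, j int) bool { return merged[i] < merged[j] })
	return merged
}

func main() {
	locations := []shardLocation{
		{Server: "target", NewDataShards: []int{0, 1, 2}, DeletedNeedles: []uint64{7}},
		{Server: "peer", NewDataShards: nil, DeletedNeedles: []uint64{42, 7}},
	}
	fmt.Println("deleted needles seen by decode:", collectDeletions(locations))
}
```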
Chris Lu	2dd3944819	Respect -minFreeSpace during ec.decode (#8467) * shell: add ec.decode ignoreMinFreeSpace flag Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * shell: respect minFreeSpace in ec.decode Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * shell: rename ec.decode minFreeSpace flag Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * shell: error when ec.decode has no shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * shell: select ec.decode target with zero shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * shell: adjust free counts across ec.decode Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * unused * Update weed/shell/command_ec_decode.go Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>	2026-02-27 23:54:30 -08:00
Lisandro Pin	6b98b52acc	Fix reporting of EC shard sizes from nodes to masters. (#7835) SeaweedFS tracks EC shard sizes on topology data structures, but this information is never relayed to master servers :( The end result is that commands reporting disk usage, such as `volume.list` and `cluster.status`, yield incorrect figures when EC shards are present. As an example for a simple 5-node test cluster, before... ``` > volume.list Topology volumeSizeLimit:30000 MB hdd(volume:6/40 active:6 free:33 remote:0) DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9001 hdd(volume:1/8 active:1 free:7 remote:0) Disk hdd(volume:1/8 active:1 free:7 remote:0) id:0 volume id:3 size:88967096 file_count:172 replica_placement:2 version:3 modified_at_second:1766349617 ec volume id:1 collection: shards:[1 5] Disk hdd total size:88967096 file_count:172 DataNode 192.168.10.111:9001 total size:88967096 file_count:172 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9002 hdd(volume:2/8 active:2 free:6 remote:0) Disk hdd(volume:2/8 active:2 free:6 remote:0) id:0 volume id:2 size:77267536 file_count:166 replica_placement:2 version:3 modified_at_second:1766349617 volume id:3 size:88967096 file_count:172 replica_placement:2 version:3 modified_at_second:1766349617 ec volume id:1 collection: shards:[0 4] Disk hdd total size:166234632 file_count:338 DataNode 192.168.10.111:9002 total size:166234632 file_count:338 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9003 hdd(volume:1/8 active:1 free:7 remote:0) Disk hdd(volume:1/8 active:1 free:7 remote:0) id:0 volume id:2 size:77267536 file_count:166 replica_placement:2 version:3 modified_at_second:1766349617 ec volume id:1 collection: 
shards:[2 6] Disk hdd total size:77267536 file_count:166 DataNode 192.168.10.111:9003 total size:77267536 file_count:166 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9004 hdd(volume:2/8 active:2 free:6 remote:0) Disk hdd(volume:2/8 active:2 free:6 remote:0) id:0 volume id:2 size:77267536 file_count:166 replica_placement:2 version:3 modified_at_second:1766349617 volume id:3 size:88967096 file_count:172 replica_placement:2 version:3 modified_at_second:1766349617 ec volume id:1 collection: shards:[3 7] Disk hdd total size:166234632 file_count:338 DataNode 192.168.10.111:9004 total size:166234632 file_count:338 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9005 hdd(volume:0/8 active:0 free:8 remote:0) Disk hdd(volume:0/8 active:0 free:8 remote:0) id:0 ec volume id:1 collection: shards:[8 9 10 11 12 13] Disk hdd total size:0 file_count:0 Rack DefaultRack total size:498703896 file_count:1014 DataCenter DefaultDataCenter total size:498703896 file_count:1014 total size:498703896 file_count:1014 ``` ...and after: ``` > volume.list Topology volumeSizeLimit:30000 MB hdd(volume:6/40 active:6 free:33 remote:0) DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9001 hdd(volume:1/8 active:1 free:7 remote:0) Disk hdd(volume:1/8 active:1 free:7 remote:0) id:0 volume id:2 size:81761800 file_count:161 replica_placement:2 version:3 modified_at_second:1766349495 ec volume id:1 collection: shards:[1 5 9] sizes:[1:8.00 MiB 5:8.00 MiB 9:8.00 MiB] total:24.00 MiB Disk hdd total size:81761800 file_count:161 DataNode 192.168.10.111:9001 total size:81761800 file_count:161 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack 
hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9002 hdd(volume:1/8 active:1 free:7 remote:0) Disk hdd(volume:1/8 active:1 free:7 remote:0) id:0 volume id:3 size:88678712 file_count:170 replica_placement:2 version:3 modified_at_second:1766349495 ec volume id:1 collection: shards:[11 12 13] sizes:[11:8.00 MiB 12:8.00 MiB 13:8.00 MiB] total:24.00 MiB Disk hdd total size:88678712 file_count:170 DataNode 192.168.10.111:9002 total size:88678712 file_count:170 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9003 hdd(volume:2/8 active:2 free:6 remote:0) Disk hdd(volume:2/8 active:2 free:6 remote:0) id:0 volume id:2 size:81761800 file_count:161 replica_placement:2 version:3 modified_at_second:1766349495 volume id:3 size:88678712 file_count:170 replica_placement:2 version:3 modified_at_second:1766349495 ec volume id:1 collection: shards:[0 4 8] sizes:[0:8.00 MiB 4:8.00 MiB 8:8.00 MiB] total:24.00 MiB Disk hdd total size:170440512 file_count:331 DataNode 192.168.10.111:9003 total size:170440512 file_count:331 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9004 hdd(volume:2/8 active:2 free:6 remote:0) Disk hdd(volume:2/8 active:2 free:6 remote:0) id:0 volume id:2 size:81761800 file_count:161 replica_placement:2 version:3 modified_at_second:1766349495 volume id:3 size:88678712 file_count:170 replica_placement:2 version:3 modified_at_second:1766349495 ec volume id:1 collection: shards:[2 6 10] sizes:[2:8.00 MiB 6:8.00 MiB 10:8.00 MiB] total:24.00 MiB Disk hdd total size:170440512 file_count:331 DataNode 192.168.10.111:9004 total size:170440512 file_count:331 DataCenter DefaultDataCenter hdd(volume:6/40 active:6 free:33 remote:0) Rack DefaultRack hdd(volume:6/40 active:6 free:33 remote:0) DataNode 192.168.10.111:9005 hdd(volume:0/8 active:0 free:8 
remote:0) Disk hdd(volume:0/8 active:0 free:8 remote:0) id:0 ec volume id:1 collection: shards:[3 7] sizes:[3:8.00 MiB 7:8.00 MiB] total:16.00 MiB Disk hdd total size:0 file_count:0 Rack DefaultRack total size:511321536 file_count:993 DataCenter DefaultDataCenter total size:511321536 file_count:993 total size:511321536 file_count:993 ```	2025-12-28 19:30:42 -08:00
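The per-shard `sizes:[...] total:...` field in the "after" output above can be reproduced with a short formatter. This is a sketch only: `formatShardSizes` is a hypothetical helper, and always rendering in MiB with two decimals is an assumption taken from the sample output rather than from the SeaweedFS source.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mib is the number of bytes in one mebibyte.
const mib = 1 << 20

// formatShardSizes renders per-shard byte sizes in the style of the
// volume.list output shown above, ordered by shard id, followed by the total.
func formatShardSizes(sizes map[int]uint64) string {
	ids := make([]int, 0, len(sizes))
	var total uint64
	for id, size := range sizes {
		ids = append(ids, id)
		total += size
	}
	sort.Ints(ids) // map iteration order is random, so sort for stable output

	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		parts = append(parts, fmt.Sprintf("%d:%.2f MiB", id, float64(sizes[id])/mib))
	}
	return fmt.Sprintf("sizes:[%s] total:%.2f MiB", strings.Join(parts, " "), float64(total)/mib)
}

func main() {
	// Three 8 MiB shards, as reported for node 192.168.10.111:9001 above.
	fmt.Println(formatShardSizes(map[int]uint64{1: 8 * mib, 5: 8 * mib, 9: 8 * mib}))
}
```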
Chris Lu	7ed7578424	fix(ec.decode): purge EC shards when volume is empty (#7749) * fix(ec.decode): purge EC shards when volume is empty When an EC volume has no live entries (all deleted), ec.decode should not generate an empty normal volume. Instead, treat decode as a no-op and allow shard purge to proceed cleanly. Fixes: #7748 * chore: address PR review comments * test: cover live EC index + avoid magic string * chore: harden empty-EC handling - Make shard cleanup best-effort (collect errors) - Remove unreachable EOF handling in HasLiveNeedles - Add empty ecx test case - Share no-live-entries substring between server/client * perf: parallelize EC shard unmount/delete across locations * refactor: combine unmount+delete into single goroutine per location * refactor: use errors.Join for multi-error aggregation * refactor: use existing ErrorWaitGroup for parallel execution * fix: capture loop variables + clarify SuperBlockSize safety	2025-12-14 17:06:13 -08:00
Chris Lu	df4f2f7020	ec: add -diskType flag to EC commands for SSD support (#7607 ) * ec: add diskType parameter to core EC functions Add diskType parameter to: - ecBalancer struct - collectEcVolumeServersByDc() - collectEcNodesForDC() - collectEcNodes() - EcBalance() This allows EC operations to target specific disk types (hdd, ssd, etc.) instead of being hardcoded to HardDriveType only. For backward compatibility, all callers currently pass types.HardDriveType as the default value. Subsequent commits will add -diskType flags to the individual EC commands. * ec: update helper functions to use configurable diskType Update the following functions to accept/use diskType parameter: - findEcVolumeShards() - addEcVolumeShards() - deleteEcVolumeShards() - moveMountedShardToEcNode() - countShardsByRack() - pickNEcShardsToMoveFrom() All ecBalancer methods now use ecb.diskType instead of hardcoded types.HardDriveType. Non-ecBalancer callers (like volumeServer.evacuate and ec.rebuild) use types.HardDriveType as the default. Update all test files to pass diskType where needed. * ec: add -diskType flag to ec.balance and ec.encode commands Add -diskType flag to specify the target disk type for EC operations: - ec.balance -diskType=ssd - ec.encode -diskType=ssd The disk type can be 'hdd', 'ssd', or empty for default (hdd). This allows placing EC shards on SSD or other disk types instead of only HDD. 
Example usage: ec.balance -collection=mybucket -diskType=ssd -apply ec.encode -collection=mybucket -diskType=ssd -force * test: add integration tests for EC disk type support Add integration tests to verify the -diskType flag works correctly: - TestECDiskTypeSupport: Tests EC encode and balance with SSD disk type - TestECDiskTypeMixedCluster: Tests EC operations on a mixed HDD/SSD cluster The tests verify: - Volume servers can be configured with specific disk types - ec.encode accepts -diskType flag and encodes to the correct disk type - ec.balance accepts -diskType flag and balances on the correct disk type - Mixed disk type clusters work correctly with separate collections * ec: add -sourceDiskType to ec.encode and -diskType to ec.decode ec.encode: - Add -sourceDiskType flag to filter source volumes by disk type - This enables tier migration scenarios (e.g., SSD volumes → HDD EC shards) - -diskType specifies target disk type for EC shards ec.decode: - Add -diskType flag to specify source disk type where EC shards are stored - Update collectEcShardIds() and collectEcNodeShardBits() to accept diskType Examples: # Encode SSD volumes to HDD EC shards (tier migration) ec.encode -collection=mybucket -sourceDiskType=ssd -diskType=hdd # Decode EC shards from SSD ec.decode -collection=mybucket -diskType=ssd Integration tests updated to cover new flags. * ec: fix variable shadowing and add -diskType to ec.rebuild and volumeServer.evacuate Address code review comments: 1. Fix variable shadowing in collectEcVolumeServersByDc(): - Rename loop variable 'diskType' to 'diskTypeKey' and 'diskTypeStr' to avoid shadowing the function parameter 2. Fix hardcoded HardDriveType in ecBalancer methods: - balanceEcRack(): use ecb.diskType instead of types.HardDriveType - collectVolumeIdToEcNodes(): use ecb.diskType 3. Add -diskType flag to ec.rebuild command: - Add diskType field to ecRebuilder struct - Pass diskType to collectEcNodes() and addEcVolumeShards() 4. 
Add -diskType flag to volumeServer.evacuate command: - Add diskType field to commandVolumeServerEvacuate struct - Pass diskType to collectEcVolumeServersByDc() and moveMountedShardToEcNode() * test: add diskType field to ecBalancer in TestPickEcNodeToBalanceShardsInto Address nitpick comment: ensure test ecBalancer struct has diskType field set for consistency with other tests. * ec: filter disk selection by disk type in pickBestDiskOnNode When evacuating or rebalancing EC shards, pickBestDiskOnNode now filters disks by the target disk type. This ensures: 1. EC shards from SSD disks are moved to SSD disks on destination nodes 2. EC shards from HDD disks are moved to HDD disks on destination nodes 3. No cross-disk-type shard movement occurs This maintains the storage tier isolation when moving EC shards between nodes during evacuation or rebalancing operations. * ec: allow disk type fallback during evacuation Update pickBestDiskOnNode to accept a strictDiskType parameter: - strictDiskType=true (balancing): Only use disks of matching type. This maintains storage tier isolation during normal rebalancing. - strictDiskType=false (evacuation): Prefer same disk type, but fall back to other disk types if no matching disk is available. This ensures evacuation can complete even when same-type capacity is insufficient. Priority order for evacuation: 1. Same disk type with lowest shard count (preferred) 2. Different disk type with lowest shard count (fallback) * test: use defer for lock/unlock to prevent lock leaks Use defer to ensure locks are always released, even on early returns or test failures. This prevents lock leaks that could cause subsequent tests to hang or fail. 
Changes: - Return early if lock acquisition fails - Immediately defer unlock after successful lock - Remove redundant explicit unlock calls at end of tests - Fix unused variable warning (err -> encodeErr/locErr) * ec: dynamically discover disk types from topology for evacuation Disk types are free-form tags (e.g., 'ssd', 'nvme', 'archive') that come from the topology, not a hardcoded set. Only 'hdd' (or empty) is the default disk type. Use collectVolumeDiskTypes() to discover all disk types present in the cluster topology instead of hardcoding [HardDriveType, SsdType]. * test: add evacuation fallback and cross-rack EC placement tests Add two new integration tests: 1. TestEvacuationFallbackBehavior: - Tests that when same disk type has no capacity, shards fall back to other disk types during evacuation - Creates cluster with 1 SSD + 2 HDD servers (limited SSD capacity) - Verifies pickBestDiskOnNode behavior with strictDiskType=false 2. TestCrossRackECPlacement: - Tests EC shard distribution across different racks - Creates cluster with 4 servers in 4 different racks - Verifies shards are spread across multiple racks - Tests that ec.balance respects rack placement Helper functions added: - startLimitedSsdCluster: 1 SSD + 2 HDD servers - startMultiRackCluster: 4 servers in 4 racks - countShardsPerRack: counts EC shards per rack from disk * test: fix collection mismatch in TestCrossRackECPlacement The EC commands were using collection 'rack_test' but uploaded test data uses collection 'test' (default). This caused ec.encode/ec.balance to not find the uploaded volume. Fix: Change EC commands to use '-collection test' to match the uploaded data. Addresses review comment from PR #7607. * test: close log files in MultiDiskCluster.Stop() to prevent FD leaks Track log files in MultiDiskCluster.logFiles and close them in Stop() to prevent file descriptor accumulation in long-running or many-test scenarios. Addresses review comment about logging resources cleanup. 
* test: improve EC integration tests with proper assertions - Add assertNoFlagError helper to detect flag parsing regressions - Update diskType subtests to fail on flag errors (ec.encode, ec.balance, ec.decode) - Update verify_disktype_flag_parsing to check help output contains diskType - Remove verify_fallback_disk_selection (was documentation-only, not executable) - Add assertion to verify_cross_rack_distribution for minimum 2 racks - Consolidate uploadTestDataWithDiskType to accept collection parameter - Remove duplicate uploadTestDataWithDiskTypeMixed function * test: extract captureCommandOutput helper and fix error handling - Add captureCommandOutput helper to reduce code duplication in diskType tests - Create commandRunner interface to match shell command Do method - Update ec_encode_with_ssd_disktype, ec_balance_with_ssd_disktype, ec_encode_with_source_disktype, ec_decode_with_disktype to use helper - Fix filepath.Glob error handling in countShardsPerRack instead of ignoring it * test: add flag validation to ec_balance_targets_correct_disk_type Add assertNoFlagError calls after ec.balance commands to ensure -diskType flag is properly recognized for both SSD and HDD disk types. * test: add proper assertions for EC command results - ec_encode_with_ssd_disktype: check for expected volume-related errors - ec_balance_with_ssd_disktype: require success with require.NoError - ec_encode_with_source_disktype: check for expected no-volume errors - ec_decode_with_disktype: check for expected no-ec-volume errors - upload_to_ssd_and_hdd: use require.NoError for setup validation Tests now properly fail on unexpected errors rather than just logging. * test: fix missing unlock in ec_encode_with_disk_awareness Add defer unlock pattern to ensure lock is always released, matching the pattern used in other subtests. 
* test: improve helper robustness - Make assertNoFlagError case-insensitive for pattern matching - Use defer in captureCommandOutput to restore stdout/stderr and close pipe ends to avoid FD leaks even if cmd.Do panics	2025-12-10 22:42:52 -08:00
Chris Lu	41aedaa687	Shell: support regular expression for collection selection (#7158 ) * support regular expression for collection selection * refactor * ordering * fix exact match * Update command_volume_balance_test.go * simplify * Update command_volume_balance.go * comment	2025-08-23 11:04:24 -07:00
Lisandro Pin	8c82c037b9	Unify the re-balancing logic for `ec.encode` with `ec.balance`. (#6339 ) Among others, this enables recent changes related to topology aware re-balancing at EC encoding time.	2024-12-10 13:30:13 -08:00
Lisandro Pin	0d5393641e	Unify usage of shell.EcNode.dc as DataCenterId. (#6258 )	2024-11-19 06:33:18 -08:00
chrislu	ec30a504ba	refactor	2024-09-29 10:38:22 -07:00
chrislu	701abbb9df	add IsResourceHeavy() to command interface	2024-09-28 20:23:01 -07:00
jsh	47112917ff	ec.decode: mount the collected ec shards	2023-11-10 00:04:42 -08:00
chrislu	0fd7222d65	default to skip if less than 4 nodes	2023-10-05 11:13:48 -07:00
Ryan Russell	dfbd8efd26	refactor(command_ec_decode): `exisitngEcIndexBits` -> `existingEcInde… (#3674 ) refactor(command_ec_decode): `exisitngEcIndexBits` -> `existingEcIndexBits` Signed-off-by: Ryan Russell <git@ryanrussell.org> Signed-off-by: Ryan Russell <git@ryanrussell.org>	2022-09-14 12:02:33 -07:00
chrislu	676e27c589	shell: stop long running jobs if lock is lost	2022-08-22 14:12:23 -07:00
chrislu	26dbc6c905	move to https://github.com/seaweedfs/seaweedfs	2022-07-29 00:17:28 -07:00
chrislu	6793bc853c	help message when in simulation mode	2022-05-31 14:48:46 -07:00
chrislu	f18803424a	volume.balance: add delay during tight loop fix https://github.com/chrislusf/seaweedfs/issues/2637	2022-02-08 00:53:55 -08:00
chrislu	9f9ef1340c	use streaming mode for long poll grpc calls streaming mode would create separate grpc connections for each call. this is to ensure the long poll connections are properly closed.	2021-12-26 00:15:03 -08:00
chrislu	a2d3f89c7b	add lock messages	2021-12-10 13:24:38 -08:00
Chris Lu	d774fa6c9a	rename variable	2021-10-25 14:39:20 -07:00
Chris Lu	2539ba0b62	fix compilation	2021-10-25 14:38:48 -07:00
Chris Lu	5f2d7c1589	erasure coding: skip erasure coding if less than recommended 4 nodes	2021-10-25 14:38:11 -07:00
Chris Lu	e862b2529a	refactor	2021-10-01 12:10:11 -07:00
Konstantin Lebedev	5e64b22b45	check that the topology has been updated	2021-10-01 18:51:22 +05:00
Chris Lu	119d5908dd	shell: do not need to lock to see volume -h	2021-09-13 22:13:34 -07:00
Chris Lu	e5fc35ed0c	change server address from string to a type	2021-09-12 22:47:52 -07:00
Chris Lu	c6f992b2a3	remove dead code	2021-07-30 15:18:01 -07:00
Chris Lu	1c233ad986	refactoring	2021-02-22 00:28:42 -08:00
Chris Lu	36f95e50a9	avoid possible nil disk info	2021-02-16 05:13:48 -08:00
Chris Lu	f8446b42ab	this can compile now!!!	2021-02-16 02:47:02 -08:00
Chris Lu	99c4e50d3d	minor	2020-11-28 00:14:11 -08:00
Chris Lu	73564e6a01	master: add cluster wide lock/unlock operation in weed shell fix https://github.com/chrislusf/seaweedfs/issues/1286	2020-04-23 13:37:31 -07:00
Chris Lu	892e726eb9	avoid reusing context object fix https://github.com/chrislusf/seaweedfs/issues/1182	2020-02-25 21:50:12 -08:00
Chris Lu	72a64a5cf8	use the same context object in order to retry	2020-01-26 14:42:11 -08:00
Chris Lu	37b64a50b4	ec: generate and copy .vif file	2019-12-28 12:44:59 -08:00
Chris Lu	d960b3474a	tier storage: support downloading the remote dat files	2019-12-25 09:53:13 -08:00
Chris Lu	1ad34a2487	ed.decode prefers servers with most data shards	2019-12-24 00:00:45 -08:00
Chris Lu	a18f62bbe7	only copy required shards	2019-12-23 18:06:13 -08:00
Chris Lu	09ca936c78	shell: add ec.decode command	2019-12-23 12:48:20 -08:00

41 Commits
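
The "ec: allow disk type fallback during evacuation" entry in the 2025-12-10 commit above describes a two-level priority for choosing a destination disk: same disk type with the lowest shard count first, then (only when not strict) a different disk type with the lowest shard count. The following is a minimal Go sketch of that priority, not the actual pickBestDiskOnNode implementation. The `disk` struct, its fields, and the `pickDisk` function are hypothetical names invented for illustration, and the sketch assumes a simple "free slots" capacity check and does not normalize the empty disk type to 'hdd'.

```go
package main

import "fmt"

// disk is a hypothetical, simplified stand-in for one physical disk on a node.
type disk struct {
	ID         uint32
	Type       string // free-form tag from the topology, e.g. "hdd", "ssd", "nvme"
	ShardCount int    // EC shards already placed on this disk
	FreeSlots  int    // assumed capacity measure; 0 means the disk is full
}

// pickDisk returns the disk with the fewest shards among disks that have
// free capacity and match wantType. When strict is false and no matching
// disk has capacity, it falls back to the least-loaded disk of any other type.
func pickDisk(disks []disk, wantType string, strict bool) (disk, bool) {
	var best, fallback *disk
	for i := range disks {
		d := &disks[i]
		if d.FreeSlots <= 0 {
			continue // full disks are never candidates
		}
		if d.Type == wantType {
			if best == nil || d.ShardCount < best.ShardCount {
				best = d
			}
		} else if !strict {
			if fallback == nil || d.ShardCount < fallback.ShardCount {
				fallback = d
			}
		}
	}
	if best != nil {
		return *best, true
	}
	if fallback != nil {
		return *fallback, true
	}
	return disk{}, false
}

func main() {
	disks := []disk{
		{ID: 1, Type: "ssd", ShardCount: 5, FreeSlots: 0}, // SSD is full
		{ID: 2, Type: "hdd", ShardCount: 3, FreeSlots: 4},
		{ID: 3, Type: "hdd", ShardCount: 1, FreeSlots: 2},
	}

	// Balancing (strict): no SSD has capacity, so nothing is selected.
	_, ok := pickDisk(disks, "ssd", true)
	fmt.Println("strict ssd found:", ok)

	// Evacuation (non-strict): falls back to the least-loaded HDD.
	d, ok := pickDisk(disks, "ssd", false)
	fmt.Println("fallback ssd found:", ok, "disk id:", d.ID)
}
```

In this sketch, strict mode corresponds to the balancing case (tier isolation is preserved, even if that means no move), while non-strict mode corresponds to the evacuation case, where completing the move takes priority over keeping the shard on the same disk type.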