mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-21 17:21:34 +00:00
* fix(ec_mount): reject 0-byte .ecx and aggregate cross-disk failures

MountEcShards's per-disk loop bailed on the first disk returning a non-ENOENT error, and NewEcVolume wrapped its ENOENT with %v so the caller's `err == os.ErrNotExist` check never matched. On a multi-disk volume server where ec.balance / ec.rebuild had distributed shards across sibling disks while the matching .ecx never arrived, the mount loop bailed after disk 0 with "cannot open ec volume index" and the operator never saw that the rest of the disks were also empty.

The companion failure mode is a 0-byte .ecx stub left by EC distribute's writeToFile after a mid-stream copy failure: Stat() succeeds, treating the stub as a valid index, and downstream mount work proceeds against an empty file.

Wrap the ec-volume open errors with %w, treat a 0-byte .ecx as os.ErrNotExist (in NewEcVolume, findEcxIdxDirForVolume, and HasEcxFileOnDisk), and have MountEcShards collect per-disk failures before returning a single aggregated error. The "no .ecx anywhere" case gets a distinct error so the orchestrator can re-copy the index from a healthy replica rather than retry against the same broken state.

* fix(ec_reconcile): indexEcxOwners also rejects 0-byte .ecx stubs

findEcxIdxDirForVolume already skipped 0-byte .ecx during MountEcShards, but indexEcxOwners (used by reconcileEcShardsAcrossDisks at startup) still recorded the first .ecx by name only. On a store where one disk holds a 0-byte stub left by a failed EC distribute and a sibling disk holds the real index, the stub would win the owner selection, and NewEcVolume's new size check would then refuse to load against it, leaving the orphan shards unloaded even though a valid index exists.

Mirror the size check from findEcxIdxDirForVolume: skip directory entries whose .ecx Info() reports size 0 or whose Info() call fails.
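
For readers unfamiliar with why %v defeated the not-exist check in the ec_mount change, the following is a minimal sketch and not the actual SeaweedFS code. The names openIndex, mountAcrossDisks, and errNoEcxAnywhere are hypothetical stand-ins, and errors.Join assumes Go 1.20 or later. It shows %w keeping os.ErrNotExist visible to errors.Is, a loop that visits every disk instead of stopping at the first failure, and a distinct error for the case where no disk has an index.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// errNoEcxAnywhere is a hypothetical sentinel for "no disk has an index".
var errNoEcxAnywhere = errors.New("no .ecx index found on any local disk")

// openIndex is a hypothetical stand-in for the index open step.
// Wrapping with %w (not %v) keeps the underlying error in the chain,
// so callers can still detect os.ErrNotExist with errors.Is.
func openIndex(path string) error {
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("cannot open ec volume index %s: %w", path, err)
	}
	return nil
}

// mountAcrossDisks tries every candidate path rather than bailing on the
// first error, then reports either the distinct "nothing anywhere" error
// or all collected per-disk failures as one aggregated error.
func mountAcrossDisks(paths []string) error {
	var failures []error
	missing := 0
	for _, p := range paths {
		err := openIndex(p)
		if err == nil {
			return nil // found a usable index on this disk
		}
		if errors.Is(err, os.ErrNotExist) {
			missing++
			continue
		}
		failures = append(failures, err)
	}
	if missing == len(paths) {
		return errNoEcxAnywhere
	}
	return errors.Join(failures...)
}

func main() {
	// Neither path exists, so every disk reports a not-exist error.
	err := mountAcrossDisks([]string{"/no/such/disk0/7.ecx", "/no/such/disk1/7.ecx"})
	fmt.Println(errors.Is(err, errNoEcxAnywhere), err)
}
```

With %v in openIndex, errors.Is(err, os.ErrNotExist) would return false and every missing index would be counted as a hard failure instead.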
* fix(ec_mount): accept 0-byte .ecx as valid empty index

The previous commit treated a 0-byte .ecx in NewEcVolume as os.ErrNotExist, on the assumption that any empty .ecx was a stub left by a failed copy stream. That broke the legitimate empty-volume case: when an EC volume's source .idx has no live entries (e.g. all needles deleted before WriteSortedFileFromIdx), the sorted .ecx is genuinely 0 bytes and must mount. The integration test TestEcShardsToVolumeMissingShardAndNoLiveEntries fails with "MountEcShards: no .ecx index found on any local disk" because the mount path now refuses the legitimate empty index.

A 0-byte .ecx left by a failed copy stream is indistinguishable from the legitimate empty case by file size alone. Preventing stub files from being written is the receiver-side cleanup in writeToFile's job (the companion EC distribute PR), not NewEcVolume's at mount time.

The cross-disk lookup helpers (findEcxIdxDirForVolume, HasEcxFileOnDisk, indexEcxOwners) keep their size > 0 preference: when a real .ecx exists on a sibling disk alongside a stub, we still want to route mounts and reconcile at the real one. If no non-zero .ecx exists anywhere, the per-disk fallback in MountEcShards can still open the 0-byte .ecx and the volume mounts.

Replace TestMountEcShards_ZeroByteEcxOnlyDisk with TestMountEcShards_EmptyEcxMountsSuccessfully, which pins the empty-volume invariant.
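
The preference-with-fallback rule described in this commit can be sketched as below. This is an illustration under assumptions, not the real helpers: findEcxDir and demo are hypothetical names, and the sketch folds the size > 0 preference and the per-disk fallback into one function, whereas the commit keeps them in separate code paths.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// findEcxDir is a hypothetical simplification of the cross-disk lookup.
// It prefers the first directory whose .ecx is non-empty, and otherwise
// falls back to the first directory holding a 0-byte .ecx, since an
// empty index can be legitimate for a volume with no live entries.
func findEcxDir(dirs []string, baseName string) (string, bool) {
	fallback, haveFallback := "", false
	for _, dir := range dirs {
		info, err := os.Stat(filepath.Join(dir, baseName+".ecx"))
		if err != nil || info.IsDir() {
			continue // no usable index file on this disk
		}
		if info.Size() > 0 {
			return dir, true // a non-empty index always wins
		}
		if !haveFallback {
			fallback, haveFallback = dir, true
		}
	}
	return fallback, haveFallback
}

// demo builds two throwaway "disk" directories and runs both scenarios:
// a stub next to a real index, and an empty index on its own.
func demo() (string, string, error) {
	root, err := os.MkdirTemp("", "ecx-demo")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(root)

	disk0 := filepath.Join(root, "disk0") // will hold a 0-byte .ecx
	disk1 := filepath.Join(root, "disk1") // will hold a non-empty .ecx
	for _, d := range []string{disk0, disk1} {
		if err := os.Mkdir(d, 0o755); err != nil {
			return "", "", err
		}
	}
	if err := os.WriteFile(filepath.Join(disk0, "7.ecx"), nil, 0o644); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(filepath.Join(disk1, "7.ecx"), []byte("0123456789abcdef"), 0o644); err != nil {
		return "", "", err
	}

	withReal, _ := findEcxDir([]string{disk0, disk1}, "7")
	emptyOnly, _ := findEcxDir([]string{disk0}, "7")
	return filepath.Base(withReal), filepath.Base(emptyOnly), nil
}

func main() {
	withReal, emptyOnly, err := demo()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println(withReal, emptyOnly)
}
```

The design point is that file size only ranks candidates; it never disqualifies the sole candidate, which is what lets the empty-volume case mount.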