mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-23 18:21:28 +00:00
* fix(volume_server): load orphan EC shards across disks on startup (#9212)

When ec.balance / ec.rebuild copies an EC shard onto a destination node without also pinning subsequent shards to the disk that holds .ecx, the shard ends up on a different physical disk than its index files. The per-disk loadAllEcShards has no visibility into other DiskLocations on the same store, so those orphan shards were silently left out of ecVolumes and never reported to master: volume.list showed partial counts, and ec.rebuild reported the volume as unrepairable even though all shards were physically present.

After every DiskLocation finishes its initial pass, sweep the store for shard files that are on disk but not yet in any EcVolume, look up the .ecx-owning sibling disk, and load each shard against its physical disk with dirIdx pointing at the sibling. Each shard is still registered on its own disk's ecVolumes map so heartbeat reporting carries the right DiskId per shard (master fix #9219 already aggregates per-disk messages correctly).

Also fall back to dirIdx for .vif lookup when dir != dirIdx, so the reconciliation path doesn't write a stub .vif on the shard disk and lose the real EC config and datFileSize.

* fix(volume_server): track actual .ecx dir in cross-disk reconcile

indexEcxOwners scans both IdxDirectory and Directory to find each volume's .ecx; the second scan covers the legacy case where index files were written into the data dir before -dir.idx was configured (removeEcVolumeFiles already accounts for this in disk_location_ec.go). But the returned map dropped which directory matched, and reconcile unconditionally passed owner.IdxDirectory to loadEcShardsWithIdxDir.

When the owner's .ecx is in Directory and IdxDirectory != Directory (server later re-configured with -dir.idx pointing at a fresh path), NewEcVolume opens IdxDirectory/.ecx → ENOENT, retries the same-disk fallback at dataBaseFileName+.ecx, but dataBaseFileName uses the *orphan* disk's data dir, not the owner's, so it ENOENTs again and the orphan shards stay unloaded.

Track which scan dir matched in indexEcxOwners and pass it through. Adds TestLoadEcShardsWhenOwnerEcxIsInDataDir as the regression.

Reported in PR #9244 review by @gemini-code-assist and @coderabbitai.

* refactor(storage): thread dataShardCount as a parameter into calculateExpectedShardSize

The helper used erasure_coding.DataShardsCount directly, but tests in store_ec_orphan_shard_test.go save .vif with a local dataShards=10 constant. If the package default ever diverged from 10 (e.g. an enterprise build), the test would write a .vif for one layout while sizing shard files for another and silently break.

Take dataShardCount as a parameter. Existing callers (validateEcVolume + size-validation tests + real-world tests) pass erasure_coding.DataShardsCount unchanged. The orphan-shard tests pass the same dataShards local they save into .vif, so the persisted shape and the on-disk shape stay consistent.

Reported in PR #9244 review by @coderabbitai.
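To illustrate why the shard count is threaded through as a parameter, here is a minimal sketch of a sizing helper. The function expectedShardSize, the block-size constants, and the simplified large-row/small-row sizing rule are assumptions made for this example; they are not claimed to match the body of calculateExpectedShardSize.

```go
package main

import "fmt"

// Assumed block sizes for this sketch only.
const (
	largeBlockSize int64 = 1024 * 1024 * 1024 // 1 GiB
	smallBlockSize int64 = 1024 * 1024        // 1 MiB
)

// expectedShardSize returns the per-shard size for a .dat file of the given
// size. The data shard count is an explicit argument instead of a package
// constant, so the caller decides which layout is being sized.
func expectedShardSize(datFileSize int64, dataShardCount int) int64 {
	if datFileSize <= 0 || dataShardCount <= 0 {
		return 0
	}
	n := int64(dataShardCount)

	// Full rows of large blocks, one block per data shard.
	largeRow := largeBlockSize * n
	largeRows := datFileSize / largeRow
	size := largeRows * largeBlockSize

	// The remainder is covered by rows of small blocks, rounded up.
	remaining := datFileSize - largeRows*largeRow
	if remaining > 0 {
		smallRow := smallBlockSize * n
		smallRows := (remaining + smallRow - 1) / smallRow
		size += smallRows * smallBlockSize
	}
	return size
}

func main() {
	// The same local value would be written to .vif and used for sizing,
	// so the persisted layout and the on-disk layout cannot drift apart.
	const dataShards = 10
	datFileSize := int64(25 * 1024 * 1024)
	fmt.Println(expectedShardSize(datFileSize, dataShards))
}
```

With the count passed in, a test that persists dataShards=10 sizes its shard files for 10 data shards even if the package default were different.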