seaweedfs/weed/storage/erasure_coding
Chris Lu cba2f7b1dd fix(volume_server): load orphan EC shards across disks on startup (#9212) (#9244)
* fix(volume_server): load orphan EC shards across disks on startup (#9212)

When ec.balance / ec.rebuild copies an EC shard onto a destination node
without also pinning subsequent shards to the disk that holds .ecx, the
shard ends up on a different physical disk than its index files. The
per-disk loadAllEcShards has no visibility into other DiskLocations on
the same store, so those orphan shards were silently left out of
ecVolumes and never reported to master — volume.list showed partial
counts, and ec.rebuild reported the volume as unrepairable even though
all shards were physically present.

After every DiskLocation finishes its initial pass, sweep the store for
shard files that are on disk but not yet in any EcVolume, look up the
.ecx-owning sibling disk, and load each shard against its physical disk
with dirIdx pointing at the sibling. Each shard is still registered on
its own disk's ecVolumes map so heartbeat reporting carries the right
DiskId per shard (master fix #9219 already aggregates per-disk
messages correctly).
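
The sweep described above can be sketched roughly as follows. This is a minimal in-memory model, not the actual SeaweedFS code: the `disk`, `shardFile`, and `orphanLoad` types and the `findOrphanShards` function are hypothetical stand-ins for DiskLocation, the on-disk shard files, and the reconcile step.

```go
package main

import "fmt"

// shardFile identifies one EC shard file on disk (hypothetical type).
type shardFile struct {
	volumeID uint32
	shardID  int
}

// disk is a simplified, hypothetical stand-in for a DiskLocation.
type disk struct {
	id         int
	ecxVolumes []uint32           // volumes whose .ecx lives on this disk
	shardFiles []shardFile        // shard files physically present
	loaded     map[shardFile]bool // shards already registered by the per-disk pass
}

// orphanLoad says: load this shard against shardDisk, but read the
// index files from ecxDisk (the role dirIdx plays in the commit message).
type orphanLoad struct {
	shard     shardFile
	shardDisk int
	ecxDisk   int
}

// findOrphanShards runs after every disk has finished its own pass.
func findOrphanShards(disks []disk) []orphanLoad {
	// Step 1: record which disk owns each volume's .ecx (first match wins).
	owners := make(map[uint32]int)
	for _, d := range disks {
		for _, vid := range d.ecxVolumes {
			if _, seen := owners[vid]; !seen {
				owners[vid] = d.id
			}
		}
	}

	// Step 2: collect shards that are on disk but not loaded anywhere.
	var orphans []orphanLoad
	for _, d := range disks {
		for _, sf := range d.shardFiles {
			if d.loaded[sf] {
				continue // already handled by the per-disk pass
			}
			owner, ok := owners[sf.volumeID]
			if !ok {
				continue // no .ecx on any disk, nothing to load against
			}
			orphans = append(orphans, orphanLoad{shard: sf, shardDisk: d.id, ecxDisk: owner})
		}
	}
	return orphans
}

func main() {
	disks := []disk{
		{id: 0, ecxVolumes: []uint32{7},
			shardFiles: []shardFile{{7, 0}, {7, 1}},
			loaded:     map[shardFile]bool{{7, 0}: true, {7, 1}: true}},
		{id: 1, shardFiles: []shardFile{{7, 2}, {7, 3}}},
	}
	for _, o := range findOrphanShards(disks) {
		fmt.Printf("volume %d shard %d: register on disk %d, index from disk %d\n",
			o.shard.volumeID, o.shard.shardID, o.shardDisk, o.ecxDisk)
	}
}
```

Keeping `shardDisk` and `ecxDisk` separate mirrors the point made above: the shard stays registered on its own disk so the reported DiskId is right, while the index is read from the sibling.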

Also fall back to dirIdx for .vif lookup when dir != dirIdx, so the
reconciliation path doesn't write a stub .vif on the shard disk and
lose the real EC config and datFileSize.

* fix(volume_server): track actual .ecx dir in cross-disk reconcile

indexEcxOwners scans both IdxDirectory and Directory to find each
volume's .ecx — the second scan covers the legacy case where index
files were written into the data dir before -dir.idx was configured
(removeEcVolumeFiles already accounts for this in disk_location_ec.go).
But the returned map dropped which directory matched, and reconcile
unconditionally passed owner.IdxDirectory to loadEcShardsWithIdxDir.

When the owner's .ecx is in Directory and IdxDirectory != Directory
(server later re-configured with -dir.idx pointing at a fresh path),
NewEcVolume tries IdxDirectory/.ecx and gets ENOENT, then retries the
same-disk fallback at dataBaseFileName+.ecx. But dataBaseFileName uses
the *orphan* disk's data dir, not the owner's, so that lookup also
fails with ENOENT and the orphan shards stay unloaded.

Track which scan dir matched in indexEcxOwners and pass it through.
Adds TestLoadEcShardsWhenOwnerEcxIsInDataDir as the regression.

Reported in PR #9244 review by @gemini-code-assist and @coderabbitai.

* refactor(storage): thread dataShardCount as a parameter into calculateExpectedShardSize

The helper used erasure_coding.DataShardsCount directly, but tests in
store_ec_orphan_shard_test.go save .vif with a local dataShards=10
constant. If the package default ever diverged from 10 (e.g. an
enterprise build), the test would write a .vif for one layout while
sizing shard files for another and silently break.

Take dataShardCount as a parameter. Existing callers
(validateEcVolume + size-validation tests + real-world tests) pass
erasure_coding.DataShardsCount unchanged. The orphan-shard tests pass
the same dataShards local they save into .vif, so the persisted shape
and the on-disk shape stay consistent.
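
To illustrate why the shard count has to be a parameter, here is a rough sketch of a shard-size calculation. The block sizes (1 GiB large, 1 MiB small) and the row-based formula are assumptions made for this example and are not taken from the commit; `expectedShardSize` is a hypothetical name.

```go
package main

import "fmt"

const (
	largeBlock int64 = 1 << 30 // assumed large block size: 1 GiB
	smallBlock int64 = 1 << 20 // assumed small block size: 1 MiB
)

// expectedShardSize estimates the size of one shard file for a .dat file
// of datFileSize bytes striped across dataShardCount data shards: full
// large rows first, then the remainder rounded up to whole small rows.
func expectedShardSize(datFileSize int64, dataShardCount int) int64 {
	if datFileSize <= 0 || dataShardCount <= 0 {
		return 0
	}
	largeRow := largeBlock * int64(dataShardCount)
	smallRow := smallBlock * int64(dataShardCount)

	nLarge := datFileSize / largeRow
	remaining := datFileSize - nLarge*largeRow
	nSmall := (remaining + smallRow - 1) / smallRow // ceiling division

	return nLarge*largeBlock + nSmall*smallBlock
}

func main() {
	datSize := int64(25) << 20 // 25 MiB
	// The same .dat size yields different shard sizes for different layouts,
	// so the count used for sizing must match the one saved in .vif.
	fmt.Println(expectedShardSize(datSize, 10))
	fmt.Println(expectedShardSize(datSize, 5))
}
```

With these assumed block sizes, a 25 MiB .dat needs three small rows at 10 data shards but five at 5 data shards, so a test that saves one count and sizes files with another would produce mismatched shards.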

Reported in PR #9244 review by @coderabbitai.
2026-04-27 16:01:10 -07:00