Chris Lu 2c1482f7a6 fix(ec): clear cross-server stale EC shards before re-distribute (#9478) (#9499)
* fix(ec): clear cross-server stale EC shards before re-distribute (#9478)

A previous failed encode leaves partial .ec?? shards mounted on
destination volume servers that are not the .dat owner. PR #9480 only
prunes when the .dat sits on a sibling disk of the SAME store, so the
cross-server case stays stuck: every retry trips
volume_grpc_copy.go:570's "ec volume %d is mounted; refusing overwrite"
guard and the scheduler loops.

Detection already lists existing EC shards as CleanupECShards sources;
plumb the shard ids through (ActiveTopology.GetECShardLocations,
TaskSourceSpec, TaskSource.shard_ids) and have the EC worker call
VolumeEcShardsUnmount + VolumeEcShardsDelete on each destination after
the local shard set is generated and before distributeEcShards. Skip
EC-shard sources in getReplicas so the post-encode VolumeDelete step
does not target destination-only nodes.
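
The cleanup step described above can be sketched as follows. This is a
minimal illustration, not the worker's actual code: ecShardCleaner,
staleShardSource, and recordingCleaner are hypothetical stand-ins for the
volume server gRPC client and the task source type, and the function
signature is assumed. Only the ordering (unmount, then delete, per
destination, before distribution) comes from the commit message.

```go
package main

import "fmt"

// ecShardCleaner is a HYPOTHETICAL abstraction over the two volume server
// RPCs named in the commit message (VolumeEcShardsUnmount and
// VolumeEcShardsDelete). The real client API is not shown here.
type ecShardCleaner interface {
	Unmount(node string, volumeID uint32, shardIDs []uint32) error
	Delete(node string, volumeID uint32, collection string, shardIDs []uint32) error
}

// staleShardSource is a HYPOTHETICAL view of a cleanup source: a destination
// server plus the shard ids that detection found on it.
type staleShardSource struct {
	Node     string
	ShardIDs []uint32
}

// cleanupStaleEcShards unmounts and then deletes the listed shards on every
// destination. It is meant to run after the local shard set is generated and
// before the shards are distributed.
func cleanupStaleEcShards(c ecShardCleaner, volumeID uint32, collection string, sources []staleShardSource) error {
	for _, src := range sources {
		if len(src.ShardIDs) == 0 {
			continue // nothing known to be stale on this node
		}
		// Unmount first so the subsequent delete does not hit a mounted shard.
		if err := c.Unmount(src.Node, volumeID, src.ShardIDs); err != nil {
			return fmt.Errorf("unmount stale ec shards on %s: %w", src.Node, err)
		}
		if err := c.Delete(src.Node, volumeID, collection, src.ShardIDs); err != nil {
			return fmt.Errorf("delete stale ec shards on %s: %w", src.Node, err)
		}
	}
	return nil
}

// recordingCleaner is an in-memory fake used only to illustrate call order.
type recordingCleaner struct{ calls []string }

func (r *recordingCleaner) Unmount(node string, _ uint32, _ []uint32) error {
	r.calls = append(r.calls, "unmount "+node)
	return nil
}

func (r *recordingCleaner) Delete(node string, _ uint32, _ string, _ []uint32) error {
	r.calls = append(r.calls, "delete "+node)
	return nil
}
```

With the fake, a source carrying shard ids on one node yields one unmount
followed by one delete on that node; a source with no shard ids is skipped.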

Integration test mounts a partial shard subset, asserts the
mounted-volume refusal, runs cleanupStaleEcShards, and asserts the
next ReceiveFile lands.

* chore(ec): tighten code comments in stale-shard cleanup

Drop issue-number refs from code comments and shorten the docstrings
on cleanupStaleEcShards / unmountAndDeleteEcShards / getReplicas plus
the new test file. Behavior unchanged.

* fix(ec): skip empty-ShardIds locations; dedupe getReplicas by node

GetECShardLocations now drops entries where ecShardMatchesCollection saw a
phantom info record with EcIndexBits=0. Without ShardIds, getReplicas
misread the resulting source as a regular replica and would have called
VolumeDelete on a destination-only node.

getReplicas now dedupes by Node since VolumeDelete is server-wide;
per-disk source rows on the same server collapse to one call.
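
A minimal sketch of the filtering and dedupe described above, under
assumptions: replicaSource and replicaNodes are hypothetical names, and the
real getReplicas works on the task's own source type. The two rules shown
(skip sources that carry shard ids, collapse per-disk rows to one entry per
node) are the ones the commit message states.

```go
package main

// replicaSource is a HYPOTHETICAL stand-in for a task source row. A row with
// ShardIDs set describes existing EC shards; a row without them describes a
// regular volume replica.
type replicaSource struct {
	Node     string
	DiskID   uint32
	ShardIDs []uint32
}

// replicaNodes returns the servers that should receive a volume delete after
// encoding: one entry per node, ignoring EC-shard sources.
func replicaNodes(sources []replicaSource) []string {
	seen := make(map[string]struct{}, len(sources))
	var nodes []string
	for _, s := range sources {
		if len(s.ShardIDs) > 0 {
			continue // EC-shard source: a destination, not a replica holder
		}
		if _, dup := seen[s.Node]; dup {
			continue // another disk on a server already listed
		}
		seen[s.Node] = struct{}{}
		nodes = append(nodes, s.Node)
	}
	return nodes
}
```

Because the delete is server-wide, keying the set on Node alone is enough;
DiskID is carried only to show that several rows may share a server.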

* refactor(ec): use MaxShardCount and ShardBits in collectShardIdsForDisk

Replace the literal 32 bit-iteration bound with
erasure_coding.MaxShardCount and treat the EcIndexBits union as a
ShardBits so Count() drives the slice preallocation. Keeps the helper
aligned with the rest of the EC code and survives any future expansion
of the shard-count ceiling.
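
The shape of the refactored helper might look like the sketch below. It is
self-contained for illustration: maxShardCount and shardBits are local
stand-ins for erasure_coding.MaxShardCount and ShardBits, the value 32 is
assumed only because it matches the literal being replaced, and
collectShardIDs is a hypothetical name and signature.

```go
package main

import "math/bits"

// maxShardCount is a local stand-in for erasure_coding.MaxShardCount.
// ASSUMPTION: 32, matching the literal bound the refactor replaces.
const maxShardCount = 32

// shardBits is a local stand-in for the ShardBits bitmap type: bit i set
// means shard i is present.
type shardBits uint32

// Count returns the number of shards present in the bitmap.
func (b shardBits) Count() int { return bits.OnesCount32(uint32(b)) }

// collectShardIDs unions the per-record index bits for one disk and returns
// the shard ids in ascending order.
func collectShardIDs(ecIndexBits []uint32) []uint32 {
	var union shardBits
	for _, b := range ecIndexBits {
		union |= shardBits(b)
	}
	// Count() sizes the slice exactly, so append never reallocates.
	ids := make([]uint32, 0, union.Count())
	for i := uint32(0); i < maxShardCount; i++ {
		if union&(1<<i) != 0 {
			ids = append(ids, i)
		}
	}
	return ids
}
```

Overlapping bitmaps are merged before iteration, so a shard reported by two
records on the same disk appears once in the result.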
2026-05-14 11:57:45 -07:00