Files
seaweedfs/test/plugin_workers
Chris Lu 940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this as the needle extends past the .dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
..