seaweedfs/weed/storage/erasure_coding/ec_decoder_test.go
Chris Lu 300e906330 admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes

The admin bucket size fix (#9058) left object counts at zero for
EC-encoded data because VolumeEcShardInformationMessage carried no file
count. Billing/monitoring dashboards therefore still under-report
objects once a bucket is EC-encoded.

Thread file_count and delete_count end-to-end:

- Add file_count/delete_count to VolumeEcShardInformationMessage (proto
  fields 8 and 9) and regenerate master_pb.
- Compute them lazily on volume servers by walking the .ecx index once
  per EcVolume, cache on the struct, and keep the cache in sync inside
  DeleteNeedleFromEcx (distinguishing live vs already-tombstoned
  entries so idempotent deletes do not drift the counts).
- Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage
  and carry them through the master-side EcVolumeInfo / topology sync.
- Aggregate in admin collectCollectionStats, deduping per volume id:
  every node holding shards of an EC volume reports the same counts, so
  summing across nodes would otherwise multiply the object count by the
  number of shard holders.
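The dedup in collectCollectionStats can be sketched as follows (struct and function names here are hypothetical stand-ins, not the actual admin code):

```go
package main

import "fmt"

// ecShardReport stands in for the per-node fields of
// VolumeEcShardInformationMessage relevant to this aggregation.
type ecShardReport struct {
	volumeID    uint32
	fileCount   uint64
	deleteCount uint64
}

// aggregateEcCounts dedupes per volume id: every node holding shards of
// an EC volume reports the same counts, so each volume is counted once
// instead of once per shard holder.
func aggregateEcCounts(reports []ecShardReport) (files, deletes uint64) {
	seen := make(map[uint32]bool)
	for _, r := range reports {
		if seen[r.volumeID] {
			continue
		}
		seen[r.volumeID] = true
		files += r.fileCount
		deletes += r.deleteCount
	}
	return
}

func main() {
	// Three nodes hold shards of volume 7; summing naively would
	// triple its counts.
	reports := []ecShardReport{
		{7, 100, 4}, {7, 100, 4}, {7, 100, 4},
		{9, 50, 1},
	}
	f, d := aggregateEcCounts(reports)
	fmt.Println(f, d) // 150 5
}
```

Later commits in this log refine the delete-count side of this aggregation; the dedup-by-volume-id shape for FileCount stays the same throughout.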

Regression tests cover the initial .ecx walk, live/tombstoned delete
bookkeeping (including idempotent and missing-key cases), and the admin
dedup path for an EC volume reported by multiple nodes.

* ec: include .ecj journal in EcVolume delete count

The initial delete count only reflected .ecx tombstones, missing any
needle that was journaled in .ecj but not yet folded into .ecx — e.g.
on partial recovery. Expand initCountsLocked to take the union of
.ecx tombstones and .ecj journal entries, deduped by needle id, so:

  - an id that is both tombstoned in .ecx and listed in .ecj counts once
  - a duplicate .ecj entry counts once
  - an .ecj id with a live .ecx entry is counted as deleted (not live)
  - an .ecj id with no matching .ecx entry is still counted

Covered by TestEcVolumeFileAndDeleteCountEcjUnion.
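A minimal sketch of the union semantics, with plain id slices standing in for the parsed .ecx tombstones and .ecj journal (names hypothetical):

```go
package main

import "fmt"

// deletedCount takes the union of .ecx tombstone ids and .ecj journal
// ids, deduped by needle id, matching the four bullet cases above.
func deletedCount(ecxTombstones, ecjIDs []uint64) int {
	deleted := make(map[uint64]bool)
	for _, id := range ecxTombstones {
		deleted[id] = true
	}
	for _, id := range ecjIDs {
		// duplicates and already-tombstoned ids collapse into the set
		deleted[id] = true
	}
	return len(deleted)
}

func main() {
	// id 1: tombstoned in .ecx and listed in .ecj -> counts once
	// id 2: duplicated in .ecj -> counts once
	// id 3: only in .ecj -> still counted
	fmt.Println(deletedCount([]uint64{1}, []uint64{1, 2, 2, 3})) // 3
}
```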

* ec: report delete count authoritatively and tombstone once per delete

Address two issues with the previous EcVolume file/delete count work:

1. The delete count was computed lazily on first heartbeat and mixed
   in a .ecj-union fallback to "recover" partial state. That diverged
   from how regular volumes report counts (always live from the needle
   map) and could drift when .ecj was reconciled. Replace with an
   eager walk of .ecx at NewEcVolume time, maintained incrementally on
   every DeleteNeedleFromEcx call. Semantics now match needle_map_metric:
   FileCount is the total number of needles ever recorded in .ecx
   (live + tombstoned), DeleteCount is the tombstones — so live =
   FileCount - DeleteCount. Drop the .ecj-union logic entirely.

2. A single EC needle delete fanned out to every node holding a replica
   of the primary data shard and called DeleteNeedleFromEcx on each,
   which inflated the per-volume delete total by the replica factor.
   Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in
   order and stop at the first success (one tombstone per delete), and
   only fall back to other shards when the primary shard has no home
   (ErrEcShardMissing sentinel), not on transient RPC errors.

Admin aggregation now folds EC counts correctly: FileCount is deduped
per volume id (every shard holder has an identical .ecx) and DeleteCount
is summed across nodes (each delete tombstones exactly one node). Live
object count = deduped FileCount - summed DeleteCount.
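Folding the counts per the rules above can be sketched as (hypothetical helper, not the admin code itself):

```go
package main

import "fmt"

// liveObjectCount folds EC counts for one volume: FileCount is taken
// once (every shard holder has an identical .ecx), DeleteCount is
// summed across nodes (each delete tombstones exactly one node).
func liveObjectCount(dedupedFileCount uint64, perNodeDeleteCounts []uint64) uint64 {
	var deleted uint64
	for _, d := range perNodeDeleteCounts {
		deleted += d
	}
	return dedupedFileCount - deleted
}

func main() {
	// FileCount=100 applied once; per-node delete counts 5 + 3 + 2 summed.
	fmt.Println(liveObjectCount(100, []uint64{5, 3, 2})) // 90
}
```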

Tests updated to match the new semantics:
  - EC volume counts seed FileCount as total .ecx entries (live +
    tombstoned), DeleteCount as tombstones.
  - DeleteNeedleFromEcx keeps FileCount constant and increments
    DeleteCount only on live->tombstone transitions.
  - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2)
    to prove they're summed, while FileCount=100 is applied once.

* ec: test fixture uses real vid; admin warns on skewed ec counts

- writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the
  actual vid passed in, instead of hardcoding "_1". The existing tests
  all use vid=1 so behaviour is unchanged, but the helper no longer
  silently diverges from its documented parameter.
- collectCollectionStats logs a glog warning when an EC volume's summed
  delete count exceeds its deduped file count, surfacing the anomaly
  (stale heartbeat, counter drift, etc.) instead of silently dropping
  the volume from the object count.

* ec: derive file/delete counts from .ecx/.ecj file sizes

seedCountsFromEcx walked the full .ecx index at volume load, which is
wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj
has fixed-size deletion records (NeedleIdSize), so both counts are pure
file-size arithmetic.

  fileCount   = ecxFileSize / NeedleMapEntrySize
  deleteCount = ecjFileSize / NeedleIdSize

Rip out the cached counters, countsLock, seedCountsFromEcx, and the
recordDelete helper. Track ecjFileSize directly on the EcVolume struct,
seed it from Stat() at load, and bump it on every successful .ecj append
inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write
entirely when the needle is already tombstoned so the derived delete
count stays idempotent on repeat deletes. Heartbeats now compute counts
in O(1).
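The arithmetic reduces to two divisions. A sketch, assuming the default build's entry widths (NeedleIdSize 8, OffsetSize 4, SizeSize 4, so 16-byte .ecx entries):

```go
package main

import "fmt"

// Assumed constants for the default build; the real code uses
// types.NeedleMapEntrySize and types.NeedleIdSize.
const (
	needleMapEntrySize = 16 // one .ecx index entry
	needleIdSize       = 8  // one .ecj deletion record
)

// fileAndDeleteCount derives both counts from file sizes in O(1),
// with no index walk.
func fileAndDeleteCount(ecxFileSize, ecjFileSize int64) (fileCount, deleteCount int64) {
	return ecxFileSize / needleMapEntrySize, ecjFileSize / needleIdSize
}

func main() {
	// An .ecx with 3 entries and an .ecj with 2 deletion records.
	f, d := fileAndDeleteCount(48, 16)
	fmt.Println(f, d) // 3 2
}
```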

Tests updated: the initial fixture pre-populates .ecj with two ids to
verify the file-size derivation end-to-end, and the delete test keeps
its idempotent-re-delete / missing-needle invariants (unchanged
externally, now enforced by the early return rather than a cache guard).

* ec: sync Rust volume server with Go file/delete count semantics

Mirror the Go-side EC file/delete count work in the Rust volume server
so mixed Go/Rust clusters report consistent bucket object counts in
the admin dashboard.

- Add file_count (8) and delete_count (9) to the Rust copy of
  VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto).
- EcVolume gains ecj_file_size, seeded from the journal's metadata on
  open and bumped inside journal_delete on every successful append.
- file_and_delete_count() returns counts derived in O(1) from
  ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and
  ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount.
- to_volume_ec_shard_information_messages populates the new proto
  fields instead of defaulting them to zero.
- mark_needle_deleted_in_ecx now returns a DeleteOutcome enum
  (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip
  both the .ecj append and the size bump when the needle is missing
  or already tombstoned, keeping the derived delete_count idempotent
  on repeat or no-op deletes.
- Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's
  RebuildEcxFile is only called from specific decode/rebuild gRPC
  handlers, not on volume open, and replaying on load was hiding the
  deletion journal from the new file-size-derived delete counter.
  rebuild_ecx_from_journal is kept as dead_code for future decode
  paths that may want the same replay semantics.

Also clean up the Go FileAndDeleteCount to drop unnecessary runtime
guards against zero constants — NeedleMapEntrySize and NeedleIdSize
are compile-time non-zero.

test_ec_volume_journal updated to pre-populate the .ecx with the
needles it deletes, and extended to verify that repeat and
missing-id deletes do not drift the derived counts.

* ec: document enterprise-reserved proto field range on ec shard info

Both OSS master.proto copies now note that fields 10-19 are reserved
for future upstream additions while 20+ are owned by the enterprise
fork. Enterprise already pins data_shards/parity_shards at 20/21, so
keeping OSS additions inside 8-19 avoids wire-level collisions for
mixed deployments.
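A sketch of how the OSS proto copy might document the ranges (field numbers taken from the text above; earlier fields elided):

```proto
message VolumeEcShardInformationMessage {
  // fields 1-7 elided
  uint64 file_count = 8;
  uint64 delete_count = 9;
  // Fields 10-19 are reserved for future upstream (OSS) additions.
  // Fields 20+ are owned by the enterprise fork, which already pins
  // data_shards = 20 and parity_shards = 21.
}
```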

* ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir

ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but
new() opens the actual files from ecx_actual_dir (which may fall back
to the data dir when the idx dir does not contain the index). After a
fallback, read_deleted_needles() and rebuild_ecx_from_journal() would
read/rebuild the wrong (nonexistent) path while heartbeats reported
counts from the file actually in use — silently dropping deletes.

Point idx_base_name() at ecx_actual_dir, which is initialized to
dir_idx and only diverges after a successful fallback, so every call
site agrees with the file new() has open. The pre-fallback call in
new() (line 142) still returns the dir_idx path because
ecx_actual_dir == dir_idx at that point.

Update the destroy() sweep to build the dir_idx cleanup paths
explicitly instead of leaning on the helpers, so post-fallback stale
files in the idx dir are still removed.

* ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure

Two EC delete-count correctness fixes applied symmetrically to Go and
Rust volume servers.

1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after
   recreating the empty journal, matching the on-disk truth.
   Previously the cached size still reflected the pre-rebuild journal
   and file_and_delete_count() would keep reporting stale delete
   counts. The Go side has no equivalent bug because RebuildEcxFile
   runs in an offline helper that does not touch an EcVolume struct.

2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx
   entry before writing the .ecj record. If the .ecj append then
   failed, the needle was permanently marked deleted but the
   heartbeat-reported delete_count never advanced (it is derived from
   .ecj file size), and a retry would see AlreadyDeleted and early-
   return, leaving the drift permanent.

   Both languages now capture the entry's file offset and original
   size bytes during the mark step, attempt the .ecj append, and on
   failure roll the .ecx tombstone back by writing the original size
   bytes at the known offset. A rollback that itself errors is
   logged (glog / tracing) but cannot re-sync the files — this is
   the same failure mode a double disk error would produce, and is
   unavoidable without a full on-disk transaction log.

Go: wrap MarkNeedleDeleted in a closure that captures the file
offset into an outer variable, then pass the offset + oldSize to the
new rollbackEcxTombstone helper on .ecj seek/write errors.

Rust: DeleteOutcome::Tombstoned now carries the size_offset and a
[u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete
destructures on Tombstoned and calls restore_ecx_size on .ecj append
failure.

* test(ec): widen admin /health wait to 180s for cold CI

TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and
admin in sequence, then waited only 60s for admin's HTTP server to come
up. On cold GitHub runners the tail of the earlier subprocess startups
eats most of that budget and the wait occasionally times out (last hit
on run 24374773031). The local fast path is still ~20s total, so the
bump only extends the timeout ceiling, not the happy path.

* test(ec): fork volume servers in parallel in TestEcEndToEnd

startWeed is non-blocking (just cmd.Start()), so the per-process fork +
mkdir + log-file-open overhead for 14 volume servers was serialized for
no reason. On cold CI disks that overhead stacks up and eats into the
subsequent admin /health wait, which is how run 24374773031 flaked.

Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds
with a mutex so concurrent appends are safe. startWeed still calls
t.Fatalf on failure, which is fine from a goroutine for a fatal test
abort; the fail-fast isn't something we rely on for precise ordering.

* ec: fsync ecx before ecj, truncate on failure, harden rebuild

Four correctness fixes covering both volume servers.

1. Durability ordering (Go + Rust). After marking the .ecx tombstone
   we now fsync .ecx before touching .ecj, so a crash between the two
   files cannot leave the journal with an entry for a needle whose
   tombstone is still sitting in page cache. Once the fsync returns,
   the tombstone is the source of truth: reads see "deleted",
   delete_count may under-count by one (benign, idempotent retries)
   but never over-reports. If the fsync itself fails we restore the
   original size bytes and surface the error. The .ecj append is then
   followed by its own Sync so the reported delete_count matches the
   on-disk journal once the write returns.

2. .ecj truncation on append failure. write_all may have extended the
   journal on disk before sync_all / Sync errors out, leaving the
   cached ecj_file_size out of sync with the physical length and
   drifting delete_count permanently after restart. Both languages
   now capture the pre-append size, truncate the file back via
   set_len / Truncate on any write or sync failure, and only then
   restore the .ecx tombstone. Truncation errors are logged — same-fd
   length resets cannot realistically fail — but cannot themselves
   re-sync the files.

3. Atomic rebuild_ecx_from_journal (Rust, dead code today but wired
   up on any future decode path). Previously a failed
   mark_needle_deleted_in_ecx call was swallowed with `let _ = ...`
   and the journal was still removed, silently losing tombstones.
   We now bubble up any non-NotFound error, fsync .ecx after the
   whole replay succeeds, and only then drop and recreate .ecj.
   NotFound is still ignored (expected race between delete and encode).

4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to
   return Ok(NotFound) when self.ecx_file was None, hiding a closed or
   corrupt volume behind what looks like an idempotent no-op. It now
   returns an io::Error carrying the volume id so callers (e.g.
   journal_delete) fail loudly instead.

Existing Go and Rust EC test suites stay green.

* ec: make .ecx immutable at runtime; track deletes in memory + .ecj

Refactors both volume servers so the sealed sorted .ecx index is never
mutated during normal operation. Runtime deletes are committed to the
.ecj deletion journal and tracked in an in-memory deleted-needle set;
read-path lookups consult that set to mask out deleted ids on top of
the immutable .ecx record. Mirrors the intended design on both Go and
Rust sides.

EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from
.ecj in NewEcVolume / EcVolume::new. DeleteNeedleFromEcx /
journal_delete:

  1. Looks the needle up read-only in .ecx.
  2. Missing needle -> no-op.
  3. Pre-existing .ecx tombstone (from a prior decode/rebuild) ->
     mirror into the in-memory set, no .ecj append.
  4. Otherwise append the id to .ecj, fsync, and only then publish
     the id into the set. A partial write is truncated back to the
     pre-append length so the on-disk journal and the in-memory set
     cannot drift.

FindNeedleFromEcx / find_needle_from_ecx now return
TombstoneFileSize when the id is in the in-memory set, even though
the bytes on disk still show the original size.

FileAndDeleteCount:
  fileCount   = .ecx size / NeedleMapEntrySize (unchanged)
  deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize)
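The masking and counts can be sketched with maps standing in for the sealed index and the deleted set (all names hypothetical):

```go
package main

import "fmt"

const tombstoneFileSize = uint32(0xFFFFFFFF) // assumed sentinel value

// ecVolume sketches the immutable-.ecx design: lookups go through the
// sealed index (a map here) while the in-memory set masks deleted ids.
type ecVolume struct {
	ecxSizes       map[uint64]uint32 // stand-in for the sealed .ecx
	deletedNeedles map[uint64]bool   // seeded from .ecj on open
}

func (v *ecVolume) findNeedle(id uint64) (uint32, bool) {
	size, ok := v.ecxSizes[id]
	if !ok {
		return 0, false
	}
	if v.deletedNeedles[id] {
		// Masked: the .ecx bytes on disk still show the original size.
		return tombstoneFileSize, true
	}
	return size, true
}

func (v *ecVolume) fileAndDeleteCount() (int, int) {
	return len(v.ecxSizes), len(v.deletedNeedles)
}

func main() {
	v := &ecVolume{
		ecxSizes:       map[uint64]uint32{1: 100, 2: 200},
		deletedNeedles: map[uint64]bool{2: true},
	}
	size, _ := v.findNeedle(2)
	fc, dc := v.fileAndDeleteCount()
	fmt.Println(size == tombstoneFileSize, fc, dc) // true 2 1
}
```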

The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers
still fold .ecj into .ecx — that is the one place tombstones land in
the physical index, and it runs offline on closed files. Rust's
rebuild helper now also clears the in-memory set when it succeeds.

Dead code removed on the Rust side: `DeleteOutcome`,
`mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the
runtime `rollbackEcxTombstone` path. Neither helper was needed once
.ecx stopped being a runtime mutation target.

TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten
as TestEcVolumeDeleteDurableToJournal, which exercises the full
durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks
via the in-memory set -> raw .ecx bytes are *unchanged* -> Close +
RebuildEcxFile folds the journal into .ecx -> raw bytes now show
the tombstone, as CopyFile in the decode path expects.
2026-04-13 21:10:36 -07:00

580 lines
19 KiB
Go

package erasure_coding_test

import (
	"os"
	"path/filepath"
	"testing"

	erasure_coding "github.com/seaweedfs/seaweedfs/weed/storage/erasure_coding"
	"github.com/seaweedfs/seaweedfs/weed/storage/needle"
	"github.com/seaweedfs/seaweedfs/weed/storage/types"
)
func TestHasLiveNeedles_AllDeletedIsFalse(t *testing.T) {
	dir := t.TempDir()
	collection := "foo"
	base := filepath.Join(dir, collection+"_1")
	// Build an ecx file with only deleted entries.
	// ecx file entries are the same format as .idx entries.
	ecx := makeNeedleMapEntry(types.NeedleId(1), types.Offset{}, types.TombstoneFileSize)
	if err := os.WriteFile(base+".ecx", ecx, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if hasLive {
		t.Fatalf("expected no live entries")
	}
}

func TestHasLiveNeedles_WithLiveEntryIsTrue(t *testing.T) {
	dir := t.TempDir()
	collection := "foo"
	base := filepath.Join(dir, collection+"_1")
	// Build an ecx file containing at least one live entry.
	// ecx file entries are the same format as .idx entries.
	live := makeNeedleMapEntry(types.NeedleId(1), types.Offset{}, types.Size(1))
	if err := os.WriteFile(base+".ecx", live, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if !hasLive {
		t.Fatalf("expected live entries")
	}
}

func TestHasLiveNeedles_EmptyFileIsFalse(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "foo_1")
	// Create an empty ecx file.
	if err := os.WriteFile(base+".ecx", []byte{}, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if hasLive {
		t.Fatalf("expected no live entries for empty file")
	}
}

func makeNeedleMapEntry(key types.NeedleId, offset types.Offset, size types.Size) []byte {
	b := make([]byte, types.NeedleIdSize+types.OffsetSize+types.SizeSize)
	types.NeedleIdToBytes(b[0:types.NeedleIdSize], key)
	types.OffsetToBytes(b[types.NeedleIdSize:types.NeedleIdSize+types.OffsetSize], offset)
	types.SizeToBytes(b[types.NeedleIdSize+types.OffsetSize:types.NeedleIdSize+types.OffsetSize+types.SizeSize], size)
	return b
}
// TestWriteIdxFileFromEcIndex_PreservesDeletedNeedles verifies that WriteIdxFileFromEcIndex
// correctly marks deleted needles in the generated .idx file.
// This tests the fix for issue #7751 where deleted files in encoded volumes
// were not properly marked as deleted when decoded.
func TestWriteIdxFileFromEcIndex_PreservesDeletedNeedles(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "foo_1")
	// Create an .ecx file with one live needle and one deleted needle
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	needle2 := makeNeedleMapEntry(types.NeedleId(2), types.ToOffset(128), types.TombstoneFileSize) // deleted
	ecxData := append(needle1, needle2...)
	if err := os.WriteFile(base+".ecx", ecxData, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// Generate .idx from .ecx
	if err := erasure_coding.WriteIdxFileFromEcIndex(base); err != nil {
		t.Fatalf("WriteIdxFileFromEcIndex: %v", err)
	}
	// Verify .idx file has the same content
	idxData, err := os.ReadFile(base + ".idx")
	if err != nil {
		t.Fatalf("read idx: %v", err)
	}
	if len(idxData) != len(ecxData) {
		t.Fatalf("idx file size mismatch: got %d, want %d", len(idxData), len(ecxData))
	}
	// Verify the second needle is still marked as deleted
	entrySize := types.NeedleIdSize + types.OffsetSize + types.SizeSize
	entry2 := idxData[entrySize : entrySize*2]
	size2 := types.BytesToSize(entry2[types.NeedleIdSize+types.OffsetSize:])
	if !size2.IsDeleted() {
		t.Fatalf("expected needle 2 to be marked as deleted, got size: %d", size2)
	}
}

// TestWriteIdxFileFromEcIndex_ProcessesEcjJournal verifies that WriteIdxFileFromEcIndex
// correctly processes deletions from the .ecj journal file.
func TestWriteIdxFileFromEcIndex_ProcessesEcjJournal(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "foo_1")
	// Create an .ecx file with two live needles
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	needle2 := makeNeedleMapEntry(types.NeedleId(2), types.ToOffset(128), types.Size(200))
	ecxData := append(needle1, needle2...)
	if err := os.WriteFile(base+".ecx", ecxData, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// Create an .ecj file that records needle 2 as deleted
	ecjData := make([]byte, types.NeedleIdSize)
	types.NeedleIdToBytes(ecjData, types.NeedleId(2))
	if err := os.WriteFile(base+".ecj", ecjData, 0644); err != nil {
		t.Fatalf("write ecj: %v", err)
	}
	// Generate .idx from .ecx and .ecj
	if err := erasure_coding.WriteIdxFileFromEcIndex(base); err != nil {
		t.Fatalf("WriteIdxFileFromEcIndex: %v", err)
	}
	// Verify .idx file has 3 entries: 2 from .ecx + 1 deletion from .ecj
	idxData, err := os.ReadFile(base + ".idx")
	if err != nil {
		t.Fatalf("read idx: %v", err)
	}
	entrySize := types.NeedleIdSize + types.OffsetSize + types.SizeSize
	expectedSize := entrySize * 3 // 2 from ecx + 1 deletion append from ecj
	if len(idxData) != expectedSize {
		t.Fatalf("idx file size mismatch: got %d, want %d", len(idxData), expectedSize)
	}
	// The third entry should be the deletion record for needle 2
	entry3 := idxData[entrySize*2 : entrySize*3]
	key3 := types.BytesToNeedleId(entry3[0:types.NeedleIdSize])
	size3 := types.BytesToSize(entry3[types.NeedleIdSize+types.OffsetSize:])
	if key3 != types.NeedleId(2) {
		t.Fatalf("expected needle id 2 in deletion record, got: %d", key3)
	}
	if !size3.IsDeleted() {
		t.Fatalf("expected deletion record to have tombstone size, got: %d", size3)
	}
}
// TestDecodeWithNonEmptyEcj_AllDeleted verifies the full decode pre-processing
// when .ecj contains deletions for ALL live entries in .ecx.
// After RebuildEcxFile merges .ecj into .ecx, HasLiveNeedles must return false
// and WriteIdxFileFromEcIndex must produce an .idx where every entry is tombstoned.
func TestDecodeWithNonEmptyEcj_AllDeleted(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "test_1")
	// .ecx: two live entries
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	needle2 := makeNeedleMapEntry(types.NeedleId(2), types.ToOffset(128), types.Size(200))
	ecxData := append(needle1, needle2...)
	if err := os.WriteFile(base+".ecx", ecxData, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// .ecj: both needles deleted
	ecjData := make([]byte, 2*types.NeedleIdSize)
	types.NeedleIdToBytes(ecjData[0:types.NeedleIdSize], types.NeedleId(1))
	types.NeedleIdToBytes(ecjData[types.NeedleIdSize:], types.NeedleId(2))
	if err := os.WriteFile(base+".ecj", ecjData, 0644); err != nil {
		t.Fatalf("write ecj: %v", err)
	}
	// Before rebuild, ecx entries look live
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles before rebuild: %v", err)
	}
	if !hasLive {
		t.Fatal("expected live entries before rebuild")
	}
	// Simulate what VolumeEcShardsToVolume now does: merge .ecj into .ecx
	if err := erasure_coding.RebuildEcxFile(base); err != nil {
		t.Fatalf("RebuildEcxFile: %v", err)
	}
	// .ecj should be removed after rebuild
	if _, err := os.Stat(base + ".ecj"); !os.IsNotExist(err) {
		t.Fatal("expected .ecj to be removed after RebuildEcxFile")
	}
	// After rebuild, HasLiveNeedles must return false
	hasLive, err = erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles after rebuild: %v", err)
	}
	if hasLive {
		t.Fatal("expected no live entries after rebuild merged all deletions")
	}
	// WriteIdxFileFromEcIndex should still work (no .ecj to process)
	if err := erasure_coding.WriteIdxFileFromEcIndex(base); err != nil {
		t.Fatalf("WriteIdxFileFromEcIndex: %v", err)
	}
	idxData, err := os.ReadFile(base + ".idx")
	if err != nil {
		t.Fatalf("read idx: %v", err)
	}
	// .idx should have exactly 2 entries (copied from .ecx, both now tombstoned)
	entrySize := types.NeedleIdSize + types.OffsetSize + types.SizeSize
	if len(idxData) != 2*entrySize {
		t.Fatalf("idx file size: got %d, want %d", len(idxData), 2*entrySize)
	}
	// Both entries must be tombstoned
	for i := 0; i < 2; i++ {
		entry := idxData[i*entrySize : (i+1)*entrySize]
		size := types.BytesToSize(entry[types.NeedleIdSize+types.OffsetSize:])
		if !size.IsDeleted() {
			t.Fatalf("entry %d: expected tombstone, got size %d", i+1, size)
		}
	}
}

// TestDecodeWithNonEmptyEcj_PartiallyDeleted verifies decode pre-processing
// when .ecj deletes only some entries. After RebuildEcxFile, HasLiveNeedles
// must still return true for the surviving entries, and WriteIdxFileFromEcIndex
// must produce an .idx that correctly distinguishes live from deleted needles.
func TestDecodeWithNonEmptyEcj_PartiallyDeleted(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "test_1")
	// .ecx: three live entries
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	needle2 := makeNeedleMapEntry(types.NeedleId(2), types.ToOffset(128), types.Size(200))
	needle3 := makeNeedleMapEntry(types.NeedleId(3), types.ToOffset(256), types.Size(300))
	ecxData := append(append(needle1, needle2...), needle3...)
	if err := os.WriteFile(base+".ecx", ecxData, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// .ecj: only needle 2 is deleted
	ecjData := make([]byte, types.NeedleIdSize)
	types.NeedleIdToBytes(ecjData, types.NeedleId(2))
	if err := os.WriteFile(base+".ecj", ecjData, 0644); err != nil {
		t.Fatalf("write ecj: %v", err)
	}
	// Merge .ecj into .ecx
	if err := erasure_coding.RebuildEcxFile(base); err != nil {
		t.Fatalf("RebuildEcxFile: %v", err)
	}
	// HasLiveNeedles must still return true (needles 1 and 3 survive)
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if !hasLive {
		t.Fatal("expected live entries after partial deletion")
	}
	// WriteIdxFileFromEcIndex
	if err := erasure_coding.WriteIdxFileFromEcIndex(base); err != nil {
		t.Fatalf("WriteIdxFileFromEcIndex: %v", err)
	}
	idxData, err := os.ReadFile(base + ".idx")
	if err != nil {
		t.Fatalf("read idx: %v", err)
	}
	entrySize := types.NeedleIdSize + types.OffsetSize + types.SizeSize
	if len(idxData) != 3*entrySize {
		t.Fatalf("idx file size: got %d, want %d", len(idxData), 3*entrySize)
	}
	// Verify each entry
	for i := 0; i < 3; i++ {
		entry := idxData[i*entrySize : (i+1)*entrySize]
		key := types.BytesToNeedleId(entry[0:types.NeedleIdSize])
		size := types.BytesToSize(entry[types.NeedleIdSize+types.OffsetSize:])
		switch key {
		case types.NeedleId(1), types.NeedleId(3):
			if size.IsDeleted() {
				t.Fatalf("needle %d: should be live, got tombstone", key)
			}
		case types.NeedleId(2):
			if !size.IsDeleted() {
				t.Fatalf("needle %d: should be tombstoned, got size %d", key, size)
			}
		default:
			t.Fatalf("unexpected needle id %d", key)
		}
	}
}
// TestDecodeWithEmptyEcj verifies that the decode flow is a no-op when
// .ecj exists but is empty (no deletions recorded).
func TestDecodeWithEmptyEcj(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "test_1")
	// .ecx: one live entry
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	if err := os.WriteFile(base+".ecx", needle1, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// .ecj: empty
	if err := os.WriteFile(base+".ecj", []byte{}, 0644); err != nil {
		t.Fatalf("write ecj: %v", err)
	}
	// RebuildEcxFile with empty .ecj should not change anything
	if err := erasure_coding.RebuildEcxFile(base); err != nil {
		t.Fatalf("RebuildEcxFile: %v", err)
	}
	// HasLiveNeedles must still return true
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if !hasLive {
		t.Fatal("expected live entries with empty .ecj")
	}
}

// TestDecodeWithNoEcjFile verifies that the decode flow works when no .ecj
// file exists at all.
func TestDecodeWithNoEcjFile(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "test_1")
	// .ecx: one live entry
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	if err := os.WriteFile(base+".ecx", needle1, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// No .ecj file
	// RebuildEcxFile should be a no-op
	if err := erasure_coding.RebuildEcxFile(base); err != nil {
		t.Fatalf("RebuildEcxFile: %v", err)
	}
	hasLive, err := erasure_coding.HasLiveNeedles(base)
	if err != nil {
		t.Fatalf("HasLiveNeedles: %v", err)
	}
	if !hasLive {
		t.Fatal("expected live entries without .ecj file")
	}
}
// TestEcxFileDeletionVisibleAfterSync verifies that deletions made to .ecx
// via MarkNeedleDeleted are visible to other readers after Sync().
// This is a regression test for issue #7751.
func TestEcxFileDeletionVisibleAfterSync(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "foo_1")
	// Create an .ecx file with one live needle
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	if err := os.WriteFile(base+".ecx", needle1, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// Open the file for writing (simulating what EcVolume does)
	ecxFile, err := os.OpenFile(base+".ecx", os.O_RDWR, 0644)
	if err != nil {
		t.Fatalf("open ecx: %v", err)
	}
	// Mark needle as deleted using MarkNeedleDeleted
	err = erasure_coding.MarkNeedleDeleted(ecxFile, 0)
	if err != nil {
		ecxFile.Close()
		t.Fatalf("MarkNeedleDeleted: %v", err)
	}
	// Sync the file to ensure changes are visible to other readers
	if err := ecxFile.Sync(); err != nil {
		ecxFile.Close()
		t.Fatalf("Sync: %v", err)
	}
	ecxFile.Close()
	// Now open with a new file handle and verify deletion is visible
	data, err := os.ReadFile(base + ".ecx")
	if err != nil {
		t.Fatalf("read ecx: %v", err)
	}
	size := types.BytesToSize(data[types.NeedleIdSize+types.OffsetSize:])
	if !size.IsDeleted() {
		t.Fatalf("expected needle to be marked as deleted after sync, got size: %d", size)
	}
}
// TestEcxFileDeletionWithSeparateHandles verifies that a deletion made
// through one writer handle is visible to a separately opened reader
// handle after the writer calls Sync(), simulating the CopyFile read
// path from issue #7751.
func TestEcxFileDeletionWithSeparateHandles(t *testing.T) {
	dir := t.TempDir()
	base := filepath.Join(dir, "foo_1")
	// Create an .ecx file with one live needle
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	if err := os.WriteFile(base+".ecx", needle1, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	// Open the file for writing (writer handle)
	writerFile, err := os.OpenFile(base+".ecx", os.O_RDWR, 0644)
	if err != nil {
		t.Fatalf("open ecx for write: %v", err)
	}
	defer writerFile.Close()
	// Open the file for reading (reader handle - simulating CopyFile behavior)
	readerFile, err := os.OpenFile(base+".ecx", os.O_RDONLY, 0644)
	if err != nil {
		t.Fatalf("open ecx for read: %v", err)
	}
	defer readerFile.Close()
	// Mark needle as deleted via writer handle
	err = erasure_coding.MarkNeedleDeleted(writerFile, 0)
	if err != nil {
		t.Fatalf("MarkNeedleDeleted: %v", err)
	}
	// Sync the writer to flush changes
	if err := writerFile.Sync(); err != nil {
		t.Fatalf("Sync: %v", err)
	}
	// Read via reader handle - after sync, changes should be visible
	data := make([]byte, types.NeedleIdSize+types.OffsetSize+types.SizeSize)
	if _, err := readerFile.ReadAt(data, 0); err != nil {
		t.Fatalf("ReadAt: %v", err)
	}
	size := types.BytesToSize(data[types.NeedleIdSize+types.OffsetSize:])
	if !size.IsDeleted() {
		t.Fatalf("expected deletion to be visible after Sync(), got size: %d", size)
	}
}
// TestEcVolumeDeleteDurableToJournal tracks issue #7751: a runtime needle
// delete must be observable by the ec.decode CopyFile path. Under the
// current design .ecx is an immutable sealed index at runtime — deletes
// are journaled to .ecj and tracked in an in-memory set — so the
// durability chain decode relies on is:
//
//  1. DeleteNeedleFromEcx appends the needle id to .ecj and fsyncs it.
//  2. Runtime reads via FindNeedleFromEcx consult the in-memory set and
//     return TombstoneFileSize even though the sealed .ecx record on
//     disk still shows the original size.
//  3. ec.decode later closes the EcVolume and calls RebuildEcxFile on
//     the now-quiescent files, which walks .ecj and writes tombstones
//     into .ecx. CopyFile then reads the rebuilt .ecx.
//
// This test exercises the full chain on a tempdir fixture.
func TestEcVolumeDeleteDurableToJournal(t *testing.T) {
	dir := t.TempDir()
	collection := "test"
	vid := 1
	base := filepath.Join(dir, collection+"_1")
	// Seed .ecx with two live needles.
	needle1 := makeNeedleMapEntry(types.NeedleId(1), types.ToOffset(64), types.Size(100))
	needle2 := makeNeedleMapEntry(types.NeedleId(2), types.ToOffset(128), types.Size(200))
	ecxData := append(needle1, needle2...)
	if err := os.WriteFile(base+".ecx", ecxData, 0644); err != nil {
		t.Fatalf("write ecx: %v", err)
	}
	if err := os.WriteFile(base+".ecj", []byte{}, 0644); err != nil {
		t.Fatalf("write ecj: %v", err)
	}
	if err := os.WriteFile(base+".ec00", make([]byte, 8), 0644); err != nil {
		t.Fatalf("write ec00: %v", err)
	}
	if err := os.WriteFile(base+".vif", []byte{}, 0644); err != nil {
		t.Fatalf("write vif: %v", err)
	}
	ecVolume, err := erasure_coding.NewEcVolume("hdd", dir, dir, collection, needle.VolumeId(vid))
	if err != nil {
		t.Fatalf("NewEcVolume: %v", err)
	}
	// Runtime delete must not mutate .ecx.
	if err := ecVolume.DeleteNeedleFromEcx(types.NeedleId(2)); err != nil {
		t.Fatalf("DeleteNeedleFromEcx: %v", err)
	}
	// FindNeedleFromEcx masks the id via the in-memory set.
	_, size, err := ecVolume.FindNeedleFromEcx(types.NeedleId(2))
	if err != nil {
		t.Fatalf("FindNeedleFromEcx(2): %v", err)
	}
	if !size.IsDeleted() {
		t.Fatalf("expected FindNeedleFromEcx to return tombstone for deleted needle, got size=%d", size)
	}
	// Direct .ecx reader should still see the original size — .ecx is
	// immutable at runtime.
	entrySize := int64(types.NeedleIdSize + types.OffsetSize + types.SizeSize)
	rawBuf := make([]byte, entrySize)
	rawReader, err := os.Open(base + ".ecx")
	if err != nil {
		t.Fatalf("open ecx raw: %v", err)
	}
	if _, err := rawReader.ReadAt(rawBuf, entrySize); err != nil {
		t.Fatalf("read raw ecx entry: %v", err)
	}
	rawReader.Close()
	rawSize := types.BytesToSize(rawBuf[types.NeedleIdSize+types.OffsetSize:])
	if rawSize.IsDeleted() {
		t.Fatalf("runtime delete must not mutate .ecx on disk; got tombstone in raw entry")
	}
	// Close the volume so RebuildEcxFile can operate on the files, then
	// fold .ecj into .ecx as the decode path does and verify the rebuilt
	// index has the tombstone visible to external readers.
	ecVolume.Close()
	if err := erasure_coding.RebuildEcxFile(base); err != nil {
		t.Fatalf("RebuildEcxFile: %v", err)
	}
	rebuilt, err := os.Open(base + ".ecx")
	if err != nil {
		t.Fatalf("open rebuilt ecx: %v", err)
	}
	defer rebuilt.Close()
	if _, err := rebuilt.ReadAt(rawBuf, entrySize); err != nil {
		t.Fatalf("read rebuilt ecx entry: %v", err)
	}
	rebuiltSize := types.BytesToSize(rawBuf[types.NeedleIdSize+types.OffsetSize:])
	if !rebuiltSize.IsDeleted() {
		t.Fatalf("expected needle 2 to be tombstoned in rebuilt .ecx, got size=%d", rebuiltSize)
	}
}