seaweedfs/weed/shell
Chris Lu 9a6b566fb1 fix(shell): volume.fsck keeps going past a single broken chunk manifest (#9140)
* fix(shell): volume.fsck no longer aborts on a single broken chunk manifest

Previously a single entry whose chunk-manifest could not be read (e.g. the
manifest needle was missing or its sub-chunks pointed at a now-gone volume)
caused collectFilerFileIdAndPaths to return immediately with
"failed to ResolveChunkManifest". The whole fsck run failed, so an operator
with even one corrupted file could not use volume.fsck to find or clean up
unrelated orphan needles on other volumes. They had to locate and delete
the bad entries first, blind, with no help from fsck.

Log the resolution failure with the entry path, fall back to recording the
top-level chunk fids the entry references (data fids and manifest fids
themselves; sub-chunks behind the unresolvable manifest stay unknown), and
keep traversing. Track the count of unresolved entries on the command struct
and refuse -reallyDeleteFromVolume for the run when the count is non-zero,
since the in-use fid set is incomplete and a purge could otherwise delete
live sub-chunks behind the broken manifest. Read-only fsck still produces a
useful (if conservatively over-reported) orphan listing so the operator can
see and fix the broken entries first, then re-run with apply.
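
The fallback and the purge gate described above can be sketched roughly as
follows. This is a minimal illustration, not the code in weed/shell: the
chunk, entry and resolver types, collectFids, checkPurgeAllowed and the
failOnManifest stub are all hypothetical names invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// chunk and entry are hypothetical stand-ins for filer chunk and entry records.
type chunk struct {
	FileID     string
	IsManifest bool
}

type entry struct {
	Path   string
	Chunks []chunk
}

// resolver expands manifest chunks into the chunks they reference (assumed shape).
type resolver func(chunks []chunk) ([]chunk, error)

// errIncompleteFidSet is returned when a purge is requested on an incomplete fid set.
var errIncompleteFidSet = errors.New("in-use fid set is incomplete")

// failOnManifest is a demo stub: it fails whenever an entry has a manifest chunk.
func failOnManifest(chunks []chunk) ([]chunk, error) {
	for _, c := range chunks {
		if c.IsManifest {
			return nil, errors.New("manifest needle missing")
		}
	}
	return chunks, nil
}

// collectFids gathers in-use fids. When resolution fails for an entry it logs
// the path, records only the top-level fids, counts the entry, and keeps going.
func collectFids(entries []entry, resolve resolver) (fids []string, unresolved int) {
	for _, e := range entries {
		chunks, err := resolve(e.Chunks)
		if err != nil {
			fmt.Printf("warning: %s: failed to resolve chunk manifest: %v\n", e.Path, err)
			chunks = e.Chunks // fallback: sub-chunks behind the manifest stay unknown
			unresolved++
		}
		for _, c := range chunks {
			fids = append(fids, c.FileID)
		}
	}
	return fids, unresolved
}

// checkPurgeAllowed refuses a destructive run while any entry is unresolved.
func checkPurgeAllowed(reallyDelete bool, unresolved int) error {
	if reallyDelete && unresolved > 0 {
		return fmt.Errorf("%w: %d unresolved entries", errIncompleteFidSet, unresolved)
	}
	return nil
}
```

In this sketch a read-only run still returns the partial fid list, while a
run that asks for deletion gets an error until the unresolved count is zero.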

Discovered while diagnosing #9116.

* address review: use callback ctx and atomic counter

- Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error
  cancellation propagates into the manifest fetch instead of using
  context.Background().
- TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go),
  so the unresolvedManifestEntries field on commandVolumeFsck is shared across
  workers and was racing. Switch it to atomic.Int64 with Add/Load.
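
To illustrate why the counter needs to be atomic, here is a small sketch of a
counter shared by several worker goroutines. It assumes Go 1.19 or later for
atomic.Int64; the fsckState type, the countUnresolved method and the broken
predicate are hypothetical and only mimic the shape of a fan-out callback.

```go
package main

import (
	"sync"
	"sync/atomic"
)

// fsckState is a hypothetical stand-in for the command struct.
type fsckState struct {
	unresolvedManifestEntries atomic.Int64
}

// countUnresolved fans paths out to `workers` goroutines. Each worker bumps
// the shared counter with Add, which is safe without a mutex.
func (s *fsckState) countUnresolved(paths []string, workers int, broken func(string) bool) int64 {
	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if broken(p) {
					s.unresolvedManifestEntries.Add(1)
				}
			}
		}()
	}
	for _, p := range paths {
		jobs <- p
	}
	close(jobs)
	wg.Wait() // all Adds happen before this Load
	return s.unresolvedManifestEntries.Load()
}
```

With a plain int64 field and `s.counter++` in the workers, the same program
would be a data race that `go test -race` reports.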

* address review: reset counter per Do(), pass through ctx errors

- commandVolumeFsck is a singleton registered in init() and reused across
  shell invocations. Without resetting the unresolved-manifest counter at
  the top of Do(), a single failed run permanently suppressed
  -reallyDeleteFromVolume in the same shell session. Reset to 0 right
  after flag parsing.
- Treating context cancellation as manifest corruption was wrong: a
  Ctrl+C or deadline mid-traversal would inflate the counter and emit
  misleading "manifest broken" warnings for entries that were never
  examined. Detect context.Canceled / context.DeadlineExceeded and
  return the error so the BFS unwinds cleanly.
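
Both points can be sketched together. The example below is an assumption-laden
illustration rather than the actual Do() implementation: the command type,
its do and handleResolveErr methods, and the two sample errors are
hypothetical. Only errors.Is, context.Canceled, context.DeadlineExceeded and
atomic.Int64 are real standard-library names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// Sample errors for the demo (hypothetical).
var (
	errBrokenManifest = errors.New("manifest needle missing")
	errWrappedCancel  = fmt.Errorf("fetch manifest: %w", context.Canceled)
)

// command is a hypothetical singleton reused across shell invocations.
type command struct {
	unresolved atomic.Int64
}

// isCtxErr reports whether err is, or wraps, a cancellation or deadline error.
func isCtxErr(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// handleResolveErr counts real resolution failures and passes context errors
// through so the traversal can unwind instead of logging false warnings.
func (c *command) handleResolveErr(err error) error {
	if err == nil {
		return nil
	}
	if isCtxErr(err) {
		return err
	}
	c.unresolved.Add(1)
	return nil
}

// do simulates one run; resolveErrs stands in for per-entry resolution results.
func (c *command) do(resolveErrs []error) (int64, error) {
	c.unresolved.Store(0) // reset per run so an earlier failure does not leak in
	for _, err := range resolveErrs {
		if herr := c.handleResolveErr(err); herr != nil {
			return c.unresolved.Load(), herr
		}
	}
	return c.unresolved.Load(), nil
}
```

Because errors.Is walks the wrap chain, a cancellation wrapped by an
intermediate layer is still recognised and not counted as a broken manifest.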

Not changing the findMissingChunksInFiler branch's purgeAbsent /
applyPurging gating: that path checks recorded filer fids against
volume idx files, and a broken-manifest entry's recorded manifest fid
will fail the existence check and get purged, which is the cleanup
the operator wants for those entries. Adding a gate would block the
exact use case the warning points them at.
2026-04-19 23:06:28 -07:00