* fix(shell): volume.fsck no longer aborts on a single broken chunk manifest
Previously a single entry whose chunk-manifest could not be read (e.g. the
manifest needle was missing or its sub-chunks pointed at a now-gone volume)
caused collectFilerFileIdAndPaths to return immediately with
"failed to ResolveChunkManifest". The whole fsck run failed, so an operator
with even one corrupted file could not use volume.fsck to find or clean up
unrelated orphan needles on other volumes — they had to locate and delete
the bad entries first, blind, with no help from fsck.

Instead, log the resolution failure with the entry path, fall back to recording the
top-level chunk fids the entry references (data fids and manifest fids
themselves; sub-chunks behind the unresolvable manifest stay unknown), and
keep traversing. Track the count of unresolved entries on the command struct
and refuse -reallyDeleteFromVolume for the run when the count is non-zero,
since the in-use fid set is incomplete and a purge could otherwise delete
live sub-chunks behind the broken manifest. Read-only fsck still produces a
useful (if conservatively over-reported) orphan listing so the operator can
see and fix the broken entries first, then re-run with apply.

Discovered while diagnosing #9116.
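
The fallback and purge gate described above can be sketched roughly as follows. This is a simplified illustration, not the actual volume.fsck code: the `chunk`, `entry`, `collector`, and `resolveFn` types, the `canPurge` helper, and the fid strings are all hypothetical stand-ins, and the real ResolveChunkManifest signature is not reproduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// chunk and entry are hypothetical stand-ins for filer chunk and entry records.
type chunk struct {
	fid        string
	isManifest bool
}

type entry struct {
	path   string
	chunks []chunk
}

// resolveFn stands in for manifest resolution: it returns every fid the
// entry ultimately references, or an error if a manifest cannot be read.
type resolveFn func(chunks []chunk) ([]chunk, error)

// collector accumulates the in-use fid set and counts unresolved entries.
type collector struct {
	inUse      map[string]bool
	unresolved int
}

// collect records the fids of one entry. On a resolution failure it logs,
// counts the entry, and falls back to the top-level fids only.
func (c *collector) collect(e entry, resolve resolveFn) {
	chunks, err := resolve(e.chunks)
	if err != nil {
		fmt.Printf("entry %s: failed to resolve chunk manifest: %v\n", e.path, err)
		c.unresolved++
		chunks = e.chunks // sub-chunks behind the manifest stay unknown
	}
	for _, ch := range chunks {
		c.inUse[ch.fid] = true
	}
}

// canPurge reports whether the in-use set is complete enough to delete from.
func (c *collector) canPurge() bool { return c.unresolved == 0 }

// demo walks two made-up entries, one of which has an unreadable manifest.
func demo() (inUse int, purgeAllowed bool) {
	c := &collector{inUse: map[string]bool{}}
	resolve := func(chunks []chunk) ([]chunk, error) {
		var out []chunk
		for _, ch := range chunks {
			if ch.isManifest && ch.fid == "3,bad" {
				return nil, errors.New("manifest needle missing")
			}
			out = append(out, ch)
		}
		return out, nil
	}
	c.collect(entry{path: "/a", chunks: []chunk{{fid: "1,0a"}}}, resolve)
	c.collect(entry{path: "/b", chunks: []chunk{{fid: "3,bad", isManifest: true}, {fid: "2,0b"}}}, resolve)
	return len(c.inUse), c.canPurge()
}

func main() {
	n, ok := demo()
	fmt.Println("fids recorded:", n, "purge allowed:", ok)
}
```

In this sketch the broken entry still contributes its two top-level fids, so the read-only listing stays usable while the purge is refused for the run.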
* address review: use callback ctx and atomic counter
- Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error
cancellation propagates into the manifest fetch instead of using
context.Background().
- TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go),
so the unresolvedManifestEntries field on commandVolumeFsck is shared across
workers and was racing. Switch it to atomic.Int64 with Add/Load.
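
A minimal sketch of the counter change, assuming Go 1.19 or later for `atomic.Int64`. The `fsckState` type and `countUnresolved` function are hypothetical; only the field name and the worker count of 5 come from the description above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// fsckState is a hypothetical stand-in for the command struct.
type fsckState struct {
	unresolvedManifestEntries atomic.Int64
}

// countUnresolved simulates several workers each reporting a number of
// unresolved entries concurrently, then reads the total.
func countUnresolved(workers, failuresPerWorker int) int64 {
	var s fsckState
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < failuresPerWorker; i++ {
				s.unresolvedManifestEntries.Add(1) // safe without a mutex
			}
		}()
	}
	wg.Wait()
	return s.unresolvedManifestEntries.Load()
}

func main() {
	fmt.Println(countUnresolved(5, 1000)) // 5000
}
```

With a plain `int64` field and `++`, the same program would be a data race and could lose increments; running it under `go run -race` would flag that variant.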
* address review: reset counter per Do(), pass through ctx errors
- commandVolumeFsck is a singleton registered in init() and reused across
shell invocations. Without resetting the unresolved-manifest counter at
the top of Do(), a single run that hit a broken manifest suppressed
-reallyDeleteFromVolume for the rest of the shell session. Reset it to 0
right after flag parsing.
- Treating context cancellation as manifest corruption was wrong: a
Ctrl+C or deadline mid-traversal would inflate the counter and emit
misleading "manifest broken" warnings for entries that were never
examined. Detect context.Canceled / context.DeadlineExceeded and
return the error so the BFS unwinds cleanly.
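
Both points above can be illustrated together. This is a sketch under stated assumptions: the `command` type, `handleResolveErr`, `do`, and the `/file%d` paths are hypothetical, and the real Do() does far more than this. Only the use of `errors.Is` with `context.Canceled` / `context.DeadlineExceeded` and the per-run reset mirror the description.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync/atomic"
)

// command is a hypothetical stand-in for the long-lived fsck command singleton.
type command struct {
	unresolved atomic.Int64
}

// handleResolveErr returns cancellation and deadline errors so the traversal
// stops; any other error is logged and counted as a broken manifest.
func (c *command) handleResolveErr(path string, err error) error {
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	fmt.Printf("entry %s: manifest unresolved: %v\n", path, err)
	c.unresolved.Add(1)
	return nil
}

// do models one shell invocation: reset the counter, then handle the given
// resolution errors in order, stopping at the first one that is returned.
func (c *command) do(errs []error) (int64, error) {
	c.unresolved.Store(0) // a previous run must not leak into this one
	for i, e := range errs {
		if herr := c.handleResolveErr(fmt.Sprintf("/file%d", i), e); herr != nil {
			return c.unresolved.Load(), herr
		}
	}
	return c.unresolved.Load(), nil
}

// demo runs the same command value twice, as a shell session would.
func demo() (first, second int64, cancelled bool) {
	c := &command{}
	first, _ = c.do([]error{errors.New("needle missing")})
	var err error
	second, err = c.do([]error{
		fmt.Errorf("resolve: %w", context.Canceled),
		errors.New("never examined"),
	})
	return first, second, errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(demo())
}
```

The first run counts one broken manifest. The second run starts from zero again and stops at the wrapped cancellation without counting it or the entry after it.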

The purgeAbsent / applyPurging gating in the findMissingChunksInFiler branch
is intentionally left unchanged: that path checks recorded filer fids against
volume idx files, and a broken-manifest entry's recorded manifest fid
will fail the existence check and get purged — which is the cleanup
the operator wants for those entries. Adding a gate would block the
exact use case the warning points them at.