* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd
Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.
1. [versioning-heal] grep prefix on every heal-related log line, with a
small fixed event vocabulary (produced / surfaced / healed / enqueue /
drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
/ teardown_failed / queue_full). One grep gives operators a single
event stream across the produce-to-drain lifecycle.
2. Escalate the "scanned N>0 entries but no valid latest" case in
updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
names the orphan entries it saw. This is the listing-after-rm
inconsistency signature that pinned down 259064a8's failure — it
should not be invisible at default log levels.
3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
that walks .versions/ directories under a prefix and reports the
stranded population. With -heal it clears the latest-version pointer
in place on stranded directories so subsequent reads return a clean
NoSuchKey instead of replaying the 10-retry self-heal loop.
* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry
Address PR review:
1. s3.versions.audit walked only the first 1024-entry page of each
.versions/ directory, so large directories were falsely reported as
"stranded". Now loop, advancing startName, until a page returns
fewer than 1024 entries.
2. The clean and orphan-only categories were double-counted: a directory
with no pointer and at least one orphan incremented both. Make them
mutually exclusive so report totals sum to versionsDirs.
3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
non-interruptible by ctx. A server shutdown / client disconnect
would wait out the budget per in-flight delete. Thread ctx through
deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
gRPC lifecycle handlers pass the stream ctx.
New test pins the behavior: cancelling ctx mid-backoff returns
ctx.Err() in <500ms instead of blocking ~6.3s.
* fix(s3/versioning): clearStale outcome + escape grep-able log fields
Two CodeRabbit follow-ups:
1. Successful pointer clear should suppress `produced`.
updateLatestVersionAfterDeletion's transient-rm fallback called
clearStaleLatestVersionPointer best-effort, then unconditionally
returned retryErr. The caller (deleteSpecificObjectVersion) saw the
error and emitted `event=produced` + enqueued the reconciler, even
though clearStaleLatestVersionPointer had just driven the pointer to
consistency and the next reader would get NoSuchKey via the
clean-miss path. Make clearStaleLatestVersionPointer return cleared
bool; on success the caller returns nil so neither produced nor the
reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
and CAS mismatches still report false so genuinely stranded state
keeps surfacing.
2. Escape user-controlled fields in heal log lines.
versioningHealInfof / Warningf / Errorf interpolated raw bucket /
key / filename / err text into a single-space-separated line. An S3
key (or error string from gRPC) containing whitespace, newlines, or
`event=...` could split one event into multiple tokens and spoof
fake fields downstream. Sanitize each arg in the helper: safe
values pass through; anything with whitespace, quotes, control
chars, or backslashes is replaced with its strconv.Quote form. No
caller changes — the format strings remain unchanged.
Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
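The sanitization rule from item 2 can be sketched roughly as follows. This is an illustration, not the helper in the patch: the function name sanitizeField, the exact character classes checked, and the bucket= / key= field names in the sample line are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
)

// sanitizeField is a hypothetical version of the per-argument escaping.
// Values made only of "safe" characters pass through unchanged; anything
// containing whitespace, quotes, control characters, or a backslash is
// replaced by its strconv.Quote form so it stays one quoted token.
func sanitizeField(s string) string {
	for _, r := range s {
		if unicode.IsSpace(r) || unicode.IsControl(r) ||
			r == '"' || r == '\'' || r == '\\' {
			return strconv.Quote(s)
		}
	}
	return s
}

func main() {
	// An S3 key crafted to look like an extra log field.
	key := "photos/a.jpg event=spoof"
	fmt.Printf("[versioning-heal] event=%s bucket=%s key=%s\n",
		"produced", sanitizeField("b1"), sanitizeField(key))
	// prints: [versioning-heal] event=produced bucket=b1 key="photos/a.jpg event=spoof"
}
```

Because the escaping happens on the arguments rather than in the format string, a helper shaped like this can be applied without touching call sites, which matches the "no caller changes" note above.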