Files
seaweedfs/weed/command/command.go
Jaehoon Kim be451d22b5 feat(filer.sync): add -verifySync mode to filer.sync for cross-cluster file comparison (#9284)
* Add -verifySync flag to filer.sync for cross-cluster file comparison

Add a verification mode to filer.sync that compares entries between two
filers without performing actual synchronization. Uses directory-level
sorted merge of ListEntries to detect missing files, size mismatches,
and ETag mismatches. Supports -isActivePassive for unidirectional check
and -modifyTimeAgo to skip recently modified files during sync lag.

* Add mtime annotation and JSON output to filer.sync -verifySync

Add automatic mtime relation analysis for SIZE_MISMATCH and
ETAG_MISMATCH diffs, and an NDJSON output mode for external tooling.

mtime classification:
- B_NEWER => "late_updates_skip_likely" hint. Surfaces the case
  where target has a stub entry whose mtime is ahead of source's
  real file, causing UpdateEntry's mtime guard in filersink to
  permanently skip the update.
- A_NEWER => "sync_lag_or_event_miss" hint.
- EQUAL   => no hint (chunk-level issue suspected).

Text output example:
  [SIZE_MISMATCH] /path (a=996, b=0, B newer +274d [late-updates skip likely])

Add -verifyJsonOutput flag. When set, emits one JSON object per
line (NDJSON) for diffs and a final SUMMARY object, suitable for
piping into external diagnostic pipelines.

Concurrent writes from the directory worker pool are now serialized
via outputMu to keep both text lines and JSON records atomic.

* fix(filer.sync): use shared global semaphore in verifySync to bound goroutine explosion

Replace the per-call local semaphore in compareDirectory with a single
shared semaphore created in runVerifySync. The old per-level semaphore
applied a limit of verifySyncConcurrency only within each directory level,
allowing effective concurrency to grow as verifySyncConcurrency^depth on
deep trees.

The shared semaphore is held only for each directory's I/O phase
(listEntries + merge) and released before recursing into subdirectories,
so a parent never blocks waiting for children to acquire slots — which
would deadlock once tree depth exceeds the semaphore capacity.

Extract the capacity into a named constant (verifySyncConcurrency = 5)
with a comment explaining the memory vs. performance trade-off.

Add unit tests:
- correctness: missing file, only-in-B, size mismatch, active-passive mode
- concurrency bound: peak concurrent listings ≤ verifySyncConcurrency
- no-deadlock: binary tree of depth 10 completes within timeout

* fix(filer.sync): stream directory entries to prevent OOM on large directories

Replace the listEntries helper (which accumulated all entries into a
single []filer_pb.Entry slice) with an entryStream type that pages
through the directory in the background and forwards entries one at a
time through a buffered channel. Memory per directory comparison is now
O(channel buffer size = 64) regardless of how many entries the directory
contains.

Key design points:
- entryStream wraps a goroutine + buffered channel with a one-entry
  lookahead (peek/advance) so the two-pointer sorted merge in
  compareDirectory can work without buffering any full listing.
- A child context (mergeCtx) is passed to both stream goroutines so
  they are cancelled promptly if compareDirectory returns early (e.g.
  on error); the ctx.Done() select arm in the callback prevents
  goroutine leaks when the consumer stops reading.
- stream.err is written by the goroutine before close(ch), so it is
  safe to read after the channel is exhausted (Go memory model:
  channel close happens-before the zero-value receive).
- countMissingRecursive is rewritten to use ReadDirAllEntries with a
  direct callback, eliminating its own slice allocation.
- listEntries is removed; it is no longer called anywhere.

* fix(filer.sync): address verifySync review findings

Four real bugs found and fixed; one finding already resolved (shared
semaphore was introduced in a prior commit).

path.Join for child paths (filer_sync_verify.go)
  fmt.Sprintf("%s/%s", dir, name) produced "//name" when dir was "/".
  Replace all child-path concatenations with path.Join so root-level
  walks emit clean paths.

cutoffTime check for ONLY_IN_B entries (filer_sync_verify.go)
  The B-only branch ignored -modifyTimeAgo, so files recently written
  to B were reported as ONLY_IN_B instead of being skipped. Mirror the
  A-side mtime guard: skip and increment skippedRecent when the entry
  is newer than cutoffTime.

Summary emitted before error check (filer_sync_verify.go)
  A filer I/O error mid-walk still caused a SUMMARY record (or text
  summary) to be printed, making partial runs appear complete. Move the
  error check to before summary emission; on error, return immediately
  without printing any summary.

Return false on verification failure (filer_sync.go)
  runVerifySync returned true (exit 0) even when diffs were found or the
  walk failed. Return false so the main binary sets exit status 1,
  consistent with how all other commands signal failure.

* test(filer.sync): add missing verifySync test coverage

Four new tests covering gaps identified during review:

TestVerifySyncETagMismatch
  Verifies that two files with identical size but different Md5 checksums
  are counted as etagMismatch (not sizeMismatch). Exercises the second
  branch of compareEntries that was previously untested.

TestVerifySyncCutoffTime (4 subtests)
  A-only recent  — recent file skipped (skippedRecent++), not MISSING
  A-only old     — old file reported as MISSING
  B-only recent  — recent file skipped (skippedRecent++), not ONLY_IN_B
  B-only old     — old file reported as ONLY_IN_B
  The B-only subtests specifically cover the cutoffTime fix added in the
  previous commit.

TestVerifySyncRootPath
  Regression for the path.Join fix: walks from "/" and verifies that the
  child directory is reached and compared correctly (the old Sprintf
  produced "//data" which would silently produce wrong results).
  Asserts dirCount=2 and fileCount=1 to confirm the full tree is walked.

* fix(filer.sync): use os.Exit(2) instead of return false on verify failure

return false triggered weed.go's error handler which printed the full
command usage — appropriate for invalid arguments, not for a completed
verification that found differences. Use os.Exit(2) consistent with
the existing pattern in filer_sync.go (lines 251, 293).

* refactor(filer.sync.verify): split verify into its own command

The verify mode is a one-shot batch operation with a fundamentally
different lifecycle from the long-running sync subscriber, and most of
filer.sync's flags (replication, metrics port, debug pprof, concurrency,
etc.) do not apply to it. Extract it into a sibling command alongside
filer.copy/filer.backup/filer.export rather than a flag mode on
filer.sync.

Also rename modifyTimeAgo to modifiedTimeAgo (grammatical) and drop the
verifyJsonOutput prefix to plain jsonOutput now that the verify context
is implicit in the command name.

* fix(filer.sync.verify): address review comments

- Bounded worker pool: cap subdirectory goroutines per level via a
  jobs channel and min(verifySyncConcurrency, len(subDirs)) workers
  instead of spawning one goroutine per child. Wide directories no
  longer park ~2KB per queued goroutine.

- Don't gate recursion on a directory's mtime: a fresh child write
  bumps the parent mtime, but older files inside should still be
  reported as missing. Always recurse for missing-in-B directories
  and apply the cutoff per-file inside countMissingRecursive.

- Apply -modifiedTimeAgo symmetrically: matched-name files now skip
  the comparison when EITHER side is recently modified, not just A.
  This restores lag tolerance when B was just rewritten.

Adds tests for both new behaviors and a shared isTooRecent helper.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-29 12:33:53 -07:00

1.9 KiB