seaweedfs/weed/command/filer_backup_test.go
Chris Lu c1ccbe97dd feat(filer.backup): -initialSnapshot seeds destination from live tree (#9126)
* feat(filer.backup): -initialSnapshot seeds destination from live tree

Replaying the metadata event log on a fresh sync only leaves files that
still exist on the source at replay time: any entry that was created and
later deleted is replayed as a create/delete pair and never materializes
on the destination. Users who wipe the destination and re-run
filer.backup therefore see "only new files" instead of a full backup,
even when -timeAgo=876000h is passed and the subscription genuinely
starts from epoch (ref discussion #8672).

Add an opt-in -initialSnapshot flag: when set on a fresh sync (no prior
checkpoint, -timeAgo unset), walk the live filer tree under -filerPath
via TraverseBfs and seed the destination through sink.CreateEntry, then
persist the walk-start timestamp as the checkpoint and subscribe from
there. Capturing the timestamp before the walk lets the subscription
catch any create/update/delete racing with the walk — sink CreateEntry
is idempotent across the builtin sinks so replay is safe.

Honors existing -filerExcludePaths / -filerExcludeFileNames /
-filerExcludePathPatterns filters and skips /topics/.system/log the
same way the subscription path does.

Also log "starting from <t> (no prior checkpoint)" instead of a
misleading "resuming from 1970-01-01" when the KV has no stored offset.
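The flow above can be sketched as follows. This is a minimal standalone sketch, not the actual doFilerBackup code: walk, create, persist, and subscribe are hypothetical stand-ins for TraverseBfs, sink.CreateEntry, setOffset, and the metadata subscription. The key point it illustrates is capturing the timestamp before the walk, so events racing the walk are replayed by the subscription (safe because CreateEntry is idempotent).

```go
package main

import (
	"fmt"
	"time"
)

// seedAndSubscribe captures the checkpoint timestamp BEFORE walking the live
// tree, seeds the destination, persists that timestamp, then subscribes from
// it. Any create/update/delete that races the walk falls after startTsNs and
// is therefore replayed by the subscription.
func seedAndSubscribe(
	walk func(visit func(path string)),
	create func(path string),
	persist func(tsNs int64),
	subscribe func(fromTsNs int64),
) {
	startTsNs := time.Now().UnixNano() // capture before the walk, not after
	walk(create)
	persist(startTsNs)
	subscribe(startTsNs)
}

func main() {
	tree := []string{"/data/a.txt", "/data/sub/b.txt"}
	var seeded []string
	seedAndSubscribe(
		func(visit func(string)) {
			for _, p := range tree {
				visit(p)
			}
		},
		func(p string) { seeded = append(seeded, p) },
		func(ts int64) { fmt.Println("checkpoint persisted") },
		func(ts int64) { fmt.Println("subscribing from checkpoint") },
	)
	fmt.Println(seeded) // [/data/a.txt /data/sub/b.txt]
}
```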

* fix(filer.backup): guard initialSnapshot counters under TraverseBfs workers

TraverseBfs fans the callback out across 5 worker goroutines, so the
entryCount / byteCount updates and the 5-second progress-log gate in
runInitialSnapshot were racing. Switch the counters to atomic.Int64 and
protect the lastLog check/update with a short-scoped mutex so the heavy
sink.CreateEntry call stays outside the critical section.

Flagged by gemini-code-assist on #9126; verified with go test -race.

* fix(filer.backup): harden initialSnapshot against transient errors and path edge cases

Three review items from CodeRabbit on #9126:

1. getOffset errors no longer leave isFreshSync=true. Before, a transient
   KV read failure would cause runFilerBackup's retry loop to redo the
   full -initialSnapshot walk on every retry. Treat any offset-read
   error as "not fresh" so the snapshot only runs when we've verified
   there really is no prior checkpoint.

2. initialSnapshotTargetKey now normalizes sourcePath to a trailing-
   slash base before stripping the prefix, so edge cases where
   sourceKey equals sourcePath (trailing-slash mismatch or root-entry
   emission) no longer index past the end. Unit tests cover both
   forms.

3. Documented the TraverseBfs-enumerates-excluded-subtrees performance
   characteristic on runInitialSnapshot, since pruning requires a
   separate change to TraverseBfs itself.
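The normalization in item 2 can be sketched like this. relativeKey is an illustrative helper, not the real initialSnapshotTargetKey signature: it shows why normalizing sourcePath to a trailing-slash base before stripping the prefix makes the sourceKey == sourcePath edge cases safe instead of indexing past the end.

```go
package main

import (
	"fmt"
	"strings"
)

// relativeKey strips sourcePath from sourceKey after normalizing sourcePath
// to a trailing-slash base. When sourceKey equals sourcePath (with or
// without the trailing slash), the result is "" rather than a panic from
// slicing past the end of the string.
func relativeKey(sourcePath, sourceKey string) string {
	base := sourcePath
	if !strings.HasSuffix(base, "/") {
		base += "/"
	}
	if sourceKey == sourcePath || sourceKey+"/" == base {
		return "" // root-entry emission: nothing left after the prefix
	}
	return strings.TrimPrefix(sourceKey, base)
}

func main() {
	fmt.Println(relativeKey("/data", "/data/sub/file.txt")) // sub/file.txt
	fmt.Println(relativeKey("/data/", "/data/file.txt"))    // file.txt
	fmt.Println(relativeKey("/data", "/data") == "")        // true (no panic)
	fmt.Println(relativeKey("/data/", "/data") == "")       // true (no panic)
}
```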

* fix(filer.backup): retry setOffset after initialSnapshot to avoid full re-walks

If the snapshot walk finishes but the subsequent setOffset fails, the
retry loop in runFilerBackup will re-enter doFilerBackup with an empty
checkpoint and run the full BFS again — on a multi-million-entry tree
that's hours of wasted work over a 100-byte KV write. Retry the write a
handful of times with exponential backoff before giving up, and log
loudly at the final failure (with snapshotTsNs + sinkId) so operators
recognize the symptom instead of guessing at mysterious repeated walks.

Nitpick raised by CodeRabbit on #9126.

* fix(filer.backup): initialSnapshot ignore404, skew margin, exclude dir-entry itself

Three review items from CodeRabbit on #9126:

1. ignore404Error now threads into runInitialSnapshot. If a file is listed
   by TraverseBfs and then deleted before CreateEntry reads its chunks,
   the follow path already ignores 404s — the snapshot path was aborting
   and triggering a full re-walk. Treat an ignorable 404 as "skip this
   entry, continue."

2. snapshotTsNs now uses `time.Now().Add(-1 * time.Minute)` instead of `time.Now()`.
   Metadata events are stamped server-side, so a fast backup-host clock
   could skip events that fire during or right after the walk. Matches
   the 1-minute margin meta_aggregator.go applies on initial peer
   traversal; duplicate replay is harmless because CreateEntry is
   idempotent.

3. Exclude checks now run against the entry's own full path, not just
   its parent. A walked directory whose full path matches SystemLogDir
   or -filerExcludePaths was being seeded to the destination; only its
   descendants were being skipped. Verified with a manual repro where
   -filerExcludePaths=/data/skipdir now keeps the skipdir entry itself
   off the destination.

* refactor(filer): share destKey helper between buildKey and initialSnapshot

Extract destKey(dataSink, targetPath, sourcePath, sourceKey, mTime) from
buildKey in filer_sync.go. Both the event-log path (buildKey) and the
initialSnapshot walk (initialSnapshotTargetKey) now go through the same
helper, so a walk-seeded file and an event-replayed file always resolve
to the same destination key.

As a bonus, buildKey picks up the defensive trailing-slash normalization
that initialSnapshotTargetKey introduced — no more index-past-end risk
when sourceKey happens to equal sourcePath. Also tightens the mTime
lookup to guard against nil Attributes (caught by an existing test
against buildKey when I first moved the lookup out of the incremental
branch).
2026-04-17 21:21:32 -07:00

package command

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"testing"
	"time"

	"github.com/seaweedfs/seaweedfs/weed/pb/filer_pb"
	"github.com/seaweedfs/seaweedfs/weed/replication/sink"
	"github.com/seaweedfs/seaweedfs/weed/replication/source"
	"github.com/seaweedfs/seaweedfs/weed/util"
	util_http "github.com/seaweedfs/seaweedfs/weed/util/http"
)

func TestMain(m *testing.M) {
	util_http.InitGlobalHttpClient()
	os.Exit(m.Run())
}

// readUrlError starts a test HTTP server returning the given status code
// and returns the error produced by ReadUrlAsStream.
//
// The error format is defined in ReadUrlAsStream:
// https://github.com/seaweedfs/seaweedfs/blob/3a765df2ff90839acb9acf910b73513417fa84d1/weed/util/http/http_global_client_util.go#L353
func readUrlError(t *testing.T, statusCode int) error {
	t.Helper()
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Error(w, http.StatusText(statusCode), statusCode)
	}))
	defer server.Close()
	_, err := util_http.ReadUrlAsStream(context.Background(),
		server.URL+"/437,03f591a3a2b95e?readDeleted=true", "",
		nil, false, true, 0, 1024, func(data []byte) {})
	if err == nil {
		t.Fatal("expected error from ReadUrlAsStream, got nil")
	}
	return err
}

func TestIsIgnorable404_WrappedErrNotFound(t *testing.T) {
	readErr := readUrlError(t, http.StatusNotFound)
	// genProcessFunction wraps sink errors with %w:
	// https://github.com/seaweedfs/seaweedfs/blob/3a765df2ff90839acb9acf910b73513417fa84d1/weed/command/filer_sync.go#L496
	genErr := fmt.Errorf("create entry1 : %w", readErr)
	if !isIgnorable404(genErr) {
		t.Errorf("expected ignorable, got not: %v", genErr)
	}
}

func TestIsIgnorable404_BrokenUnwrapChain(t *testing.T) {
	readErr := readUrlError(t, http.StatusNotFound)
	// AWS SDK v1 wraps transport errors via awserr.New which uses origErr.Error()
	// instead of %w, so errors.Is cannot unwrap through it:
	// https://github.com/aws/aws-sdk-go/blob/v1.55.8/aws/corehandlers/handlers.go#L173
	// https://github.com/aws/aws-sdk-go/blob/v1.55.8/aws/awserr/types.go#L15
	awsSdkErr := fmt.Errorf("RequestError: send request failed\n"+
		"caused by: Put \"https://s3.amazonaws.com/bucket/key\": %s", readErr.Error())
	genErr := fmt.Errorf("create entry1 : %w", awsSdkErr)
	if !isIgnorable404(genErr) {
		t.Errorf("expected ignorable, got not: %v", genErr)
	}
}

func TestIsIgnorable404_NonIgnorableError(t *testing.T) {
	readErr := readUrlError(t, http.StatusForbidden)
	genErr := fmt.Errorf("create entry1 : %w", readErr)
	if isIgnorable404(genErr) {
		t.Errorf("expected not ignorable, got ignorable: %v", genErr)
	}
}

// stubSink is a minimal ReplicationSink used to exercise initialSnapshotTargetKey
// without standing up a real sink. Only the two methods read by the key builder
// (GetName, IsIncremental) need meaningful behavior; the rest satisfy the interface.
type stubSink struct {
	name          string
	isIncremental bool
}

func (s *stubSink) GetName() string                             { return s.name }
func (s *stubSink) Initialize(util.Configuration, string) error { return nil }
func (s *stubSink) DeleteEntry(string, bool, bool, []int32) error {
	return nil
}
func (s *stubSink) CreateEntry(string, *filer_pb.Entry, []int32) error { return nil }
func (s *stubSink) UpdateEntry(string, *filer_pb.Entry, string, *filer_pb.Entry, bool, []int32) (bool, error) {
	return false, nil
}
func (s *stubSink) GetSinkToDirectory() string         { return "" }
func (s *stubSink) SetSourceFiler(*source.FilerSource) {}
func (s *stubSink) IsIncremental() bool                { return s.isIncremental }

var _ sink.ReplicationSink = (*stubSink)(nil)

func TestInitialSnapshotTargetKey(t *testing.T) {
	// Mirror the non-incremental path of buildKey so a refactor of one without
	// the other will fail this test.
	mirror := &stubSink{name: "mirror", isIncremental: false}
	got := initialSnapshotTargetKey(mirror, "/backup", "/data", util.FullPath("/data/sub/file.txt"), &filer_pb.Entry{})
	if got != "/backup/sub/file.txt" {
		t.Errorf("mirror sink: got %q, want %q", got, "/backup/sub/file.txt")
	}

	// Incremental sinks partition by entry mtime, so the seed must use the same
	// YYYY-MM-DD prefix a replayed CreateEntry would produce. buildKey in
	// filer_sync.go formats the date in local time, so compute the expected
	// key the same way to keep the test timezone-independent.
	inc := &stubSink{name: "inc", isIncremental: true}
	mtime := int64(1704196800) // 2024-01-02T12:00:00 UTC — unambiguously Jan 2 in nearly all timezones
	gotInc := initialSnapshotTargetKey(inc, "/backup", "/data", util.FullPath("/data/sub/file.txt"), &filer_pb.Entry{
		Attributes: &filer_pb.FuseAttributes{Mtime: mtime},
	})
	wantInc := "/backup/" + time.Unix(mtime, 0).Format("2006-01-02") + "/sub/file.txt"
	if gotInc != wantInc {
		t.Errorf("incremental sink: got %q, want %q", gotInc, wantInc)
	}

	// Trailing-slash sourcePath still produces a clean relative key.
	gotTrail := initialSnapshotTargetKey(mirror, "/backup", "/data/", util.FullPath("/data/file.txt"), &filer_pb.Entry{})
	if gotTrail != "/backup/file.txt" {
		t.Errorf("trailing-slash sourcePath: got %q, want %q", gotTrail, "/backup/file.txt")
	}

	// Edge cases CodeRabbit called out: sourceKey equal to sourcePath
	// (non-trailing and trailing variants). Real TraverseBfs walks never emit
	// the root itself, but the helper must not panic if something else does.
	if got := initialSnapshotTargetKey(mirror, "/backup", "/data", util.FullPath("/data"), &filer_pb.Entry{}); got != "/backup" {
		t.Errorf("sourceKey == sourcePath (no slash): got %q, want %q", got, "/backup")
	}
	if got := initialSnapshotTargetKey(mirror, "/backup", "/data/", util.FullPath("/data"), &filer_pb.Entry{}); got != "/backup" {
		t.Errorf("sourceKey == sourcePath (trailing slash mismatch): got %q, want %q", got, "/backup")
	}
}