seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-22 09:41:28 +00:00

Files

Chris Lu 1854101125 feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386 )

* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8)

scan_only actions only fire from the bootstrap walk: the engine
classifies a rule as scan_only when its retention horizon exceeds the
meta-log retention, so event-driven routing can't be trusted. Today
each bucket walks once per process, so a long-running worker never
revisits — scan_only retention only catches up when the worker
restarts.

Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk
(name -> completion time). KickOffNew now re-walks a bucket whose
last walk completed more than BootstrapInterval ago. Zero interval
preserves the legacy walk-once-per-process behavior so existing
deployments don't change cadence by default. walkBucket re-stamps
on success and clears the stamp on failure (via MarkDirty), so the
next KickOffNew picks failed walks back up.

Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and
a Now func() for testable time travel.

weed shell run-shard grows --bootstrap-interval (cadence knob) and
--force-bootstrap (drop in-memory state at startup so every bucket
walks again immediately, useful when a config change should take
effect without a restart).

Tests: cadence respected (skip inside interval, re-walk past it);
zero interval keeps once-per-process; MarkDirty forces re-walk
under a 24h interval; MarkAllDirty resets every record. The
fakeClock helper guards the test clock with a mutex so race-detector
runs are clean.

* fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag

Three issues with the Phase 8 cadence work as it landed:

1. lastWalk did double duty as both completed-walk timestamp and
in-flight debounce. A walk that took longer than BootstrapInterval
would have a fresh KickOffNew start a duplicate goroutine on the
next refresh tick because the stamp from KickOffNew looked stale
against the interval. Split into lastCompleted (set on success)
and inFlight (set on dispatch, cleared after the walk goroutine
returns success or failure). KickOffNew skips inFlight buckets
regardless of cadence.

2. The cadence knob existed on `weed shell` but not on the production
path: scheduler.Scheduler constructed BucketBootstrapper without
BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had
no field for it. Add Scheduler.BootstrapInterval, parse
`bootstrap_interval_minutes` in ParseConfig (zero = legacy walk-
once-per-process; negative clamps to zero), and forward it from
the handler. Tests cover default, override, clamp, and explicit-zero.

3. --force-bootstrap was a no-op: BucketBootstrapper is freshly
allocated at command start, so MarkAllDirty on empty state does
nothing, and the flag couldn't influence an already-running
process anyway. Remove it; a real runtime trigger (SIGHUP, control
RPC) is a separate change.

In-flight regression: a blockingInjector pins the first walk in
progress while the test advances the clock past the interval. The
second KickOffNew is a no-op (inFlight check). After release, the
post-completion KickOffNew within the interval is also a no-op.

* test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock

The cadence test polled listedN to know "the walk happened" — but
that fires once both list passes are issued, while the success-stamp
lands later, after walkBucketDir returns. A clock.Advance(30m)
between those two events would record the stamp at clock+30m
instead of T0; the next assertion would then see now.Sub(last) < 1h
and skip the expected re-walk. Tight in practice but exposed under
-race / load.

Add a waitForCompleted helper that polls b.lastCompleted directly,
and use it before each clock advance in both the cadence and zero-
interval tests.

* fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks

Two follow-ups on Phase 8.

The worker config descriptor had no bootstrap_interval_minutes field,
so the production operator UI couldn't enable the cadence — only the
internal ParseConfig + Scheduler wiring knew about it. Add the field
to the cadence section (MinValue=0 since 0 is the legacy default) and
include the default in DefaultValues so existing deployments see the
knob with the right preset.

MarkDirty / MarkAllDirty silently lost their effect when a walk was
in flight: the methods cleared lastCompleted, but the walk's success
path then wrote a fresh timestamp, hiding the operator's invalidation.
Track a pendingDirty set; the walk goroutine consumes the flag on
exit and skips the success stamp, so the next KickOffNew picks the
bucket up immediately.

Regression: pin a walk in progress with a blockingInjector, MarkDirty
the bucket, release the walk, and assert lastCompleted stayed empty
plus the next KickOffNew triggers a new walk inside the
BootstrapInterval window.

* refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty

These methods were the operator-hook half of Phase 8, but the only
caller (--force-bootstrap on the shell command) was removed when it
turned out to be a no-op against a freshly-allocated bootstrapper.
Nothing in production calls them anymore.

Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty
set, the dirty-suppression branch in walkBucket, and the three tests
that only exercised those methods. BootstrapInterval-driven
re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP,
control RPC) is a separate change with a real call site.

2026-05-09 13:42:31 -07:00