1158 Commits

Chris Lu
e9bcb8f4ad docs(s3/lifecycle): refresh DESIGN.md as-built (#9491)
* docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages

DESIGN.md was written as a phased implementation plan ("Phase 2 will
ship X, Phase 4 will ship Y"). All phases are now merged, plus the
post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially
changed the worker model (single subscription, walker throttle,
observability gauges). Rewrite the doc in present tense describing
what's actually there.

Net changes vs the prior plan-style doc:
- Algorithm pseudo-code reflects the single-subscription fan-out plus
  walkedThisPass within-pass guard.
- Walker invocation table replaces the implicit "two distinct calls"
  prose with three call sites (recovery / steady-state / empty-replay)
  and their throttle gates.
- New section on the subscription model (one Reader, ShardPredicate,
  fan-out by ev.ShardID).
- New section on cursor.LastWalkedNs and the WalkerInterval throttle.
- Observability section: gauges, heartbeat tokens, what each means.
- "Implementation history" table maps phases to merged PRs.
- "Future work" lists the four optimizations we deferred (long-lived
  subscription, bucket-coordinated walker, per-bucket lag metric,
  filer meta-log retention).

Drop the "Phase N — ..." narrative from the bottom; the PR history
table is the durable artifact now.

Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for
the operator-facing docs. README explains the sync workflow with the
external seaweedfs.wiki.git repo. Five pages:

- Home.md — landing page, supported rule shapes, what the worker does
- Operator-Guide.md — config knobs, when to change each, walker
  interval recommendations by cluster size
- Monitoring.md — Prometheus metric reference + heartbeat token table
  + suggested PromQL alerts
- Troubleshooting.md — stuck cursor, walker stuck, failure outcomes,
  cursor schema for manual inspection
- Architecture.md — high-level overview for newcomers; sits between
  Home.md (operator) and DESIGN.md (developer)

* docs(s3/lifecycle): address PR review feedback on docs

Coderabbit + gemini findings on #9491:

- Monitoring.md: clarify the "matches all dispatched" phrasing; note
  that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value
  (shouldn't appear in healthy systems); filter PromQL alerts to
  ignore zero-valued gauges so fresh-install heartbeats don't trip.
- Operator-Guide.md, Troubleshooting.md: clarify weed shell -master
  format as host:http_port.grpc_port (SeaweedFS ServerAddress).
- Troubleshooting.md: pause the s3_lifecycle job in the admin UI
  before manually editing a cursor file, otherwise the worker's
  save races with the operator's edit.
- Architecture.md, Home.md, Operator-Guide.md, Monitoring.md,
  Troubleshooting.md, DESIGN.md: add language tags (`text`) to
  fenced code blocks for markdownlint MD040 compliance.
- DESIGN.md: standardize on the S3 spec rule names
  (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`,
  `AbortIncompleteMultipartUpload`) and add a one-line note mapping
  them to the engine's `ActionKind*` constants.
- README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the
  sync workflow so the `cp` commands' repo-root-relative paths work
  whether the operator's shell is at the repo root or at
  docs/wiki/s3-lifecycle/.
- Home.md: was lagging the wiki-repo merged version (had the older
  pre-merge content). Re-sync from the wiki repo so source matches.

* docs(s3/lifecycle): remove wiki pages from PR

The wiki pages belong in seaweedfs.wiki.git, not the main repo. The
source-of-truth concern that motivated adding them here is real but
the cost — every code-review touchpoint requires reviewers to load
operator-facing pages too — outweighs it. The wiki pages are already
pushed locally (~/dev/seaweedfs.wiki); they'll publish on the
operator-side workflow.

This PR remains scoped to DESIGN.md (the developer-facing reference
that does belong with the code).

* docs(s3/lifecycle): drop Implementation history section

git log is the durable record of what shipped when; the prose table
duplicates it and goes stale faster than commit metadata.

* docs(s3/lifecycle): soften 'exactly once per run' in Goal

The prior phrasing overstated the guarantee versus the failure model
documented later in the same file. Reword to: 'process due objects
each pass; retryable/blocked outcomes get retried from the cursor on
later runs.' Surfaces the head-of-line-blocking semantics up front so
the rest of the doc reads consistently.

Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those
pages live in the wiki repo, not the main repo.
2026-05-13 17:06:14 -07:00
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
c6582228b8 feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now);
TestRunShard_ColdStartDoesNotDoubleWalk still pins the integration
behavior end-to-end.
2026-05-13 14:09:13 -07:00
Chris Lu
79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. The clean and orphan-only categories double-counted when a directory
   had no pointer and at least one orphan: both counters were
   incremented. Make them mutually exclusive so report totals sum to
   versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
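A sketch of the sanitization rule, assuming the helper quotes on any unsafe rune (the name is hypothetical; the real call sites are versioningHealInfof and friends):

```go
package main

import (
	"strconv"
	"strings"
)

// sanitizeLogField keeps one log field one token: safe values pass through
// unchanged; anything containing whitespace, quotes, control characters,
// or backslashes is replaced with its strconv.Quote form so an S3 key like
// `event=spoof here` cannot split or spoof fields downstream.
func sanitizeLogField(s string) string {
	unsafe := strings.IndexFunc(s, func(r rune) bool {
		return r <= ' ' || r == '"' || r == '\'' || r == '\\' || r == 0x7f
	}) >= 0
	if unsafe {
		return strconv.Quote(s)
	}
	return s
}
```

Because format strings stay unchanged, callers need no edits — only the interpolated args are normalized.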
2026-05-13 10:48:58 -07:00
Chris Lu
f5a4bfb514 fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete

deleteSpecificObjectVersion did two non-atomic filer ops: rm the version
blob, then update the .versions/ pointer. Step 2 failures were silently
logged and the client got 204 OK, so any transient blip (filer timeout,
process restart between RPCs, lock contention) left the .versions/
directory naming a missing file. Subsequent GETs paid the 10-retry
self-heal cost and returned NoSuchKey — surfacing as "Storage not found"
to Veeam, which is what triggered this investigation.

Three changes:

1. Pre-roll the pointer for the singleton / multi-version-deleting-latest
   cases. The pointer is repointed (multi) or cleared (singleton) before
   the blob rm. A failure between leaves a recoverable orphan blob —
   pointer is consistent, GETs succeed or correctly miss without
   entering the stale-pointer self-heal path.

2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion
   with bounded retries (~6.3s worst case). When retries are exhausted
   the function now returns a non-nil error instead of swallowing it.
   The caller logs at Error level and queues the path for the
   reconciler.

3. Background reconciler drains stranded .versions/ pointer-to-missing
   states off the hot path. Bounded in-memory queue with capped retries;
   read-path heal remains as a last-resort safety net.

* fix(s3/versioning): address review on #9460

Four fixes addressing review on PR #9460. All four are correctness;
no behavioural change for the happy path.

1. repointLatestBeforeDeletion: discriminate NotFound from transient
   errors when re-fetching the .versions/ entry. Previously any error
   returned rolled=true,nil — a transient filer hiccup at that point
   would cause the caller to skip the post-delete reconciliation AND
   proceed with the blob rm, producing exactly the dangling pointer
   state the PR aims to prevent. NotFound stays "vacuously consistent"
   (directory already gone); other errors surface so the caller aborts
   before removing the blob.

2. Move the singleton .versions/ teardown out of
   repointLatestBeforeDeletion (where it ran BEFORE the blob rm and
   always failed with "non-empty folder") into deleteSpecificObjectVersion
   AFTER the blob rm. Adds a wasSingleton return value so the caller
   knows when to run the teardown. Without this, every singleton-version
   delete in a versioned bucket leaked an empty .versions/ directory.

3. Wrap the list, getEntry, and mkFile calls inside
   repointLatestBeforeDeletion with retryFilerOp so the pre-roll has
   the same transient-failure resilience as the post-roll path. Without
   retries, a single transient blip causes the caller to fall back to
   the legacy non-atomic flow even when the filer recovers immediately.

4. healVersionsPointer in the reconciler: same NotFound-vs-transient
   discrimination on both the .versions/ getEntry and the latest-file
   presence probe. Previously a transient filer error would silently
   evict the candidate from the queue as "healed", leaving the real
   stranded state until a client read happened to surface it.

Also fixes the gemini-flagged consistency nit: the queued-for-reconciler
error log now uses normalizedObject instead of object so it matches the
queue entry's key.

* fix(s3/versioning): short-circuit terminal errors in retryFilerOp

Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound,
gRPC NotFound, context.Canceled, and context.DeadlineExceeded.
retryFilerOp now bails immediately on a terminal error and returns it
unwrapped, so callers like repointLatestBeforeDeletion.getEntry and
updateLatestVersionAfterDeletion.rm see the raw NotFound instead of
paying the ~6.3 s retry-budget delay AND parsing it out of an
"exhausted N retries" wrapper.

errors.Is and status.Code already walk the %w chain so today's call
sites still work, but the delay was real on the hot DELETE path
whenever a key was genuinely absent. Test added covering all five
terminal-error shapes — each must run the wrapped fn exactly once and
return in under 50 ms.
2026-05-13 10:14:27 -07:00
Chris Lu
3f1eaf9724 fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD

Successful GET/HEAD object requests never produced a fluent audit entry
because those handlers write the response directly (streaming for GET,
WriteHeader for HEAD) and never reach a PostLog call site. The wiki
advertises GET as an audited verb, so the asymmetry surprises operators
who rely on the log for read-access auditing.

Move the safety net into the track() middleware: tag each request with
an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it,
and emit a single fallback entry after the handler returns when nothing
fired. The recorder's status flows into the fallback so the audit row
still reflects 200/206 vs 404 etc. No double logging for handlers that
already emit (write helpers, error paths, bulk delete).

Refs #9463

* fix(s3/audit): defensive nil checks on audit-tracking helpers

Address PR review: guard against nil request and nil *atomic.Bool stored
under the audit-tracking key. The conditions are unreachable today (the
key is private and we only ever store new(atomic.Bool)), but the checks
are free and keep the helpers safe if a future caller misbehaves.

* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)

test(s3/audit): cover track() fallback wiring + cleanup

Adds two unit tests in weed/s3api/stats_test.go that exercise the
audit-tracking flag set up by track(): one verifies the fallback path
fires when a handler writes the response directly (the GET/HEAD object
regression in #9463), the other verifies the flag is set when a handler
emits PostLog itself so the fallback is skipped.

To make the wiring observable without standing up fluent, PostLog now
marks the audit flag before short-circuiting on a nil Logger; production
behavior is unchanged (no logger, no posting) but the flag stays
consistent.

Also drops two stale comments in s3api_object_handlers.go that still
referenced proxyToFiler — that helper was removed when GET/HEAD started
streaming from volume servers directly.

Stacks on #9467.
2026-05-13 09:24:59 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
2026-05-13 09:24:50 -07:00
Chris Lu
37e505b8fd refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass

Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.

drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.

cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.

Filer load: 16x -> 1x SubscribeMetadata streams per pass.
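The fan-out loop can be sketched as below — event and channel shapes are hypothetical stand-ins for the real SubscribeMetadata stream and per-shard drains:

```go
package main

// event models a meta-log event for this sketch.
type event struct {
	ShardID int
	TsNs    int64
}

// fanOut routes each event from the single subscription to its shard's
// channel by ev.ShardID. When the first event past the pass boundary
// arrives, it cancels the reader — meta-log order guarantees everything
// after it is past the boundary too. Closing every per-shard channel on
// exit lets each drain goroutine finish (every shard must be live, or the
// fan-out would stall on a full channel).
func fanOut(events <-chan event, perShard map[int]chan event, boundaryTsNs int64, cancel func()) {
	defer func() {
		for _, ch := range perShard {
			close(ch)
		}
	}()
	for ev := range events {
		if ev.TsNs > boundaryTsNs {
			cancel() // pass boundary reached: stop the reader
			return
		}
		if ch, ok := perShard[ev.ShardID]; ok {
			ch <- ev
		}
	}
}
```

Per-shard filtering of ev.TsNs <= startTsNs happens on the drain side, not here, since shards can have fresher cursors than the global floor.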

* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)

The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.

Same regression I fixed in 6796ab6db for the per-shard subscription;
now applied at the shared level. computeGlobalStartTsNs loads every
shard's cursor and picks the minimum, falling back to the cold-start
floor only for shards with no persisted cursor.
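A sketch of the floor computation, assuming a simple shard-to-cursor map in place of the real per-shard cursor loads:

```go
package main

// computeGlobalStartTsNs picks the shared subscription's starting TsNs:
// the minimum persisted per-shard cursor, so no shard's still-needed
// events are filtered before the Reader forwards them. Shards with no
// persisted cursor fall back to the cold-start floor (runNow - maxTTL).
func computeGlobalStartTsNs(cursors map[int]int64, shards []int, coldStartFloor int64) int64 {
	start := coldStartFloor
	for _, shard := range shards {
		if ts, ok := cursors[shard]; ok && ts < start {
			start = ts
		}
	}
	return start
}
```

Taking the minimum (rather than the floor) is what protects the TTL == maxTTL case where a pending event's PUT TsNs sits exactly at runNow - maxTTL.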
2026-05-13 02:13:11 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    walker-only-only snapshots produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events fall outside the meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain's floor fall back to
runNow - maxTTL as before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard invoked with no -refresh — only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (the scheduler/configload
tests' helper file the deleted bootstrap_test.go used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
2f682303fb fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464)
Follow-up to PR #9459 (merged before this fix landed). The walker
dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and
"NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay
drain (processMatches in run.go via #9462) use "RPC_ERROR" for the
same condition. Aligning so a single Prometheus query covers all
three delete paths.

Folds nil-response under RPC_ERROR rather than a separate label —
operationally it's the same class of failure (server returned no
usable response).
2026-05-12 12:38:52 -07:00
Chris Lu
495632730c feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log

Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds the
three baseline questions: how long does a shard take, how many events
did it scan, and what did dispatch produce.

  - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
    wall-clock per shard. p95 climbing toward MaxRuntime means the
    shard is brushing its budget.
  - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
    events drainShardEvents processed. Pairs with the duration so a
    spike in events-per-shard correlates with a slow shard.
  - S3LifecycleDispatchCounter (existing, reused): processMatches now
    increments this with the outcome label, so streaming and
    daily_replay paths share one outcome view. Transport errors are
    counted under outcome="TRANSPORT_ERROR".

dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.

Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.

* fix(s3/lifecycle): align dispatch error label + clean test labels

Two PR-9462 review fixes from gemini:

1. processMatches' transport-failure label was "TRANSPORT_ERROR";
   streaming's dispatcher uses "RPC_ERROR" for the same condition
   (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
   the same Prometheus query covers both delete paths.

2. The dispatch-counter assertion test now deletes its label row
   on exit so the in-process Prometheus registry doesn't accumulate
   per-test state across the suite.
2026-05-12 12:15:20 -07:00
Chris Lu
f954781169 feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config

Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.

Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.

* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker

Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.

Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.

Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.

* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch

Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).

Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.

Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.

* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard

Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.

Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.

Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.

* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker

Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.

listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.

Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).

* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc

Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:

  - Path = logical key (not the .versions/<file> physical path), so
    bootstrap.Walk's MatchPath uses the user's intended path.
  - VersionID per sibling (version_id or "null").
  - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
    to explicit-null-bare, falling back to newest-by-mtime.
  - NoncurrentIndex rank computed against the latest's position.
  - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
    previous-newer sibling's mtime (legacy derivation).
  - IsDeleteMarker from ExtDeleteMarkerKey.
  - NumVersions = len(siblings).

Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.

expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.

MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.

Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.

* feat(s3/lifecycle): emit MPU init records from FilerListFunc

Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.

isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.

Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.

* feat(s3/lifecycle): wire walker into executeDailyReplay

Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.

Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.

Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.

* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate

Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow PromotedHash used, so the partition
is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.

With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.

* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment

Two PR-review fixes on 9459:

1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
   add a defensive nil-response check so the walk halts cleanly
   instead. Spotted by coderabbit.

2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
   "stays empty" in fallback mode, which gemini correctly pointed out
   is only true once rules are active. During bootstrap (rules
   compiled but IsActive=false) MaxEffectiveTTL is 0 while
   PromotedHash counts every non-disabled rule, so promoted becomes
   non-empty and the next post-activation run hits the recovery
   branch. That's the intended bootstrap walk — rewrite the comment
   to explain it rather than misstate the invariant.

Test: pins nil-response → error path on WalkerDispatcher.

* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion

Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.

Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.

Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).

* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher

Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.

Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
2026-05-12 11:39:15 -07:00
Chris Lu
644664bbee feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457)
* feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a)

Replace the local replay-content-hash / max-effective-TTL helpers in
dailyrun with the engine package's canonical versions (ReplayContentHash,
MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface.

Adds PromotedHash to the cursor's recovery triggers: a partition flip
(rule moving between replay and walk because retention shifted) now
fires the rule-change branch alongside RuleSetHash mismatch. The
retentionWindow is set to MaxEffectiveTTL today, which keeps the
promoted set empty and the trigger dormant; Phase 4b will plumb the
real meta-log retention boundary so true scan_only promotions are
detected.

Cursor schema is unchanged — PromotedHash was already persisted as
the zero hash in Phase 2.

* docs(s3/lifecycle): note the one-time cursor rewind on hash format change

gemini-code-assist flagged that swapping localReplayContentHash for
engine.ReplayContentHash changes the persisted RuleSetHash byte layout
(sort order + tagged-field encoding). Phase-2 cursors mismatch on first
post-upgrade run and drop into the rule-change branch.

Going with option 3 (document the intentional one-time rewind). The
rewind is bounded to runNow - maxTTL (not time-zero), self-healing on
the next save, and daily_replay is off by default so the affected
population is limited to early adopters of the algorithm flag. A
migration shim or a hash-compat layer would carry the legacy encoder
forever for one bounded re-scan; not worth it.

Comment in runShard makes the trade explicit so a future reader doesn't
hunt for the "why does my cursor rewind once after upgrade" mystery.

* chore(s3/lifecycle): trim verbose comments in dailyrun

Cut multi-paragraph headers and narration that just described what the
code does. Kept the small WHY notes (per-match skip vs per-rule, the
one-time post-upgrade cursor rewind, scan_only rejection rationale).
Same behavior, ~150 fewer lines of comment.

* fix(s3/lifecycle): persist PromotedHash on the successful runShard save

The comment-trim pass dropped the field alongside a "stays empty in
Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b
turns promoted into a real value — and a save that writes zero would
make the next run falsely detect drift and rewind. Spotted by
gemini-code-assist on PR 9457.

Other save paths (recovery, drain-error) already persisted it; the
success path is the only one that was missing it. Now consistent.
2026-05-11 21:18:19 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95 — near-zero means the cap
  isn't binding; a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.

Tests swap the SharedEvenly case for two pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
3f4cb6d2fb feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)

Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash),
and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot
is a stub keyed on a package-level atomic that the worker startup wiring
populates.

Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.

No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.

API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.

30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split,
RecoveryView activating pre-BootstrapComplete actions,
ReplayContentHash partition-independence, PromotedHash sensitivity to
promotion in either direction, MaxEffectiveTTL aggregation. Build +
race-tests green.

* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics

Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):

1. Duplicated sort comparator between ReplayContentHash and
   PromotedHash. Extract sortHashItems shared helper so the two
   hashes use the same ordering by construction — if one drifted, the
   cursor could see a spurious "rule changed" on a no-op snapshot
   rebuild.

2. Duplicated writeField/writeInt closures. Extract hashWriter struct
   holding the sha256 running hash + lenbuf, with method helpers.
   Same allocation profile (one Hash, one tiny stack buffer per
   helper); just deduplicates ~20 lines.

3. shardID parameter on RulesForShard is unused. Per the design's
   open question, every shard sees every rule today (shard filter
   runs at the entry-iteration site, not view construction). Keep
   the parameter for API stability — removing it now would force
   a breaking change when bucket-shard ownership lands — and update
   the doc comment to explain why it's reserved.

go build ./... clean; engine test suite green.
2026-05-11 18:07:54 -07:00
Chris Lu
122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)
* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
  under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
  Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
  replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s;
  server outcomes (RETRY_LATER / BLOCKED) halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
  exist; only the hash gates a future rule addition.

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
  default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
  current scheduler.Scheduler, daily_replay routes through
  dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
  same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
  4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
  them with one-line redirects; the local versions hash the same
  fields so the cursor stays valid across the swap.

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
  due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
   persisted cursor to lastOK = TsNs of the last event processed,
   including events whose matches were skipped as not-yet-due. Those
   skipped matches would never be re-scanned, so objects under
   long-TTL rules would never expire.

   Track a "stuck" flag in drainShardEvents: the first event with a
   skipped (future-DueTime) match stops cursorAdvanceTo from rising,
   but the loop keeps processing later events to dispatch any that ARE
   due. The persisted cursor sits at the last fully-processed event so
   tomorrow's run re-scans from the skipped event onward and the
   future-due matches get re-evaluated when they age in.

   processMatches now returns (skippedAny, halted, err) so the drain
   loop can tell apart "event fully drained" from "event had pending
   future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
   ActionKind. A replay-eligible kind with Mode != ModeEventDriven
   (e.g. ModeScanOnly via retention promotion) passed the check but
   then got silently ignored by router.Route, which gates dispatch
   on Mode == ModeEventDriven. Reject loudly with the typed error
   so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
   goroutines regardless of cfg.Workers. Add a Workers field to
   dailyrun.Config and gate the goroutine fan-out on a semaphore of
   that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
   snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
   shard. Mid-run Compile would let shards process different rule
   sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
   accepted a `now func() time.Time` and called it multiple times.
   DueTime comparisons would slip as the run wore on. Capture runNow
   once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
   broke out of the loop and returned nil, marking interrupted runs as
   successful. Return ctx.Err() instead so the caller propagates the
   interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
   accepted empty files, mismatched shard_ids, and hash slices shorter
   than 32 bytes (copy() would zero-pad). Each now returns a typed
   error so the run halts and an operator investigates rather than
   silently re-scanning from time zero or persisting a zero-padded
   hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
   guard in runShard was unreachable because drainShardEvents
   initialized lastOK := startTsNs and only ever raised it. Removed;
   the new cursor-advancement semantics handle the "no events
   processed" case implicitly.

Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
  actions with Days <= 0 don't compile to a CompiledAction (see
  weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
  empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
  gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
  all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
  the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
  _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
  the new validation rules.
- Existing process-matches tests reshaped for the
  (skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.
2026-05-11 18:07:17 -07:00
Chris Lu
46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.
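The stamp/read pair can be sketched as follows. This is a simplified model: `entry` stands in for the filer's entry type, and the real constant and parsing helper live in the s3api packages.

```go
package main

import (
	"fmt"
	"strconv"
)

// Key name follows the header quoted above.
const noncurrentSinceNsKey = "Seaweed-X-Amz-Noncurrent-Since-Ns"

type entry struct {
	Extended map[string][]byte
}

// markVersionNoncurrent stamps the demotion instant; both demotion
// sites pass the same nowNs, captured once per demotion.
func markVersionNoncurrent(e *entry, nowNs int64) {
	if e.Extended == nil {
		e.Extended = map[string][]byte{}
	}
	e.Extended[noncurrentSinceNsKey] = []byte(strconv.FormatInt(nowNs, 10))
}

// noncurrentSinceNs reads the stamp; missing/zero falls back to the
// legacy sibling-mtime derivation so pre-stamp entries keep working.
func noncurrentSinceNs(e *entry, siblingMtimeNs int64) int64 {
	if raw, ok := e.Extended[noncurrentSinceNsKey]; ok {
		if ns, err := strconv.ParseInt(string(raw), 10, 64); err == nil && ns > 0 {
			return ns
		}
	}
	return siblingMtimeNs
}

func main() {
	e := &entry{}
	markVersionNoncurrent(e, 1700000000000000000)
	fmt.Println(noncurrentSinceNs(e, 42)) // prints 1700000000000000000
}
```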

Prerequisite for the daily-replay lifecycle worker (Phase 2+).

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so
  two back-to-back calls can decrease if the wall clock steps
  backward — the prior test was exercising OS clock behavior rather
  than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clean break: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00
Chris Lu
9a70bbfcc6 feat(s3api): full-chunk gzip pass-through skips volume-side decompress (#9427)
Building on the io.Pipe streaming chunk copy: when a copy operation
covers an entire source chunk (the common case for Harbor's
part-size = chunk-size assemble pattern), ask the source volume for
compressed bytes via Accept-Encoding: gzip and forward them to the
destination as-is.

This trades a Range fetch (where the volume decompresses the chunk
internally to satisfy the byte range) for a full-chunk fetch that
returns whatever wire bytes the chunk is stored as. For gzipped
chunks the source volume avoids the decompression entirely; we never
allocate a chunk-sized decompress buffer.

Implementation: build the source GET directly instead of going
through ReadUrlAsStream, because that helper auto-decompresses gzip
responses (which would defeat the point). Trust the response's
Content-Encoding header over caller hints — for partial ranges the
volume always returns raw bytes regardless of how the chunk is
stored, so labeling those as gzip would corrupt subsequent reads.

End-to-end repro impact (512 MiB src, 6 parallel UploadPartCopy):
  + #9420/#9421/#9422       : 2236 MiB
  + io.Pipe streaming       : 1521 MiB
  + this commit             : 1149 MiB  (round 2 RSS, perfectly flat)

Round 3 now completes (was hitting volume-full before, since
chunks took up uncompressed space on disk; we now store the gzipped
chunks the volume gives us, which fit in the test's 8 GiB volume
budget).

Heap inuse_space (after force GC):
  before all: ~1.5 GiB
  this PR:    266 MiB

Volume-side bytes.Buffer.ReadFrom inuse:
  before:     611 MiB
  streaming:  571 MiB
  this PR:    297 MiB (now in destination-volume parseUpload's
                       size-hint decompression — separate
                       optimization opportunity for a hint header)
2026-05-10 14:55:59 -07:00
Chris Lu
4a04594826 feat(s3api): stream chunk copy via io.Pipe to cut peak working set (#9424)
* fix: cap pool retention so chunk-copy buffers don't hoard memory

Two pool-retention sites kept the runaway-RSS pattern in #6541 visible
even after #9420 and #9421:

* weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into
  sync.Pool regardless of how big it had grown. After a 64 MiB chunk
  upload through volume.PostHandler -> needle.ParseUpload, the pool
  hoarded a 64 MiB byte array per cached entry for the rest of the
  process's lifetime. Cap retention at 4 MiB; oversized buffers are
  dropped so GC can reclaim the backing array.

* weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer
  unset, so operation.upload_content fell back to the package-global
  valyala/bytebufferpool. That pool also retains high-water buffers
  forever, and concurrent UploadPartCopy filled it with one chunk-sized
  buffer per concurrent upload. Provide a fresh per-call bytes.Buffer
  pre-sized to chunk + multipart framing; it's GC'd as soon as the
  upload returns.

Tests:
- weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized
  buffers don't round-trip), the inverse (right-sized buffers do), and
  nil-safety.
- weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption
  and pin that BytesBuffer is always non-nil and pre-sized, and that
  each call gets a distinct buffer.

* feat(s3api): stream chunk copy via io.Pipe to cut peak working set

Final piece for #6541. The buffered chunk-copy path holds two
chunk-sized buffers per copy in flight (download buffer + multipart-
encoded upload buffer). Under concurrent UploadPartCopy that put a
floor on RSS at concurrency × 2 × chunk_size — about 768 MiB for the
6-way / 64 MiB Harbor-style assemble repro, even after the previous
pool/retention fixes.

Replace the buffered path with an io.Pipe between the source GET and
the destination POST: ReadUrlAsStream pumps data into the pipe via a
multipart.Writer, the http.Client reads from the pipe end and POSTs
the body. In-flight per copy is now ~32 KiB (pipe hand-off + http
buffers), regardless of chunk size.

The streaming path is gated by canStreamCopyChunk: only used when no
in-transit transformation is needed (no per-chunk CipherKey, no SSE).
SSE-C / SSE-KMS / SSE-S3 paths still go through the buffered path,
which already handles re-encryption correctly.

Benchmarks (Apple M4, httptest source/dest, B/op = bytes per copy):

  Buffered  1 MiB:   6.0 MB B/op,  443 MB/s
  Streamed  1 MiB:   374 KB B/op,  727 MB/s
  Buffered  8 MiB:    56 MB B/op,  559 MB/s
  Streamed  8 MiB:   379 KB B/op, 1138 MB/s
  Buffered 64 MiB:   455 MB B/op,  718 MB/s
  Streamed 64 MiB:   304 KB B/op, 1387 MB/s

End-to-end repro (512 MiB src, 6 parallel UploadPartCopy):
  pre-#9420 RSS round 2: 3134 MiB
  + #9420/#9421/#9422  : 2236 MiB
  + this PR            : 1521 MiB
  heap inuse_space     :  350 MiB (was 1422 / 1187 MiB)
  HeapSys (MemStats)   : 1.74 GiB (was 2.49 GiB)

* review: surface shouldRetry, add int32 guard, drop redundant drains

Address review on PR 9424:

* coderabbit (HIGH, line 122): ReadUrlAsStream can set shouldRetry=true
  with readErr=nil. Before this fix, that fell through to mw.Close()
  and the destination POST succeeded against a possibly-truncated
  multipart body. Mirror downloadChunkData's explicit check and
  surface shouldRetry as a producer error so the dst POST aborts.
* gemini (line 98): chunk size is int64 but ReadUrlAsStream takes int.
  Reject sizes above MaxInt32 up front so the int(size) cast can't
  truncate negative on 32-bit platforms — same guard downloadChunkData
  uses.
* gemini (line 151): util_http.CloseResponse already drains the body
  (io.Copy(io.Discard, ...) inside the helper) before closing, so the
  manual io.Copy drains we added are redundant. Drop them.

* review: cancel source GET when destination POST fails

Address coderabbit review (line 165 / second pass on PR 9424): when
the POST leg fails or returns an error status, closing pipeReader
only fails the producer's *writes*. ReadUrlAsStream's own read loop
runs under the parent ctx, so it keeps draining the source body in
the background until EOF — wasting source-volume bandwidth and CPU
on a copy that's already failed.

Wrap streamCopyChunkRange in a child context cancelled on return.
ReadUrlAsStream checks ctx.Done() per 256 KiB tick, so the in-flight
read aborts on the next iteration once the function returns. The POST
also moves to streamCtx so the in-flight request can be cancelled the
same way if the producer fails first.

Defer-cancel runs after both legs return, so the success path still
sends EOF cleanly through pipeWriter.Close before cancellation.
2026-05-10 14:29:39 -07:00
Chris Lu
d8bbc1d855 fix: cap pool retention so chunk-copy buffers don't hoard memory (#9422)
Two pool-retention sites kept the runaway-RSS pattern in #6541 visible
even after #9420 and #9421:

* weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into
  sync.Pool regardless of how big it had grown. After a 64 MiB chunk
  upload through volume.PostHandler -> needle.ParseUpload, the pool
  hoarded a 64 MiB byte array per cached entry for the rest of the
  process's lifetime. Cap retention at 4 MiB; oversized buffers are
  dropped so GC can reclaim the backing array.

* weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer
  unset, so operation.upload_content fell back to the package-global
  valyala/bytebufferpool. That pool also retains high-water buffers
  forever, and concurrent UploadPartCopy filled it with one chunk-sized
  buffer per concurrent upload. Provide a fresh per-call bytes.Buffer
  pre-sized to chunk + multipart framing; it's GC'd as soon as the
  upload returns.

Tests:
- weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized
  buffers don't round-trip), the inverse (right-sized buffers do), and
  nil-safety.
- weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption
  and pin that BytesBuffer is always non-nil and pre-sized, and that
  each call gets a distinct buffer.
2026-05-10 13:34:25 -07:00
Chris Lu
926a8e9351 fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup (#9420)
* fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup

downloadChunkData accumulated the streamed chunk into a nil []byte via
`chunkData = append(chunkData, data...)`. ReadUrlAsStream pumps in 256 KiB
ticks, so a 64 MiB chunk grew the slice geometrically (256K → 512K →
1M → ... → 64M), allocating ~2x the chunk size for every transferred
byte. Combined with the 4-way per-request concurrency and any number of
in-flight UploadPartCopy calls (Harbor multipart assemble), this is what
produces the runaway-RSS pattern reported in #6541.

Pre-size the receive buffer to the known sizeInt so the callback fills
in place. Add a regression test that downloads a 16 MiB chunk through
httptest and asserts TotalAlloc stays under 1.5x the chunk size — the
pre-fix code allocates ~5x and trips the bound.
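The fix pattern can be sketched as below, with the streaming source reduced to a tick callback (names are illustrative, not the s3api functions):

```go
package main

import "fmt"

// Appending 256 KiB ticks into a nil slice reallocates geometrically
// (~2x the chunk size in total allocations); pre-sizing to the known
// size makes every tick an in-place copy with one up-front allocation.
func receivePreSized(sizeInt int, ticks func(yield func([]byte))) []byte {
	buf := make([]byte, 0, sizeInt) // single allocation
	ticks(func(data []byte) {
		buf = append(buf, data...) // fills in place, never regrows
	})
	return buf
}

func main() {
	const tick = 256 << 10
	payload := make([]byte, tick)
	got := receivePreSized(4*tick, func(yield func([]byte)) {
		for i := 0; i < 4; i++ {
			yield(payload)
		}
	})
	fmt.Println(len(got) == 4*tick, cap(got) == 4*tick) // prints true true
}
```

Because the capacity is never exceeded, append never triggers growSlice, which is exactly the bytes.growSlice hot spot the pprof output above attributes ~650 MiB to.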

Local repro (weed 4.23, 6 parallel UploadPartCopy on a 512 MiB source):

  before:  baseline 96 MiB → peak 3124 MiB, never reclaimed
  pprof:   650 MiB inuse in bytes.growSlice + 461 MiB in
           downloadChunkData.func1

* test(s3api): assert downloaded chunk content matches payload

Address PR review feedback: the allocation-bound check alone would still
pass if a future regression silently truncated or corrupted the chunk.
Compare the returned bytes against the source payload (after the
TotalAlloc measurement window so bytes.Equal doesn't pollute it).
2026-05-10 12:08:06 -07:00
Chris Lu
82648cca53 test(s3/lifecycle/engine): pin delay-group dedup across buckets (#9418)
Compile a 100-bucket × 5-rule snapshot where the five Days values
include duplicates (1, 1, 7, 7, 30) and assert:

- snap.actions has 500 entries — every (bucket, rule) compiles to its
  own ActionKey, no collapse.
- snap.originalDelayGroups has exactly 3 entries — the routing index
  is keyed by Delay, so same-day rules across all buckets share a
  group. This is the property that lets the dispatcher index by
  delay group rather than per-rule.
- Per-group key count = (rules with that day) × buckets, so every
  action is reachable from its group entry.
2026-05-10 10:36:54 -07:00
Chris Lu
1b1d4aa814 refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate (#9417)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.

Helper added: putNewerNoncurrentLifecycle.

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step sequence.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.
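The strict-inequality gate the test pins can be stated in two lines (the function name here is illustrative; the real check lives inside filterAllows):

```go
package main

import "fmt"

// filterAllows' size gate rejects when ev.Size <= threshold, so an
// object exactly at the threshold survives and only strictly larger
// objects expire. A > vs >= flip breaks the boundary case.
func sizeGreaterThanAllows(size, threshold int64) bool {
	return size > threshold
}

func main() {
	fmt.Println(sizeGreaterThanAllows(100, 100)) // false: boundary survives
	fmt.Println(sizeGreaterThanAllows(150, 100)) // true: larger expires
}
```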

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry,
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

Generic err != nil also passes on transport/auth/timeouts, letting
the test go green without proving the lifecycle action actually
fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust regardless of what ListObjectVersions returned.

* refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate

The metadata-only delete decision (entry.Attributes.TtlSec > 0) was
inlined in lifecycleDispatch with no direct test. Lift it into a
named predicate with the rationale comment moved onto the function
and pin the four edge cases: nil entry, nil attributes, TtlSec=0,
TtlSec>0, plus a defensive check that TtlSec<0 doesn't turn the
metadata-only path on.
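The extracted predicate and its edge cases can be sketched with stand-in entry types (field names here are assumptions modeled on the filer's Attributes.TtlSec, not the actual structs):

```go
package main

import "fmt"

// Minimal stand-ins for the filer entry types; shapes are assumptions.
type Attributes struct{ TtlSec int32 }
type Entry struct{ Attributes *Attributes }

// entryUsesMetadataOnlyDelete sketches the predicate: only a strictly
// positive TtlSec selects the metadata-only delete path, and nil
// entry/attributes are treated as "no TTL" rather than dereferenced.
func entryUsesMetadataOnlyDelete(entry *Entry) bool {
	if entry == nil || entry.Attributes == nil {
		return false
	}
	return entry.Attributes.TtlSec > 0
}

func main() {
	fmt.Println(entryUsesMetadataOnlyDelete(nil))      // nil entry: false
	fmt.Println(entryUsesMetadataOnlyDelete(&Entry{})) // nil attributes: false
	fmt.Println(entryUsesMetadataOnlyDelete(&Entry{Attributes: &Attributes{TtlSec: -1}}))  // defensive: false
	fmt.Println(entryUsesMetadataOnlyDelete(&Entry{Attributes: &Attributes{TtlSec: 300}})) // true
}
```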
2026-05-10 09:39:05 -07:00
Chris Lu
c7b01c72b2 test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

A local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), which wasn't enough to tell whether
v2 was deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — it routes keys off .uploads/<id>/
directory events rather than regular object events, and the dispatcher
uses a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.

Helper added: putNewerNoncurrentLifecycle.
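The rank-based selection described above can be sketched as: sort the noncurrent versions newest-first, keep the N most recent, expire the rest. Types and the expireBeyondNewest name are illustrative, not the worker's actual code:

```go
package main

import (
	"fmt"
	"sort"
)

// version is a hypothetical stand-in: ModTimeNs orders versions newest-first.
type version struct {
	ID        string
	ModTimeNs int64
}

// expireBeyondNewest implements the NewerNoncurrentVersions=N selection:
// keep the N most recent noncurrent versions, return the rest as
// expiration candidates. (The real path additionally applies the
// NoncurrentDays age floor before a candidate is dispatched.)
func expireBeyondNewest(noncurrent []version, keep int) []version {
	sorted := append([]version(nil), noncurrent...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ModTimeNs > sorted[j].ModTimeNs })
	if keep >= len(sorted) {
		return nil
	}
	return sorted[keep:]
}

func main() {
	vs := []version{{"v1", 1}, {"v2", 2}, {"v3", 3}} // v4 is current, not listed
	for _, v := range expireBeyondNewest(vs, 1) {
		fmt.Println(v.ID) // v2 and v1 expire; v3 (newest noncurrent) survives
	}
}
```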

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step sequence.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(e.g. a slice-bounds bug when the first listing returns zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry,
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

Generic err != nil also passes on transport/auth/timeouts, letting
the test go green without proving the lifecycle action actually
fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust regardless of what ListObjectVersions returned.
2026-05-10 09:30:50 -07:00
Chris Lu
2840980c7d test(s3/lifecycle): final unit-test cleanup before integration suite (#9414)
* test(s3/lifecycle): final unit-test cleanup before integration suite

Closes the residual coverage gaps in the lifecycle packages so the
next track (Layer 3 integration tests) starts from a clean baseline.
Big coverage lifts: lifecycletest 88.7→100.0, engine 81.7→95.1,
s3lifecycle 87.8→95.0, dispatcher 60.3→67.6, router 86.1→88.8,
bootstrap 90.7→92.6. Remaining sub-100% surfaces (reader.Run,
Pipeline.Run, scheduler.Run, multi-step bootstrap orchestration)
need a live filer and belong with the integration suite.

router/helpers_test.go (formerly #9409, now stale on master because
9410-9413 absorbed adjacent surface): direct tests for the pure
helpers Route exercises indirectly — successorModTimeFromContainer
(missing/empty/non-numeric/non-positive/positive round-trip),
logicalKeyFromVersionPath (extracts logical, rejects non-.versions
parent / root-level / no-slashes / bare container), isVersionsContainerKey
(table over container forms), isVersionFolderPath (table over child
forms), isDeleteMarkerEntry (only literal "true" matches),
extractTags (nil/empty, AmzObjectTagging-prefixed only, no-tag
returns nil), hasActiveEventDrivenAction (matches only active+
event-driven, scan-only rejected, unknown skipped). Plus engine
Snapshot accessors: BucketVersioned (compiled flag, unknown bucket
false), BucketActionKeys (full list, unknown nil), Action (unknown
nil), AllActions (every kind), SnapshotID (strictly monotonic).

s3lifecycle/final_cleanup_test.go: ActionKind.String default branch
(unspecified + future-unknown render "unspecified" rather than
empty); HashExtended direct from the lifecycle package (covers it
in this package's coverage report, not just the s3api one) including
nil/empty produces no bytes and identical content hashes the same.
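An order-independent extended-attribute hash with those properties can be sketched as follows; the function name and hash choice are illustrative, not the lifecycle package's actual HashExtended:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashExtended hashes a string→bytes map order-independently: keys are
// sorted before hashing, so two maps with identical content (but different
// insertion/iteration order) hash the same, and nil/empty input
// contributes no bytes.
func hashExtended(extended map[string][]byte) string {
	if len(extended) == 0 {
		return ""
	}
	keys := make([]string, 0, len(extended))
	for k := range extended {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write(extended[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := map[string][]byte{"x": []byte("1"), "y": []byte("2"), "z": []byte("3")}
	b := map[string][]byte{"z": []byte("3"), "x": []byte("1"), "y": []byte("2")}
	fmt.Println(hashExtended(a) == hashExtended(b)) // true: content-equal maps agree
	fmt.Println(hashExtended(nil) == "")            // true: nil produces no bytes
}
```

A 3-key map is the minimal case where the sort actually does work, which is exactly why the review asked for it.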

bootstrap/has_prefix_test.go: thin wrapper around strings.HasPrefix
exported by the package; trivial, but it sat at 0% coverage before
this change.

lifecycletest/eventbuilder_old_entry_test.go: pins the OldEntry
fall-through path on Delete events for WithModTime / WithTtlSec /
WithVersionID / WithExtended / WithChunks (existing tests cover
Create events that hit NewEntry only). Adds WithBootstrapVersion
across all three event shapes. Defensive: every With* option is a
no-op on a degenerate event with neither entry populated.

* test(s3/lifecycle): address coderabbit nitpicks on final cleanup

- eventbuilder empty-event test now exercises WithBootstrapVersion
  too, with an honest claim about its scope: it targets the event
  itself (not an entry), so it sets BootstrapVersion regardless of
  whether NewEntry/OldEntry are populated. Renamed the test from
  AllAreNoOpsOnEmptyEvent to NoPanicOnEmptyEvent since the original
  name overstated the contract.
- HashExtended stability check uses a 3-key map with different
  literal orders so the helper's sort path actually does work; a
  single-key check can't catch an iteration-order regression.
- HasPrefix test refactored to table-driven so adding a new edge
  case is one row instead of two assertion lines.
2026-05-09 22:32:49 -07:00
Chris Lu
b740e22e63 test(s3/lifecycle): bundle dispatcher + engine edge-case coverage (#9413)
* test(s3/lifecycle): bundle dispatcher + engine edge-case coverage

Two-package bundle covering uncovered branches in production code that
the existing happy-path tests don't reach. Dispatcher 58.1% → 60.2%
and engine 81.0% → 81.7% (engine lift modest because most branches
were already hit; the nil-rule defensive case is otherwise unreachable
from a Compile flow).

dispatcher (4 tests):
- FilerPersister.Load with nil Store errors with a "nil Store"
  message rather than panicking at the Read call.
- FilerPersister.Save with nil Store same.
- FilerPersister.Load with a non-NotFound transport error wraps the
  shard ID into the message AND keeps the underlying error
  recoverable via errors.Is.
- FilerPersister.Load with successful empty []byte returns an empty
  map, not a JSON-decode error — pinning that an existing-but-empty
  cursor file is treated as "no entries".
- Tick initializes the retries map on first call without panic so a
  freshly-constructed Dispatcher works.
- Tick with already-canceled ctx re-queues the popped Match, returns
  zero, and never invokes the LifecycleDelete client — the Match
  must not be lost across worker restart.

engine (4 tests):
- rulePredicateSensitive(nil) returns false rather than panicking on
  the FilterTags dereference. The non-nil paths run through Compile,
  but a defensive nil-rule arrival isn't reachable that way.
- rule with no FilterTags / empty FilterTags map returns false (the
  check is len(FilterTags) > 0, so empty must classify as
  non-sensitive — pinning catches a flipped >= comparison).
- rule with a populated FilterTags returns true.

* fix(s3/lifecycle): Tick must requeue every drained Match on shutdown

Per codex review on #9413: Tick called Schedule.Drain to pop ALL due
matches at once, then iterated. If ctx canceled mid-loop, only the
current Match was re-added — everything past that index was silently
lost across the worker restart. With N due matches, up to N-1 were
dropped.

Fix: on cancellation, re-add due[i:] (current + remaining) before
returning. Matches already dispatched (due[:i]) stay processed; the
schedule is left exactly as it would be if Drain had returned only
the dispatched prefix.

Strengthen the existing test to enqueue three due matches and assert
sched.Len()==3 after a pre-canceled Tick. Pre-fix the test would have
seen Len()==1 because only the first popped Match was re-added.
2026-05-09 22:02:17 -07:00
Chris Lu
ad77362be3 test(s3/lifecycle): bundle reader + scheduler helper coverage (#9412)
* test(s3/lifecycle): bundle reader + scheduler helper coverage

Bundles direct tests for previously-uncovered helpers in two
packages. Bumps reader 73.2% → 79.2% and scheduler 71.6% → 73.6%.

Reader Event predicates (4):
- IsCreate: NewEntry-only event classifies as create
- IsDelete: OldEntry-only event classifies as delete
- both entries (update): neither IsCreate nor IsDelete (strict
  exclusivity so router routes updates through their own path)
- no entries (degenerate): neither (so a metadata-only filer event
  with no payload doesn't trigger spurious dispatches)

Reader LogStartup (4): exercises both shape branches (single-shard
ShardID vs ShardPredicate), the explicit-StartTsNs override path, and
the Cursor.MinTsNs fallback when StartTsNs=0. Side-effect-only
function; tests pin compile-time shape and visit each code path.

Scheduler pipelineFanout.InjectEvent (5):
- nil event silently absorbed (no follow-up panic in receiving
  pipeline)
- unknown shard returns nil (forward-compat for future shard-mapping
  gaps)
- known shard succeeds
- ctx cancellation propagates when underlying pipeline's buffer fills
- routes to the correct pipeline among multiple, with cross-pipeline
  isolation proven via per-pipeline buffer state

* test(s3/lifecycle): rename canceled to canceledCtx in fanout test

Per gemini review on #9412: a bare 'canceled' identifier reads like
a bool. Rename to canceledCtx so the type is obvious at the call site.
2026-05-09 22:02:09 -07:00
Chris Lu
7996dc1d67 test(s3/lifecycle): bundle dispatcher pipeline helper coverage (#9411)
* test(s3/lifecycle): bundle dispatcher pipeline helper coverage

Five untested helpers in dispatcher/pipeline.go and one accessor in
dispatcher.go that previously sat at 0% coverage. Bumps the dispatcher
package from 58.8% → 64.7%.

observeScheduleDepth: gauge tracks Schedule.Len; pinned for empty,
populated, and post-Drain states with a unique shard label so the
test doesn't bleed into other parallel tests sharing the global
counter.

ensureEventsChan: positive EventBuffer sizes the channel; zero and
negative fall back to defaultEventBuffer so a misconfigured EventBuffer
doesn't yield a zero-cap channel that blocks every InjectEvent;
sync.Once contract holds under 32-goroutine concurrent invocation.

InjectEvent: delivers the same pointer to p.events (no defensive
copy); a canceled ctx propagates context.Canceled rather than
silently dropping the event; allocates the channel via
ensureEventsChan so a caller can inject before Run starts.

isCtxShutdown: nil → false; context.Canceled / DeadlineExceeded →
true; gRPC-wrapped Canceled / DeadlineExceeded codes → true (so
shutdown isn't misclassified as transport failure when filer/S3
RPCs unwrap as status); other errors (Unavailable, Internal, plain)
→ false.

cursorFileName: shard 0 / 1 / 9 / 10 / 15 all pad to two digits so
on-disk filenames stay sorted by shard ID — pinned so a refactor that
drops %02d doesn't break existing cursor files.
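The %02d contract is easy to show concretely; the filename pattern below is illustrative, not the worker's actual one:

```go
package main

import "fmt"

// cursorFileName zero-pads the shard so on-disk cursor files sort
// lexically in shard order (shard 2 must not sort after shard 10).
func cursorFileName(shard int) string {
	return fmt.Sprintf("cursor-%02d", shard)
}

func main() {
	fmt.Println(cursorFileName(0))  // cursor-00
	fmt.Println(cursorFileName(9))  // cursor-09
	fmt.Println(cursorFileName(15)) // cursor-15
	fmt.Println(cursorFileName(2) < cursorFileName(10)) // true: lexical == numeric order
}
```

Dropping the padding would make "cursor-10" sort before "cursor-2" and break the existing on-disk names.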

* test(s3/lifecycle): drop ScheduleDepthGauge label series at test end

Per gemini review on #9411: even with a unique shard label, the gauge
series persists in the global Prometheus registry across test runs in
the same process. Add a defer to DeleteLabelValues so the registry
stays clean.
2026-05-09 22:02:01 -07:00
Chris Lu
ca95d33092 test(s3/lifecycle): bundle dispatcher + engine accessor coverage (#9410)
* test(s3/lifecycle): bundle dispatcher + engine accessor coverage

Two-package bundle covering pure helpers and snapshot read-side
accessors that the router and dispatcher reach for at runtime. None
were directly tested; regressions previously surfaced only as
downstream Tick / Match / Compile failures.

dispatcher (10 tests):
- keyOf: derives every retryKey field from the Match; equal Match
  values produce equal keys (so the second dispatch hits the first's
  retry counter); distinct VersionIDs and ActionKinds produce
  distinct keys (so a noisy version can't starve a healthy one,
  and two kinds on the same object don't share a budget).
- budget(): configured value when set; defaultRetryBudget when zero
  or negative — pins the >0 guard against a flipped comparison.
- backoff(): same pattern as budget for RetryBackoff.
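The budget/backoff guard pattern can be sketched as follows; field and default names are assumptions modeled on the description above:

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultRetryBudget  = 3                // assumed default for the sketch
	defaultRetryBackoff = 30 * time.Second // assumed default for the sketch
)

type dispatcher struct {
	RetryBudget  int
	RetryBackoff time.Duration
}

// budget applies the >0 guard: only strictly positive configured values
// win; zero AND negative both fall back to the default. A flipped >=
// comparison would let 0 through as a real (broken) budget.
func (d *dispatcher) budget() int {
	if d.RetryBudget > 0 {
		return d.RetryBudget
	}
	return defaultRetryBudget
}

// backoff follows the same pattern for the retry backoff duration.
func (d *dispatcher) backoff() time.Duration {
	if d.RetryBackoff > 0 {
		return d.RetryBackoff
	}
	return defaultRetryBackoff
}

func main() {
	fmt.Println((&dispatcher{RetryBudget: 5}).budget())  // 5
	fmt.Println((&dispatcher{RetryBudget: 0}).budget())  // 3 (default)
	fmt.Println((&dispatcher{RetryBudget: -1}).budget()) // 3 (default)
	fmt.Println((&dispatcher{}).backoff())               // 30s (default)
}
```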

engine snapshot accessors (8 tests):
- OriginalDelayGroups exposes the compiled per-delay groups; rules
  with multiple kinds at different cadences land in distinct entries;
  scan-only actions don't leak into delay groups so the dispatcher
  doesn't try to drive them event-driven.
- PredicateActions populated for tag-sensitive rules, empty for non-
  tag-sensitive ones (so MatchPredicateChange doesn't route
  irrelevant kinds).
- DateActions surfaces ExpirationDate verbatim for date kinds; empty
  for non-date rules.
- MarkActive on an unknown key is a no-op (durable bootstrap-complete
  write races a recompile that drops the rule; panic here would crash
  the worker).
- MarkActive flips a fresh-no-prior-state action from inactive to
  active.
- BucketActionKeys covers every kind RuleActionKinds reports.

* test(s3/lifecycle): strengthen snapshot accessor content assertions

Per gemini review on #9410: assertions previously only checked counts
and non-empty status. Verify the specific ActionKeys land where
expected so an indexing regression that produces the right number of
items with wrong kinds gets caught.

OriginalDelayGroups: each delay group's slice asserts.Contains the
specific (bucket, rule_hash, kind) ActionKey instead of just
NotEmpty.

PredicateActions: assert.Contains the expected key instead of just
NotEmpty.

BucketActionKeys: every key.Bucket must equal the test bucket (catches
cross-bucket leak), and ElementsMatch pins kinds against
RuleActionKinds.
2026-05-09 22:01:54 -07:00
Chris Lu
0955d1aa08 test(s3/lifecycle): direct prefixMatches + filterAllows coverage (#9408)
Both helpers were exercised indirectly through MatchOriginalWrite /
MatchPath; pinning them directly catches a regression at the helper
level so a Match-test failure isn't the first signal of a broken
filter.

prefixMatches: empty prefix fast path; exact-prefix match; non-match
rejection; path shorter than prefix.

filterAllows: no-filter accepts any event; FilterSizeGreaterThan is
strictly > (boundary value rejected); FilterSizeLessThan is strictly
<; zero-size thresholds mean "not set" (must let any size through —
a regression treating 0 as a real threshold would reject everything);
required tag present accepts; missing key, empty tags map, wrong
value, and missing-among-multiple all reject; size + tag filters are
AND'd so either failing rejects.
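
The filterAllows semantics pinned above can be sketched as follows. This is a minimal illustration with stand-in types, not the real s3lifecycle Filter: zero-size thresholds mean "not set", both size comparisons are strict, and size plus tag conditions are AND'd.

```go
package main

import "fmt"

// Filter is a simplified stand-in for the lifecycle rule filter.
type Filter struct {
	SizeGreaterThan int64             // 0 = not set
	SizeLessThan    int64             // 0 = not set
	Tags            map[string]string // every listed tag must match exactly
}

func filterAllows(f *Filter, size int64, tags map[string]string) bool {
	if f == nil {
		return true // no filter accepts any event
	}
	if f.SizeGreaterThan > 0 && size <= f.SizeGreaterThan {
		return false // strictly >, so the boundary value is rejected
	}
	if f.SizeLessThan > 0 && size >= f.SizeLessThan {
		return false // strictly <
	}
	for k, v := range f.Tags {
		if got, ok := tags[k]; !ok || got != v {
			return false // missing key and wrong value both reject
		}
	}
	return true
}

func main() {
	gt := &Filter{SizeGreaterThan: 100}
	fmt.Println(filterAllows(gt, 100, nil)) // boundary rejected: false
	fmt.Println(filterAllows(gt, 101, nil)) // true
	zero := &Filter{}                       // zero thresholds = not set
	fmt.Println(filterAllows(zero, 0, nil)) // true
}
```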
2026-05-09 20:47:35 -07:00
Chris Lu
edbe7ab140 test(s3/lifecycle): meta-log Event builder + monotonic clock fixture (#9406)
* test(s3/lifecycle): meta-log Event builder + monotonic clock fixture

Several test files build *reader.Event ad-hoc; consolidate the common
shape into the lifecycletest package as task #12 spec calls out
("fixture meta-log generator"). New tests using the builder don't
have to thread Mtime / ShardID / leaf-name semantics by hand, and
existing helpers can migrate over time without churning this PR.

NewCreate / NewDelete / NewUpdate cover the three event shapes;
WithSize / WithModTime / WithTtlSec / WithVersionID / WithExtended /
WithChunks / WithBootstrapVersion / WithShardID compose deterministic
overrides. ShardID defaults to s3lifecycle.ShardID(bucket, key) so
events route through the same shard the production reader would.

MetaLogClock issues monotonic timestamps with a configurable step
(default 1s); concurrent-safe so fan-out fixtures don't have to lock
externally.

15 unit tests pin every option, the IsCreate/IsDelete/IsUpdate
discriminators, leaf-name extraction for nested keys, ShardID
derivation, option-ordering semantics, the concurrent clock contract
under -race, and a Peek-doesn't-advance check.

* test(s3/lifecycle): address review comments on event builder

- leafOf strips trailing slashes before splitting so directory-key
  fixtures (e.g. "folder/") get the slashless leaf "folder" — pre-fix
  it returned "" which would break router tests for directory markers.
- NewUpdate now seeds OldEntry.Attributes.Mtime with the event ts
  (matching NewDelete), so a downstream router that compares mtimes
  doesn't see a synthetic 1970 epoch on the pre-update state.
- New WithOldSize / WithOldChunks / WithOldModTime options let Update
  events configure pre-update state independently. The unprefixed
variants still target NewEntry on Update events; the WithOld*
  options are no-ops on Create (no OldEntry to mutate) and never bleed
  into NewEntry.

5 new tests pin: directory-key + multi-slash leaf extraction; OldEntry
mtime default on Update; the WithOld* targeting + Create-event
no-bleed contract.
2026-05-09 20:47:27 -07:00
Chris Lu
1aa55f5bf9 test(s3/lifecycle): direct decideMode + RuleMode.String coverage (#9405)
Compile tests cover decideMode indirectly; these direct tests pin
every branch so a regression in the classifier itself can't slip
behind a more elaborate Compile failure.

Pinned: nil rule and Disabled status both → Disabled; ExpirationDate
→ ScanAtDate without consulting retention; metaLogRetention=0 means
unbounded so any horizon → EventDriven; horizon within retention →
EventDriven; horizon exceeding retention → ScanOnly; bootstrapLookback
adds to horizon (not retention) so a near-threshold case is still
gated; zero horizon (rule field unset) skips the gate. RuleMode.String
must render the documented names for every variant; an unknown value
collapses to "unspecified" rather than empty or panic.
2026-05-09 20:35:34 -07:00
Chris Lu
619cb39827 test(s3/lifecycle): pin Schedule edge cases beyond happy path (Phase 15 slice) (#9403)
* test(s3/lifecycle): pin Schedule edge cases beyond happy path

Pre-existing schedule_test covered the happy path (ordered Drain,
empty schedule, duplicates, boundary-inclusive). Five new tests pin
edge cases the dispatcher relies on:

- Drain at a time before any DueTime returns nil and leaves the heap
  intact, so the dispatcher can't accidentally consume future-due
  matches.
- NextDue after partial Drain points to the next earliest, catching
  a Drain that forgets the heap invariant.
- Add after Drain bubbles a fresh earlier DueTime to the front, so
  late-arriving high-priority matches don't sit behind older ones.
- Drain returns Matches in ascending DueTime order regardless of
  insert order — explicit pinning of the documented contract.
- Concurrent Add+Drain across 64 goroutines under -race.

* test(s3/lifecycle): actually exercise Drain in AddAfterDrain test

Per coderabbit review on #9403: the test name promised "after Drain"
but the previous body only Add'd both items without ever calling
Drain in between. Insert a real Drain (popping "drain_me") before
the second Add, so the heap-invariant-across-Drain-then-Add path is
actually pinned. Bumps the after-Drain Match's DueTime out of the
way so the Drain in step 3 returns it deterministically.
2026-05-09 20:35:22 -07:00
Chris Lu
435ef7f94f test(s3/lifecycle): pin toProtoActionKind + toProtoIdentity converters (#9404)
test(s3/lifecycle): pin toProtoActionKind + toProtoIdentity

The two converters are the worker-side wire to LifecycleDelete; a
miss in toProtoActionKind sends ACTION_KIND_UNSPECIFIED that the
server rejects FATAL, and a wrong toProtoIdentity flips the CAS
witness so every dispatch comes back NOOP_RESOLVED with STALE_IDENTITY
even though the entry hasn't changed.

10 tests pin: every listed s3lifecycle.ActionKind maps to its proto
counterpart (table-driven, one subtest per kind); ActionKindUnspecified
and a future unknown kind both collapse to ACTION_KIND_UNSPECIFIED
(forward compat); nil EntryIdentity stays nil (preserves the no-CAS
sentinel); a populated identity copies every field; a zero-valued
identity still produces a non-nil output so the server treats it as
a real CAS witness rather than no-CAS.
2026-05-09 20:35:04 -07:00
Chris Lu
1350e681c9 test(s3/lifecycle): pin Pipeline.Run dependency + shard validation (Phase 15 slice) (#9402)
* test(s3/lifecycle): pin Pipeline.Run dependency + shard validation

Pre-existing TestPipelineRunRequiresDependencies only checked that an
empty Pipeline errors; it didn't pin which specific dependency must be
present. A refactor that makes one nilable accidentally would slip
through.

8 new tests pin every validation branch in Pipeline.Run: missing
Engine / Persister / Client / FilerClient each error with "missing
required dependency"; missing BucketsPath errors with its own
distinct message so operators can spot the missing wiring; ShardID =
-1 and ShardID = ShardCount each error with a range message (covers
the half-open [0, ShardCount) boundary so a < to <= refactor can't
introduce a
one-past-the-end shard); and a multi-shard config with one
out-of-range entry refuses the whole run rather than silently
disabling the rest.

* test(s3/lifecycle): refactor Pipeline.Run validation tests as table-driven

Per gemini review on #9402: collapse the eight per-branch tests into
TestPipelineRunValidation with a slice of (name, mutate, wantErr)
cases. Same coverage, ~30 fewer lines, idiomatic Go pattern that
makes adding a new validation case trivial.
2026-05-09 20:34:51 -07:00
Chris Lu
6f9668c20b test(s3/lifecycle): pin lifecycleDispatch validation early-returns (#9400)
Three pure-validation paths in lifecycleDispatch return BLOCKED before
any filer call; without coverage a refactor could let them fall
through to a real delete. ABORT_MPU at dispatch time is a defensive
catch (the route bypass should never happen, but if it does the
fallthrough must not become a default-case rm). Unknown ActionKind
gets the same treatment for forward-compatibility with new proto
values. Empty version_id on noncurrent / EXPIRED_DELETE_MARKER kinds
must be rejected before deleteSpecificObjectVersion is called, so a
malformed event can't silently delete the latest pointer.
2026-05-09 20:11:08 -07:00
Chris Lu
af2a359e45 feat(s3/lifecycle): metadata_only_total Prometheus counter (#9399)
Operator-visible signal for the metadata-only delete path landed in
PR 9390. Increment seaweedfs_s3_lifecycle_metadata_only_total{bucket,
rule_hash} after each successful unversioned or noncurrent / expired-
marker delete that took the skip-chunk path. Suspended-versioning
null delete is intentionally not counted: that path's nil err can
mean "deleted" or "NotFound", so a count there would over-report.
rule_hash is hex-encoded for label safety; nil bytes collapse to
"". DeleteBucketMetrics tears the new series down alongside the
existing lifecycle counters when a bucket is removed.
2026-05-09 20:02:26 -07:00
Chris Lu
284d37c3b6 test(s3/lifecycle): cover InMemoryPersister deep-copy contract (#9397)
* test(s3/lifecycle): cover InMemoryPersister deep-copy contract

8 tests pin the persister contract other lifecycle tests rely on for
cursor checkpointing: Load on an unknown shard returns an empty map
(not an error); Save then Load roundtrips; Save copies the input so
caller-side mutation doesn't bleed into stored state; Load returns a
copy so caller-side mutation of the snapshot doesn't bleed back; Save
replaces (not merges) prior state so stale resume points don't survive
restart; different shards stay isolated; saving an empty map clears
state; concurrent Save+Load is race-free under -race. A regression on
any of these silently corrupts downstream tests.
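
A persister honoring this contract copies on both Save and Load and replaces rather than merges; a minimal sketch with an assumed map[string]int64 cursor shape:

```go
package main

import (
	"fmt"
	"sync"
)

// InMemoryPersister stores per-shard cursor maps; Save and Load both
// copy so neither side can mutate the other's state.
type InMemoryPersister struct {
	mu     sync.Mutex
	shards map[int]map[string]int64
}

func NewInMemoryPersister() *InMemoryPersister {
	return &InMemoryPersister{shards: make(map[int]map[string]int64)}
}

func copyMap(in map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// Save replaces (not merges) the shard's state with a copy of cursors.
func (p *InMemoryPersister) Save(shard int, cursors map[string]int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.shards[shard] = copyMap(cursors)
}

// Load returns a copy; an unknown shard yields an empty map, not an error.
func (p *InMemoryPersister) Load(shard int) map[string]int64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	return copyMap(p.shards[shard]) // copyMap(nil) yields an empty map
}

func main() {
	p := NewInMemoryPersister()
	in := map[string]int64{"cursor": 42}
	p.Save(0, in)
	in["cursor"] = 99 // caller-side mutation must not bleed into stored state
	fmt.Println(p.Load(0)["cursor"]) // 42
}
```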

* test(s3/lifecycle): assert.NotContains for InMemoryPersister key absence

assert.Empty on a map[K]V index returns true when the value is the
zero value, which would mask a key that leaked through with int64(0).
Use assert.NotContains so the assertion fails on key presence
regardless of the stored value.
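
The pitfall is plain-Go map indexing: the index form yields the zero value whether the key is absent or stored as zero, so a value-only check can't catch a leaked key holding int64(0); the comma-ok form, which assert.NotContains effectively tests, can.

```go
package main

import "fmt"

func main() {
	m := map[string]int64{"leaked": 0}
	// Index form: absent key and stored zero are indistinguishable.
	fmt.Println(m["leaked"] == 0, m["absent"] == 0) // true true
	// Comma-ok form: presence is reported regardless of the value.
	_, ok := m["leaked"]
	fmt.Println(ok) // true
}
```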
2026-05-09 19:47:16 -07:00
Chris Lu
551e700e64 test(s3/lifecycle): cover scheduler configload surface (#9395)
* test(s3/lifecycle): cover scheduler configload surface

LoadCompileInputs is the bridge between the filer's bucket directory
and the engine snapshot the scheduler compiles every refresh; a missed
or misclassified bucket silently disables lifecycle for that prefix
until the next refresh. Tests pin: empty bucket dir, files at the
bucket level skipped, buckets without the lifecycle XML extended key
skipped, empty-bytes XML skipped, valid XML becomes a CompileInput,
versioning attr propagates to CompileInput.Versioned, malformed XML
surfaces as a ParseError without aborting the walk, and pagination
across the 1024 page boundary preserves bucket order.

Also covers the IsBucketVersioned (case + whitespace tolerance,
rejection of garbage values) and AllActivePriorStates (one entry per
(bucket, ruleHash, actionKind), bucket-keyed isolation) helpers.

* test(s3/lifecycle): tighten configload pagination boundary check

Switch the bucket-count check to require.Len so a regression that
returns the wrong number of buckets fails fast before the boundary
asserts panic on out-of-range index. Add explicit assertions on the
last entry of page 1 (b01023) and the first entry of page 2 (b01024)
so a pagination-loop bug that drops or duplicates the seam is caught
directly rather than only via the count check.
2026-05-09 19:46:40 -07:00
Chris Lu
6021a88606 test(s3/lifecycle): cover CompareVersionIds tiebreak surface (#9394)
* test(s3/lifecycle): cover CompareVersionIds tiebreak surface

13 tests pin every documented branch of the version-id comparator and
its helpers (isNewFormatVersionId, getVersionTimestamp): equality and
short-circuit paths, null sorting last, both-new-format with
smaller-is-newer ordering, both-old-format with larger-is-newer
ordering, mixed-format compared by parsed timestamp, mixed-format
with synthesized
equal timestamps, length / null / non-hex rejection, the strictly-
greater-than threshold boundary at 0x4000000000000000, and the
inverted-value invariant the comparator relies on. Getting any axis
wrong silently inverts retention rankings, which would resurrect
deleted versions or evict live ones.
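
The inverted-axis trap can be illustrated with a heavily reduced sketch: "null" sorts last, new-format ids order smaller-is-newer, old-format ids order larger-is-newer. The isNewFormat predicate here is a tagging stub (the real isNewFormatVersionId checks length, hex validity, and the 0x4000000000000000 threshold), and the mixed-format timestamp comparison is omitted entirely.

```go
package main

import (
	"fmt"
	"strings"
)

// isNewFormat is a stand-in predicate for illustration only.
func isNewFormat(id string) bool { return strings.HasPrefix(id, "n") }

// compareVersionIds returns negative when a is newer (sorts first).
func compareVersionIds(a, b string) int {
	switch {
	case a == b:
		return 0 // equality short-circuit
	case a == "null":
		return 1 // null sorts last
	case b == "null":
		return -1
	}
	aNew, bNew := isNewFormat(a), isNewFormat(b)
	if aNew && bNew {
		return strings.Compare(a, b) // new format: smaller id is newer
	}
	if !aNew && !bNew {
		return strings.Compare(b, a) // old format: larger id is newer
	}
	return 0 // mixed format: real code compares parsed timestamps
}

func main() {
	fmt.Println(compareVersionIds("n001", "n002")) // new format: n001 newer
	fmt.Println(compareVersionIds("005", "009"))   // old format: 009 newer
}
```

Flipping either axis (or the null rule) inverts "newest-first" ordering, which is exactly the retention-ranking corruption the tests guard against.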

* test(s3/lifecycle): use plain assert.Equal in mixed-format compare test

The previous local require := assert.New(t) shadowed testify's require
package while actually returning an assert.Assertions (continue-on-fail
semantics, not fail-fast). Use plain assert.Equal(t, ...) calls so the
behavior matches the variable's name and the rest of the file.
2026-05-09 19:03:31 -07:00
Chris Lu
7781eef429 test(s3/lifecycle): cover dispatcher filerSiblingLister surface (Phase 14 slice) (#9392)
* test(s3/lifecycle): cover dispatcher's filerSiblingLister surface

Tests pin the four routing-critical filer interactions on the
filerSiblingLister: Survivors (count cap, LoneEntry semantics,
null-version detection across regular files and directory-key
markers, error propagation in both list and lookup paths),
ListVersions (NotFound collapse, dir/missing-id filtering,
1024-page boundary, error propagation), LookupNullVersion (regular
file, explicit-null flag, directory-key marker accept, plain-dir
reject, NotFound collapse, error propagation), and LookupVersion
(empty version-id no-op, v_ prefix, NotFound collapse, error
propagation).

The fake SeaweedFilerClient mirrors the real filer's NotFound
shape — gRPC succeeds at stream creation and the first Recv()
surfaces filer_pb.ErrNotFound — which is what the lister's
errors.Is check depends on. NewFullPath strips a trailing slash
before splitting so directory-key markers are stored under their
slashless Name.

* test(s3/lifecycle): gofmt sibling_lister_test.go

Trailing comment alignment.
2026-05-09 18:55:50 -07:00
Chris Lu
8cf42a5abb test(s3/lifecycle): assert per-goroutine errors in fake-server concurrent test (#9393)
test(s3/lifecycle): assert per-goroutine errors in concurrent fake test

The previous TestFake_ConcurrentCallsSerializeWithoutDeadlock dropped
the err return from each LifecycleDelete call, so a regression in the
concurrent path could pass the length-only assertion. Capture each
err on a buffered channel and require.NoError after wg.Wait().
2026-05-09 18:54:15 -07:00
Chris Lu
ddfb219ec3 test(s3/lifecycle): fake LifecycleDelete server (Phase 12 slice) (#9391)
* test(s3/lifecycle): fake LifecycleDelete server for component tests

A reusable double for SeaweedS3LifecycleInternalServer with per-key
FIFO outcome queues, a fallback Default, and recorded request capture.
Tests of the worker pipeline that need to hit the proto boundary can
queue up DONE/NOOP/RETRY/FATAL/SKIPPED_OBJECT_LOCK responses per
(bucket, objectPath, versionId) and assert dispatch order against
Recorded(). SetError flips the server into transport-failure mode
without polluting the request log.

* test(s3/lifecycle): use struct map key for FakeLifecycleServer queues

Bucket / object path / version-id are user-supplied strings that can
contain "/" or "@", which would collide if the queue map were keyed by
"<bucket>/<object>@<version>". Switch to a struct key so the
components stay separate.
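
The collision the struct key prevents is easy to demonstrate; field names here are illustrative, not the fake's real ones.

```go
package main

import "fmt"

// queueKey keeps the components as separate fields so "/" and "@"
// inside any one of them can't fabricate a match.
type queueKey struct {
	Bucket     string
	ObjectPath string
	VersionID  string
}

// stringKey is the collision-prone scheme the fix replaced.
func stringKey(bucket, object, version string) string {
	return bucket + "/" + object + "@" + version
}

func main() {
	// Two distinct (bucket, object) pairs collide under string joining
	// because "/" is legal inside both components...
	fmt.Println(stringKey("a/b", "c", "") == stringKey("a", "b/c", "")) // true
	// ...but stay distinct as comparable struct keys.
	fmt.Println(queueKey{"a/b", "c", ""} == queueKey{"a", "b/c", ""}) // false
}
```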

* test(s3/lifecycle): deep-copy recorded LifecycleDelete requests

Tests that mutate a Recorded() entry — or a request pointer they
already passed in — were able to corrupt the fake's bookkeeping
because the slice carried shared pointers. Clone with proto.Clone at
both record and read time so the fake holds an independent snapshot
of every arriving request and hands callers an independent snapshot
back. Tightened TestFake_VersionIDPartOfKey error checks while there.
2026-05-09 18:38:52 -07:00
Chris Lu
bb0c7c779f feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0 (Phase 2b) (#9390)
* refactor(s3): thread metadataOnly into delete helpers

Add a metadataOnly bool to deleteUnversionedObjectWithClient and
deleteSpecificObjectVersion. When true the helper sends IsDeleteData=
false to the filer's DeleteEntry RPC so per-chunk DeleteFile RPCs are
skipped — the volume server reclaims chunks on its own at TTL drop.
Non-lifecycle callers (DELETE handlers, batch delete) pass false to
preserve today's eager-chunk-delete behavior; only the lifecycle
handler in the next commit will pass true.

* feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0

Per-write TTL stamping (PR 9377) sets Attributes.TtlSec on every
lifecycle-fitting entry. When the live entry the LifecycleDelete
handler fetched carries TtlSec > 0 the volume server is guaranteed
to reclaim chunks at TTL drop, so the filer can skip per-chunk
DeleteFile RPCs and just remove the entry record. lifecycleDispatch
now computes metadataOnly from the live entry and threads it through
the unversioned, suspended-null, and noncurrent/expired-marker delete
paths. createDeleteMarker is unaffected — it creates a marker, never
deletes chunks.
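
The decision itself is small; a sketch with stand-in entry types (the real code reads Attributes.TtlSec off the live filer_pb.Entry and sets IsDeleteData on the DeleteEntry request):

```go
package main

import "fmt"

// Minimal stand-ins for the filer entry shape.
type Attributes struct{ TtlSec int32 }
type Entry struct{ Attributes *Attributes }

// metadataOnly: when the live entry carries TtlSec > 0 the volume
// server reclaims chunks on its own at TTL drop, so per-chunk
// DeleteFile RPCs can be skipped.
func metadataOnly(live *Entry) bool {
	return live != nil && live.Attributes != nil && live.Attributes.TtlSec > 0
}

func main() {
	ttl := &Entry{Attributes: &Attributes{TtlSec: 3600}}
	noTTL := &Entry{Attributes: &Attributes{}}
	fmt.Println(metadataOnly(ttl), metadataOnly(noTTL), metadataOnly(nil))
	// A caller would then send: req.IsDeleteData = !metadataOnly(live)
}
```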
2026-05-09 18:38:38 -07:00
Chris Lu
255e9cd0f7 test(s3/lifecycle): cover reader cursor + Run validation contracts (#9389)
* test(s3/lifecycle): cover reader cursor + Run validation contracts

Layer 2 tests pinning four reader-package contracts the dispatcher
pipeline depends on: MinTsNs anchors at frozen positions, Snapshot
returns a deep copy in both directions, Restore replaces (not merges),
and Run validates ShardID/Events/BucketsPath before subscribing.

* test(s3/lifecycle): tighten cursor composition assertions

Snapshot deep-copy: also assert cursor doesn't see keys added to the
returned map. Restore replace: freeze before second Restore and assert
IsFrozen returns false after, pinning the contract that Restore wipes
frozen state alongside the value map. Run validation: bound the call
with a 5s context timeout so a regression that lets Run reach the nil
client surfaces as a failure instead of a hang.
2026-05-09 14:32:11 -07:00
Chris Lu
aa280443e7 test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher (#9387)
* test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher

The existing dispatcher unit tests cover individual outcomes
(DONE / RETRY_LATER / BLOCKED / etc.) on a single shard, and
pipeline_test.go has only one end-to-end happy-path assertion.
Multi-shard composition — the contract Pipeline.Run wires up at
runtime — was untested at the component level.

Add four Layer 2 tests in dispatcher/multi_shard_test.go:

  Two events for two shards land in different schedules, dispatch
  independently, and each cursor advances only for its own event
  (no cross-contamination on the action-key map).

  A poison event on shard 0 returns BLOCKED and freezes shard 0's
  cursor; shard 1's normal event continues to dispatch and its
  cursor advances. Per-shard isolation contract.

  Save/Load round-trips a per-shard cursor snapshot through the
  Persister: a fresh dispatcher restores the same TsNs map. Pins
  the contract Pipeline.Run drives on the checkpoint ticker.

  RETRY_LATER respects RetryBackoff against the wall clock — a
  Tick within the backoff window doesn't re-dispatch; a Tick past
  the new DueTime does. Guards against premature retries from
  refresh ticks landing inside the backoff.

Pipeline.Run itself can't run here (it builds a real reader.Reader),
so the tests share the same fakeClient pattern dispatcher_test.go
uses and drive Tick directly.

* test(s3/lifecycle): drop unused snapshot helper and addAndTick parameter
2026-05-09 14:12:21 -07:00
Chris Lu
1854101125 feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386)
* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8)

scan_only actions only fire from the bootstrap walk: the engine
classifies a rule as scan_only when its retention horizon exceeds the
meta-log retention, so event-driven routing can't be trusted. Today
each bucket walks once per process, so a long-running worker never
revisits — scan_only retention only catches up when the worker
restarts.

Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk
(name -> completion time). KickOffNew now re-walks a bucket whose
last walk completed more than BootstrapInterval ago. Zero interval
preserves the legacy walk-once-per-process behavior so existing
deployments don't change cadence by default. walkBucket re-stamps
on success and clears the stamp on failure (via MarkDirty), so the
next KickOffNew picks failed walks back up.

Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and
a Now func() for testable time travel.

weed shell run-shard grows --bootstrap-interval (cadence knob) and
--force-bootstrap (drop in-memory state at startup so every bucket
walks again immediately, useful when a config change should take
effect without a restart).

Tests: cadence respected (skip inside interval, re-walk past it);
zero interval keeps once-per-process; MarkDirty forces re-walk
under a 24h interval; MarkAllDirty resets every record. The
fakeClock helper guards the test clock with a mutex so race-detector
runs are clean.

* fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag

Three issues with the Phase 8 cadence work as it landed:

1. lastWalk did double duty as both completed-walk timestamp and
   in-flight debounce. A walk that took longer than BootstrapInterval
   would have a fresh KickOffNew start a duplicate goroutine on the
   next refresh tick because the stamp from KickOffNew looked stale
   against the interval. Split into lastCompleted (set on success)
   and inFlight (set on dispatch, cleared after the walk goroutine
   returns success or failure). KickOffNew skips inFlight buckets
   regardless of cadence.

2. The cadence knob existed on `weed shell` but not on the production
   path: scheduler.Scheduler constructed BucketBootstrapper without
   BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had
   no field for it. Add Scheduler.BootstrapInterval, parse
   `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk-
   once-per-process; negative clamps to zero), and forward it from
   the handler. Tests cover default, override, clamp, and explicit-zero.

3. --force-bootstrap was a no-op: BucketBootstrapper is freshly
   allocated at command start, so MarkAllDirty on empty state does
   nothing, and the flag couldn't influence an already-running
   process anyway. Remove it; a real runtime trigger (SIGHUP, control
   RPC) is a separate change.

In-flight regression: a blockingInjector pins the first walk in
progress while the test advances the clock past the interval. The
second KickOffNew is a no-op (inFlight check). After release, the
post-completion KickOffNew within the interval is also a no-op.

* test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock

The cadence test polled listedN to know "the walk happened" — but
that fires once both list passes are issued, while the success-stamp
lands later, after walkBucketDir returns. A clock.Advance(30m)
between those two events would record the stamp at clock+30m
instead of T0; the next assertion would then see now.Sub(last) < 1h
and skip the expected re-walk. Tight in practice but exposed under
-race / load.

Add a waitForCompleted helper that polls b.lastCompleted directly,
and use it before each clock advance in both the cadence and zero-
interval tests.

* fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks

Two follow-ups on Phase 8.

The worker config descriptor had no bootstrap_interval_minutes field,
so the production operator UI couldn't enable the cadence — only the
internal ParseConfig + Scheduler wiring knew about it. Add the field
to the cadence section (MinValue=0 since 0 is the legacy default) and
include the default in DefaultValues so existing deployments see the
knob with the right preset.

MarkDirty / MarkAllDirty silently lost their effect when a walk was
in flight: the methods cleared lastCompleted, but the walk's success
path then wrote a fresh timestamp, hiding the operator's invalidation.
Track a pendingDirty set; the walk goroutine consumes the flag on
exit and skips the success stamp, so the next KickOffNew picks the
bucket up immediately.

Regression: pin a walk in progress with a blockingInjector, MarkDirty
the bucket, release the walk, and assert lastCompleted stayed empty
plus the next KickOffNew triggers a new walk inside the
BootstrapInterval window.

* refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty

These methods were the operator-hook half of Phase 8, but the only
caller (--force-bootstrap on the shell command) was removed when it
turned out to be a no-op against a freshly-allocated bootstrapper.
Nothing in production calls them anymore.

Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty
set, the dirty-suppression branch in walkBucket, and the three tests
that only exercised those methods. BootstrapInterval-driven
re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP,
control RPC) is a separate change with a real call site.
2026-05-09 13:42:31 -07:00
Chris Lu
edfa1ce210 feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4) (#9385)
* feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4)

Bootstrap covers existing versions, but a live PUT that creates a new
.versions/<v-new> file and updates the parent's ExtLatestVersionIdKey
didn't fire NoncurrentDays / NewerNoncurrent on the displaced prior
version until the next bootstrap. Close that runtime gap.

The meta-log already emits an Update event for the .versions/
directory itself when the latest pointer changes; the router was
dropping it because buildObjectInfo returns nil for directories. New
branch in Route detects that shape (versioned bucket, NewEntry +
OldEntry both directories with the .versions/ suffix, ExtLatestVersionIdKey
changed, ID different from the new ID) and emits a Match against the
LOGICAL key with VersionID=oldID. Match.Identity comes from a single
LookupVersion RPC for the displaced version file; SuccessorModTime is
the directory update's mtime, which is the moment the displaced
version became noncurrent.

SiblingLister grows LookupVersion(bucket, key, versionID) for that
single-RPC fetch. filerSiblingLister implements it; routing path
treats NotFound as "displaced version was hard-deleted in the
meantime, suppress" rather than an error.

The router gates the lookup on at least one active event-driven
NoncurrentDays / NewerNoncurrent rule for the bucket, so most buckets
pay nothing per directory update.

Tests: pointer-flip fires NoncurrentDays with displaced version_id;
unchanged pointer skips; empty old pointer skips (first-PUT scenario);
displaced-version NotFound suppresses; no-rule skips lookup;
NewerNoncurrentVersions retains rank-0; unversioned bucket skips.

* fix(s3/lifecycle): SuccessorModTime cache + NewerNoncurrent expansion

Two correctness gaps in pointer-transition routing.

The .versions/ directory's own Attributes.Mtime is preserved across
pointer updates by updateLatestVersionInDirectory: it's a stale clock
relative to the freshly-written latest version. Using it as the
displaced version's SuccessorModTime made NoncurrentDays compute
due = staleMtime + days, which fires immediately on a fresh PUT
into an old .versions/ container. Read ExtLatestVersionMtimeKey
written by setCachedListMetadata; suppress (return no matches) when
the cache is missing rather than fall back to dir mtime.

Single-oldID lookup is only enough for pure
NoncurrentVersionExpirationDays. Any rule with NewerNoncurrentVersions
> 0 cares about the noncurrent ranks, and a pointer flip shifts every
prior noncurrent's index by one — the version that just crossed the
keep-count threshold needs to be evaluated too. When any matching
rule needs ranks, list the full .versions/ container, sort newest-
first with mtime + version-id tiebreak, and route every noncurrent
with its real index. Identity-CAS dedups against earlier schedules.

SiblingLister grows ListVersions(bucket, key); filerSiblingLister's
implementation paginates the container fully.

Two regression tests: stale dir mtime + correct cached mtime
schedules ~30 days out (not immediate); NewerNoncurrentVersions=2
with 4 versions fires on the rank-2 entry that just crossed the
threshold while rank-0/1 are retained.

* fix(s3/lifecycle): bound pointer-transition expansion to threshold crossings

routePointerTransitionExpand emitted a Match for every eligible
noncurrent on every PUT. Schedule.Add doesn't dedup, so identity-CAS
at dispatch only saved the wasted RPC, not the heap slot. A hot key
with many already-eligible versions and a count rule would push
O(versions) entries per flip, repeatedly, until dispatch caught up.

Bound the emission to versions that newly entered eligibility on
this specific flip: rank 0 (the displaced version, for the
NoncurrentDays clock) plus rank == rule.NewerNoncurrentVersions
for each active count-gated rule (the version that just crossed
from kept to expired). Bootstrap still owns full backfill for
versions that were already over-threshold.

Adds a regression with 6 versions and NewerNoncurrentVersions=2:
asserts only the rank-2 entry that just crossed fires, not the
already-over-threshold rank-3/rank-4 entries.
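
The bounded rank set can be sketched as below. Function name and shape are illustrative, not the router's real API: on one pointer flip, the newly eligible ranks are 0 (the just-displaced version, which starts the NoncurrentDays clock) plus, for each count-gated rule keeping N newer noncurrents, rank N (the version that just crossed from kept to expirable); everything already over-threshold is left to bootstrap.

```go
package main

import "fmt"

// ranksToEmit returns the noncurrent ranks to schedule on this flip,
// given the NewerNoncurrentVersions keep-counts of the active
// count-gated rules.
func ranksToEmit(keepCounts []int) map[int]bool {
	ranks := map[int]bool{0: true} // the displaced version always emits
	for _, n := range keepCounts {
		if n > 0 {
			ranks[n] = true // the rank that just crossed the threshold
		}
	}
	return ranks
}

func main() {
	// NewerNoncurrentVersions=2 with 6 versions: only ranks 0 and 2
	// emit on this flip; ranks 3 and 4 are bootstrap's job.
	fmt.Println(ranksToEmit([]int{2}))
}
```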

* fix(s3/lifecycle): suppress pointer-transition expansion when newID missing

routePointerTransitionExpand defaulted latestPos to 0 if newID wasn't
found in the listing. That made the actual newest sibling latest
against the pointer's intent, then misranked every other version. A
race between the pointer write and the version write could land us
there.

Default latestPos to -1, set it only on a real match, and suppress
the expansion when the search misses. Bootstrap repairs state on
the next walk.

The NewerNoncurrentVersions retention test was setting only
lookupEntry, so Route never reached the expansion path it claimed to
exercise. Repoint to listVersions and assert ListVersions was
consulted while LookupVersion was not. Adds a regression covering
the missing-newID suppression directly.

* fix(s3/lifecycle): include bare null version in pointer-transition routing

Bootstrap models the bare-key object as a "null" sibling alongside
.versions/ children, but the live pointer-transition path didn't.
Two cases lost:

1. oldID == "" was treated as "nothing displaced". A pre-versioning
   bare object becomes noncurrent when the first versioned PUT lands
   and the pointer flips to a real id, but live routing skipped it
   and waited for the next bootstrap.

2. The expansion path's ListVersions returned only .versions/
   children. With a bare null in the picture, the noncurrent ranks
   were wrong, so NewerNoncurrentVersions could keep the wrong
   versions and delete the right ones (or vice versa).

SiblingLister grows LookupNullVersion(bucket, key) returning the
bare entry plus an explicit-null flag (matches the bootstrap shape).
filerSiblingLister implements it via util.NewFullPath +
filer_pb.LookupEntry.

routePointerTransitionDisplaced: oldID == "" now consults
LookupNullVersion. When the bare entry exists, route it as
VersionID="null" against the LOGICAL key.

routePointerTransitionExpand: collect .versions/ children plus the
null entry into one sibling slice before sorting and ranking. The
threshold-crossing logic now sees the same N-version set that
bootstrap would compute.

Three new tests: oldID == "" with no null is a no-op (one null
lookup, no version lookup); oldID == "" with a bare null schedules
NoncurrentDays as VersionID="null"; expansion with a bare null
between .versions/ siblings places null at its mtime-correct rank
and only that rank-N entry fires.

* fix(s3/lifecycle): atomic listPageSize so test cleanup doesn't race

KickOffNew dispatches walks via `go b.walkBucket(...)`. A test that
finishes before its goroutines drain leaves them running into the
next test's t.Cleanup, which mutates listPageSize. -race spots the
read/write collision intermittently. Convert listPageSize to
atomic.Uint32; tests use Load/Store. No production semantics change.

* fix(s3/lifecycle): null becomes latest when suspended PUT clears pointer

The router treated newID == "" as if the cached
ExtLatestVersionMtimeKey were still authoritative — but that cache
holds the displaced version's mtime, written by setCachedListMetadata
when the prior version became latest. Using it as SuccessorModTime
made NoncurrentDays=30 immediately fire on a 100-day-old displaced
version even though it became noncurrent today.

When newID == "" the bare null is the new latest. Look it up,
substitute its mtime as the successor clock, and substitute "null"
as the latestPos target for the expansion path's id match. Both
displaced and expand paths now derive the right clock.

updateIsLatestFlagsForSuspendedVersioning was the upstream cause of
the staleness — it cleared ExtLatestVersionIdKey and FileNameKey but
left the cached size/mtime/etag/owner/delete-marker behind. Call
clearCachedVersionMetadata so the .versions/ container is consistent
with "null is latest". The router-side guard is still needed for
older deployments that ran the buggy code, but new writes won't
exercise the workaround.

Two regressions: 100-day-old displaced under NoncurrentDays=30 with
a today-null PUT schedules ~30d out (not immediate); same shape with
NewerNoncurrentVersions=2 ranks the null at latest and only the
rank-2 entry fires.
2026-05-09 12:21:35 -07:00
Chris Lu
2f7ac1d664 feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3) (#9383)
* feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3)

Bootstrap now expands every <key>.versions/ directory into one event
per version with sibling state pre-computed. The router fires
NoncurrentDays / NewerNoncurrent off these events using
SuccessorModTime as the noncurrent clock; previously these rules
never ran on a versioned bucket because buildObjectInfo couldn't
classify version-folder events without the latest pointer.

Mechanics

walkBucketDir treats a directory ending in .versions and carrying
ExtLatestVersionIdKey as a SeaweedFS .versions container — emit it
once and skip the recursion. Coincidentally-named directories without
the latest pointer recurse normally.

BucketBootstrapper.expandVersionsDir lists the children, sorts
newest-first by mtime, resolves the latest position from the pointer,
and injects a synthesized reader.Event per version with
BootstrapVersion populated. NoncurrentIndex is 0-based among
noncurrents in newest-first order; SuccessorModTime is the immediate
newer sibling's mtime (zero for the latest). A pointer naming a
missing version, or an absent pointer, falls back to the newest-by-mtime
sibling so a race window can't flag every entry as noncurrent.
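The per-version state above can be sketched as follows, assuming the steady-state case where the latest pointer names the newest-by-mtime child (index 0); names and types are illustrative, not the repo's:

```go
package main

import "fmt"

type version struct {
	id    string
	mtime int64
}

type expanded struct {
	id               string
	noncurrentIndex  int   // 0-based among noncurrents, newest-first; -1 for the latest
	successorModTime int64 // immediate newer sibling's mtime; zero for the latest
}

// expand takes children already sorted newest-first. Every entry after
// the latest gets its rank among noncurrents and its immediate newer
// sibling's mtime as the noncurrent clock.
func expand(sorted []version) []expanded {
	out := make([]expanded, 0, len(sorted))
	for i, v := range sorted {
		e := expanded{id: v.id, noncurrentIndex: -1}
		if i > 0 {
			e.noncurrentIndex = i - 1
			e.successorModTime = sorted[i-1].mtime
		}
		out = append(out, e)
	}
	return out
}

func main() {
	for _, e := range expand([]version{{"v3", 300}, {"v2", 200}, {"v1", 100}}) {
		fmt.Println(e.id, e.noncurrentIndex, e.successorModTime)
	}
}
```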

routeBootstrapVersion uses BootstrapVersion to build ObjectInfo
directly (bypassing the version-folder skip in buildObjectInfo) and
runs the standard match loop. ABORT_MPU is excluded by kind-shape
gate. The schedule clock uses SuccessorModTime for noncurrents and
ModTime for the latest, so the dispatcher fires when the rule's days
threshold is met. Match.ObjectKey is the LOGICAL key,
Match.VersionID is the marker's stored version_id — the dispatcher
reaches deleteSpecificObjectVersion or createDeleteMarker correctly.

Layer 2 tests cover both sides. Router: latest fires ExpirationDays;
noncurrent fires NoncurrentDays; NewerNoncurrentVersions retains the
N newest noncurrents; ABORT_MPU never matches. Bootstrap: .versions
dir emitted once and not recursed; missing latest pointer falls back
to newest; backdated PUT (latest pointer is older by mtime) keeps
the right noncurrent index; delete-marker flag propagates.

* fix(s3/lifecycle): no VersionID for latest expirations, child-based dir disambig

Two correctness gaps in Phase 5b/3.

Bootstrap was pinning the version_id on every Match. For
EXPIRATION_DAYS / EXPIRATION_DATE on the latest version this is
unsafe: between schedule and dispatch a fresh PUT can land, the
dispatcher would still identity-match against the original version's
bytes (it still exists at that path) and the resulting delete marker
would hide the new latest. Drop VersionID for those kinds; an empty
VersionID makes the dispatcher fetch the current latest, where
identity-CAS resolves to STALE_IDENTITY and bootstrap re-schedules
with the new latest's identity. NoncurrentDays / NewerNoncurrent /
EXPIRED_DELETE_MARKER still pin the version_id since those are
version-targeted.
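The pinning rule above can be restated as a small predicate; the kind constants here are hypothetical stand-ins, not the repo's actual identifiers:

```go
package main

import "fmt"

type matchKind int

const (
	ExpirationDays matchKind = iota
	ExpirationDate
	NoncurrentDays
	NewerNoncurrentVersions
	ExpiredDeleteMarker
)

// pinVersionID reports whether a scheduled Match should carry the
// version_id. Latest-targeted expirations leave it empty so the
// dispatcher re-fetches the current latest and identity-CAS turns a
// raced fresh PUT into STALE_IDENTITY instead of a wrong delete marker.
func pinVersionID(k matchKind) bool {
	switch k {
	case ExpirationDays, ExpirationDate:
		return false // latest-targeted: dispatcher resolves the current latest
	default:
		return true // version-targeted kinds stay pinned
	}
}

func main() {
	fmt.Println(pinVersionID(ExpirationDays), pinVersionID(NoncurrentDays))
}
```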

Gating isVersionsDir on ExtLatestVersionIdKey left a race window:
createDeleteMarker writes the version file before updating the
parent's Extended pointer, so a walk between those two steps would
see a .versions/ dir without the pointer, recurse into it, and emit
raw version files that the router drops. Match the suffix only and
let expandVersionsDir disambiguate by child inspection: if any child
carries ExtVersionIdKey it's a real .versions container and we expand;
otherwise it's a coincidentally-named user folder and we recurse via
the bucket-walk's own callback so nested entries still flow through.

Tests: latest-expiration assertion flipped to expect empty VersionID;
new tests cover the coincidentally-named-folder recursion and the
race-window expansion (children present, pointer absent).

* fix(s3/lifecycle): filter directory + missing-version-id children at listing

expandVersionsDir's listing callback collected every child with
attributes; subdirectories or entries without ExtVersionIdKey would
make it past the empty-id skip in the inner loop but still inflate
NumVersions and skew NoncurrentIndex, since the rank derives from an
entry's position in the filtered slice while the sort had run over the
unfiltered one. Drop directories at listing time and partition the
file children into a versions slice that is the actual rank source.

Test cleanups: out-of-order-mtime test now sets v1 older than v2 so
latestPos > 0 actually exercises the rank-skip branch in
expandVersionsDir; bootstrapVersionEntry preserves nanosecond
precision via MtimeNs to match markerLoneEntry's pattern; drop a
leftover unused idx variable.

* fix(s3/lifecycle): null version + canonical version-id tiebreak

Two correctness gaps in Phase 5b/3 bootstrap.

Null versions live at the bare logical path, not under .versions/.
Bootstrap previously expanded only .versions/<key>/ children, so:
  - pre-versioning objects with newer .versions/ history never had
    their null version expired by NoncurrentDays
  - suspended-bucket writes (which clear the .versions/ latest pointer
    so null becomes current) had every .versions/ child wrongly
    classified as latest by the buildObjectInfo fallback

expandVersionsDir now looks up the bare key via NewFullPath +
LookupEntry, accepts a regular file or an explicit S3 directory-key
marker (Mime set), and folds it into the sibling set with
VersionID="null". Latest resolution: pointer present + names a real
id wins; pointer absent + null exists makes null latest; otherwise
falls back to newest sibling. The walker's regular emission for the
bare entry would otherwise duplicate it, so walkBucketDir now does a
two-pass walk per directory level: .versions/ first, then everything
else, with a per-walk skipBare set (keyed by bucket-relative path)
that expandVersionsDir populates when it claims a bare null sibling.

Sort tiebreak: PUTs record Mtime at second granularity only, so two
versions written in the same second tied. The unstable secondary order let
old-format version filenames sort oldest-first and corrupt
NoncurrentIndex under NewerNoncurrentVersions retention. Add
CompareVersionIds to s3lifecycle/version_time.go (mirrors the
canonical comparator in s3api/s3api_version_id.go to avoid the
import cycle) and use it as a secondary key after mtime equality.
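The stabilized sort can be sketched like this; compareVersionIds here is a plain lexicographic stand-in for the canonical comparator the commit mirrors into s3lifecycle/version_time.go, and the tiebreak direction is illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type sibling struct {
	versionId string
	mtime     int64 // seconds; ties are common for same-second PUTs
}

// Stand-in for CompareVersionIds; the real one follows the canonical
// comparator in s3api/s3api_version_id.go.
func compareVersionIds(a, b string) int { return strings.Compare(a, b) }

// sortNewestFirst orders newest-first by mtime, breaking same-second
// ties deterministically with the version-id comparator so rank math
// (NoncurrentIndex) can't flip between bootstraps.
func sortNewestFirst(s []sibling) {
	sort.SliceStable(s, func(i, j int) bool {
		if s[i].mtime != s[j].mtime {
			return s[i].mtime > s[j].mtime
		}
		return compareVersionIds(s[i].versionId, s[j].versionId) > 0
	})
}

func main() {
	s := []sibling{{"id-a", 100}, {"id-b", 100}, {"id-c", 50}}
	sortNewestFirst(s)
	for _, x := range s {
		fmt.Println(x.versionId)
	}
}
```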

Tests: pre-versioning null-as-noncurrent, suspended null-as-current,
directory-key marker as null version, end-to-end claim through
walkBucketDir's two-pass ordering, and same-second tiebreak via
canonical version-id ordering. fakeFilerClient grows a
LookupDirectoryEntry implementation backed by the same in-memory tree.

* fix(s3/lifecycle): only treat explicit-null bare entries as current

The pointer-missing branch in expandVersionsDir made null latest as
soon as a bare object was found. That's correct for suspended-bucket
writes (s3api_object_handlers_put.go writes the bare entry with
ExtVersionIdKey="null") but wrong for the pre-versioning race window:
a brand-new version under .versions/<file> exists before the parent's
ExtLatestVersionIdKey update lands, and a pre-versioning bare object
has no version-id marker. Marking that older bare object latest hides
the real new version and skips noncurrent expiration of the null
until the next process restart/bootstrap.

Distinguish the two: lookupNullVersion now returns whether the bare
entry's Extended map carries ExtVersionIdKey="null" (the suspended
write marker). expandVersionsDir's pointer-missing branch only
promotes null to latest when explicit; otherwise it falls back to
newest-sibling, which is safe for the race window since the new
version's mtime is fresher than the bare object's.

The existing suspended-null test now uses a new helper that adds the
explicit marker. New regression test covers the race window: bare
entry without the marker + a fresh .versions/<v1> file + missing
parent pointer must keep v1 as latest and the null as noncurrent.

* fix(s3/lifecycle): only the newest item can be the explicit-null latest

The pointer-missing branch in expandVersionsDir scanned every item for
an explicit null and promoted it to latest. After a suspended->enabled
transition that's the wrong call: createVersion writes the version
file before updating ExtLatestVersionIdKey, so a bootstrap that lands
in the race window sees an older bare null with ExtVersionIdKey="null"
plus a newer .versions/<v-new> child and no parent pointer. Promoting
the null misclassifies v-new as noncurrent and skips both the new
version's current-version expiration and the null's noncurrent
scheduling until the next bootstrap.

Constrain the explicit-null branch to items[0]: if the suspended-null
write is genuinely current it'll be the newest by mtime AND tagged.
Anything else falls through to the newest-sibling default.
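A hypothetical condensation of the latest-resolution rules as they stand after this fix; field names and the signature are assumed, not the repo's:

```go
package main

import "fmt"

type item struct {
	id           string // version id; "null" for the bare entry
	mtime        int64
	explicitNull bool // bare entry tagged ExtVersionIdKey="null"
}

// resolveLatestPos expects items sorted newest-first:
//  1. a parent pointer naming a present id wins outright;
//  2. with no usable pointer, null may be latest only when it is BOTH
//     the newest item and explicitly tagged; an explicit null buried
//     deeper in the list is never promoted (the bug this fix removes);
//  3. otherwise fall back to the newest sibling, safe in the
//     createVersion race window because the fresh version's mtime
//     beats the bare object's.
func resolveLatestPos(pointer string, items []item) int {
	if pointer != "" {
		for i, it := range items {
			if it.id == pointer {
				return i
			}
		}
	}
	if len(items) > 0 && items[0].id == "null" && items[0].explicitNull {
		return 0 // genuinely-current suspended write
	}
	return 0 // newest-by-mtime fallback
}

func main() {
	raceWindow := []item{
		{"v-new", 200, false}, // fresh createVersion child, pointer not yet updated
		{"null", 100, true},   // older explicit null from the suspended era
	}
	fmt.Println(resolveLatestPos("", raceWindow)) // v-new (index 0) stays latest
}
```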

Adds a regression test for the suspended->re-enabled race.

* fix(s3/lifecycle): paginate bootstrap directory listings

SeaweedList(..., limit=0) is a single-page request: the filer caps
limit=0 at DirListingLimit (1000 by default) and returns whatever fits
in one round trip. expandVersionsDir and walkBucketDir both relied on
that, so any directory bigger than the cap silently truncated. For
noncurrent retention this is a correctness bug, not just a scale
limit: a hot key with more versions than the cap had its rank/sort
math computed off the first page only, leaving NumVersions,
NoncurrentIndex, SuccessorModTime, and the latest-fallback all wrong,
with the older versions never scheduled until a future bootstrap.

Add a listAll helper that drives pagination via StartFromFileName +
inclusive=false, looping until a page returns fewer entries than the
configured page size. Use it in both call sites. Page size is a var
(listPageSize, default 1024) so tests can shrink it without
generating thousands of entries.
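The pagination loop described above can be sketched like this; the lister type stands in for the filer's list RPC, and all names and signatures here are assumptions for illustration:

```go
package main

import "fmt"

// lister returns up to `limit` child names strictly after `startFrom`
// (i.e. inclusive=false), mirroring StartFromFileName semantics.
type lister func(startFrom string, limit int) []string

// listAll pages until a page comes back shorter than the page size,
// which signals the final page; each page resumes after the last
// entry seen.
func listAll(list lister, pageSize int) []string {
	var all []string
	startFrom := "" // empty means list from the beginning
	for {
		page := list(startFrom, pageSize)
		all = append(all, page...)
		if len(page) < pageSize {
			return all
		}
		startFrom = page[len(page)-1]
	}
}

func main() {
	children := []string{"a", "b", "c", "d", "e"}
	// In-memory fake with the same semantics as the real listing:
	// sorted by name, honor startFrom, cap at limit.
	fake := func(startFrom string, limit int) []string {
		var out []string
		for _, c := range children {
			if c > startFrom && len(out) < limit {
				out = append(out, c)
			}
		}
		return out
	}
	fmt.Println(listAll(fake, 2)) // all five entries across three pages
}
```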

The fake filer client now mirrors the real semantics: sort children
by name, honor StartFromFileName/InclusiveStartFrom, cap at Limit.
New regression tests force a small page size and assert the full
result set is processed and the call count matches what pagination
should drive.

* perf(s3/lifecycle): stream bucket walk in two passes instead of buffering

walkBucketDir was paginating into a children slice and then iterating
twice (pass 1: .versions/, pass 2: everything else). For flat buckets
with millions of entries the buffer is a real memory spike. Drop the
materialization: each pass now drives its own listAll over the same
directory and acts on entries as they stream in. The skipBare ordering
contract is preserved — pass 2 still runs after pass 1 finishes — and
the per-pass paging keeps memory bounded by listPageSize.

Tradeoff: each directory level is listed twice. For workloads where
the extra listing matters more than the memory headroom, we can
revisit; for now this setting of the correctness/scale dial is what
the noncurrent rules need.
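The streaming two-pass shape can be sketched under assumed names; listDir stands in for a paginated listAll over one directory level, driven once per pass, so memory stays bounded by the page size rather than the directory (the real claim only happens when a bare null sibling actually exists):

```go
package main

import (
	"fmt"
	"strings"
)

func walkDir(listDir func(visit func(name string)), emit func(name string)) {
	skipBare := map[string]bool{}
	// Pass 1: .versions/ containers only. Expanding one may claim the
	// bare sibling at the logical key into skipBare.
	listDir(func(name string) {
		if strings.HasSuffix(name, ".versions") {
			emit(name)
			// stand-in for expandVersionsDir claiming its bare null sibling
			skipBare[strings.TrimSuffix(name, ".versions")] = true
		}
	})
	// Pass 2: everything else, honoring pass 1's claims. The ordering
	// contract holds because pass 2 only starts after pass 1 finishes.
	listDir(func(name string) {
		if !strings.HasSuffix(name, ".versions") && !skipBare[name] {
			emit(name)
		}
	})
}

func main() {
	entries := []string{"photo.jpg", "photo.jpg.versions", "readme.txt"}
	listDir := func(visit func(string)) {
		for _, e := range entries { // the real code pages via listAll here
			visit(e)
		}
	}
	walkDir(listDir, func(name string) { fmt.Println(name) })
}
```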

Updated three tests for the new call count: each walk now records 2
listings per directory (pass 1 + pass 2). The KickOffNew dedup tests
expect 2 calls per bucket; the pagination test expects 6 instead of 3.
2026-05-09 10:48:32 -07:00