10 Commits

Author SHA1 Message Date
Chris Lu
3f4cb6d2fb feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)

Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash),
and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot
is a stub keyed on a package-level atomic that the worker startup wiring
populates.

Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.

No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.

API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.

30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split,
RecoveryView activating pre-BootstrapComplete actions,
ReplayContentHash partition-independence, PromotedHash sensitivity to
promotion in either direction, MaxEffectiveTTL aggregation. Build +
race-tests green.

* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics

Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):

1. Duplicated sort comparator between ReplayContentHash and
   PromotedHash. Extract sortHashItems shared helper so the two
   hashes use the same ordering by construction — if one drifted, the
   cursor could see a spurious "rule changed" on a no-op snapshot
   rebuild.

2. Duplicated writeField/writeInt closures. Extract hashWriter struct
   holding the sha256 running hash + lenbuf, with method helpers.
   Same allocation profile (one Hash, one tiny stack buffer per
   helper); just deduplicates ~20 lines.

3. shardID parameter on RulesForShard is unused. Per the design's
   open question, every shard sees every rule today (shard filter
   runs at the entry-iteration site, not view construction). Keep
   the parameter for API stability — removing it now would force
   a breaking change when bucket-shard ownership lands — and update
   the doc comment to explain why it's reserved.

go build ./... clean; engine test suite green.
2026-05-11 18:07:54 -07:00
Chris Lu
82648cca53 test(s3/lifecycle/engine): pin delay-group dedup across buckets (#9418)
Compile a 100-bucket × 5-rule snapshot where the five Days values
include duplicates (1, 1, 7, 7, 30) and assert:

- snap.actions has 500 entries — every (bucket, rule) compiles to its
  own ActionKey, no collapse.
- snap.originalDelayGroups has exactly 3 entries — the routing index
  is keyed by Delay, so same-day rules across all buckets share a
  group. This is the property that lets the dispatcher index by
  delay group rather than per-rule.
- Per-group key count = (rules with that day) × buckets, so every
  action is reachable from its group entry.
2026-05-10 10:36:54 -07:00
Chris Lu
c7b01c72b2 test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.

Helper added: putNewerNoncurrentLifecycle.

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

Generic err != nil also passes on transport/auth/timeouts, letting
the test go green without proving the lifecycle action actually
fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust whatever ListObjectVersions returned.
2026-05-10 09:30:50 -07:00
Chris Lu
b740e22e63 test(s3/lifecycle): bundle dispatcher + engine edge-case coverage (#9413)
* test(s3/lifecycle): bundle dispatcher + engine edge-case coverage

Two-package bundle covering uncovered branches in production code that
the existing happy-path tests don't reach. Dispatcher 58.1% → 60.2%
and engine 81.0% → 81.7% (engine lift modest because most branches
were already hit; the nil-rule defensive case is otherwise unreachable
from a Compile flow).

dispatcher (4 tests):
- FilerPersister.Load with nil Store errors with a "nil Store"
  message rather than panicking at the Read call.
- FilerPersister.Save with nil Store same.
- FilerPersister.Load with a non-NotFound transport error wraps the
  shard ID into the message AND keeps the underlying error
  recoverable via errors.Is.
- FilerPersister.Load with successful empty []byte returns an empty
  map, not a JSON-decode error — pinning that an existing-but-empty
  cursor file is treated as "no entries".
- Tick initializes the retries map on first call without panic so a
  freshly-constructed Dispatcher works.
- Tick with already-canceled ctx re-queues the popped Match, returns
  zero, and never invokes the LifecycleDelete client — the Match
  must not be lost across worker restart.

engine (4 tests):
- rulePredicateSensitive(nil) returns false rather than panicking on
  the FilterTags dereference. The non-nil paths run through Compile,
  but a defensive nil-rule arrival isn't reachable that way.
- rule with no FilterTags / empty FilterTags map returns false (the
  check is len(FilterTags) > 0, so empty must classify as
  non-sensitive — pinning catches a flipped >= comparison).
- rule with a populated FilterTags returns true.

* fix(s3/lifecycle): Tick must requeue every drained Match on shutdown

Per codex review on #9413: Tick called Schedule.Drain to pop ALL due
matches at once, then iterated. If ctx canceled mid-loop, only the
current Match was re-added — everything past that index was silently
lost across the worker restart. With N due matches, up to N-1 were
dropped.

Fix: on cancellation, re-add due[i:] (current + remaining) before
returning. Matches already dispatched (due[:i]) stay processed; the
schedule is left exactly as it would be if Drain had returned only
the dispatched prefix.

Strengthen the existing test to enqueue three due matches and assert
sched.Len()==3 after a pre-canceled Tick. Pre-fix the test would have
seen Len()==1 because only the first popped Match was re-added.
2026-05-09 22:02:17 -07:00
Chris Lu
ca95d33092 test(s3/lifecycle): bundle dispatcher + engine accessor coverage (#9410)
* test(s3/lifecycle): bundle dispatcher + engine accessor coverage

Two-package bundle covering pure helpers and snapshot read-side
accessors that the router and dispatcher reach for at runtime. None
were directly tested; regressions previously surfaced only as
downstream Tick / Match / Compile failures.

dispatcher (10 tests):
- keyOf: derives every retryKey field from the Match; equal Match
  values produce equal keys (so the second dispatch hits the first's
  retry counter); distinct VersionIDs and ActionKinds produce
  distinct keys (so a noisy version can't starve a healthy one,
  and two kinds on the same object don't share a budget).
- budget(): configured value when set; defaultRetryBudget when zero
  or negative — pins the >0 guard against a flipped comparison.
- backoff(): same pattern as budget for RetryBackoff.

engine snapshot accessors (8 tests):
- OriginalDelayGroups exposes the compiled per-delay groups; rules
  with multiple kinds at different cadences land in distinct entries;
  scan-only actions don't leak into delay groups so the dispatcher
  doesn't try to drive them event-driven.
- PredicateActions populated for tag-sensitive rules, empty for non-
  tag-sensitive ones (so MatchPredicateChange doesn't route
  irrelevant kinds).
- DateActions surfaces ExpirationDate verbatim for date kinds; empty
  for non-date rules.
- MarkActive on an unknown key is a no-op (durable bootstrap-complete
  write races a recompile that drops the rule; panic here would crash
  the worker).
- MarkActive flips a fresh-no-prior-state action from inactive to
  active.
- BucketActionKeys covers every kind RuleActionKinds reports.

* test(s3/lifecycle): strengthen snapshot accessor content assertions

Per gemini review on #9410: assertions previously only checked counts
and non-empty status. Verify the specific ActionKeys land where
expected so an indexing regression that produces the right number of
items with wrong kinds gets caught.

OriginalDelayGroups: each delay group's slice asserts.Contains the
specific (bucket, rule_hash, kind) ActionKey instead of just
NotEmpty.

PredicateActions: assert.Contains the expected key instead of just
NotEmpty.

BucketActionKeys: every key.Bucket must equal the test bucket (catches
cross-bucket leak), and ElementsMatch pins kinds against
RuleActionKinds.
2026-05-09 22:01:54 -07:00
Chris Lu
0955d1aa08 test(s3/lifecycle): direct prefixMatches + filterAllows coverage (#9408)
Both helpers were exercised indirectly through MatchOriginalWrite /
MatchPath; pinning them directly catches a regression at the helper
level so a Match-test failure isn't the first signal of a broken
filter.

prefixMatches: empty prefix fast path; exact-prefix match; non-match
rejection; path shorter than prefix.

filterAllows: no-filter accepts any event; FilterSizeGreaterThan is
strictly > (boundary value rejected); FilterSizeLessThan is strictly
<; zero-size thresholds mean "not set" (must let any size through —
a regression treating 0 as a real threshold would reject everything);
required tag present accepts; missing key, empty tags map, wrong
value, and missing-among-multiple all reject; size + tag filters are
AND'd so either failing rejects.
2026-05-09 20:47:35 -07:00
Chris Lu
1aa55f5bf9 test(s3/lifecycle): direct decideMode + RuleMode.String coverage (#9405)
Compile tests cover decideMode indirectly; these direct tests pin
every branch so a regression in the classifier itself can't slip
behind a more elaborate Compile failure.

Pinned: nil rule and Disabled status both → Disabled; ExpirationDate
→ ScanAtDate without consulting retention; metaLogRetention=0 means
unbounded so any horizon → EventDriven; horizon within retention →
EventDriven; horizon exceeding retention → ScanOnly; bootstrapLookback
adds to horizon (not retention) so a near-threshold case is still
gated; zero horizon (rule field unset) skips the gate. RuleMode.String
must render the documented names for every variant; an unknown value
collapses to "unspecified" rather than empty or panic.
2026-05-09 20:35:34 -07:00
Chris Lu
05d31a04b6 fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite

The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.

Three changes that work together:

- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
  the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
  the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
  tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
  Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
  land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
  -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
  shell command runs the full pipeline (reader + scheduler + dispatcher) for the
  duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
  out for now — versioned-bucket expiration via the worker still needs its own
  pass.

Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (8b87ceb0d updated `mini_plugin_test.go` for the s3_lifecycle
plugin but missed this twin test).

Two retention-gate engine tests `t.Skip` under the s3tests build because they
rely on absolute lookback-vs-retention math the day-rescale collapses; the prod
build still covers them.

* review: harden lifecycle worker spawn + assert handler identity

- Workflow: aliveness check on the backgrounded `weed shell` (a bad command
  exits in <1s and the suite would otherwise just opaque-timeout); move
  worker/server teardown into a `trap cleanup EXIT` so failure paths still
  print the worker log and reap the data dir.
- worker_test: check the actual job-type set by name, not just the count.

* fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet

The s3-tests CI runs the worker BEFORE any test creates a bucket, so
LoadCompileInputs returns empty and the shell command was bailing out
with "no buckets with enabled lifecycle rules found" within ~1s. The
aliveness check then fired exit 1 before tox ever started.

Two changes:

- Don't early-exit on empty inputs. Compile against the empty set, log a
  one-liner, and let the pipeline run normally — the meta-log subscription
  is already up, so events for buckets created later DO arrive; they just
  need the engine to know about them when they do.
- Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that
  periodically re-runs LoadCompileInputs + engine.Compile so rules added
  after startup land in the snapshot the dispatcher reads on its next
  tick. Production deployments keep the 5m default; only the CI workflow
  drops to 2s.

Workflow passes `-refresh 2s` in both basic and SQL blocks.

* fix(shell): backfill pre-rule entries via bootstrap walker

The reader-driven path only sees meta-log events created AFTER its
engine snapshot knows the rule. The s3-tests CI scenario PUTs objects
first, then PUTs the lifecycle config, so by the time the engine
refresh picks up the new bucket the object events have already been
seen-and-dropped (BucketActionKeys returned empty for the bucket).

Wire bootstrap.Walk into the shell command:

- bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one
  loop goroutine per fresh bucket.
- Each goroutine re-walks the bucket every walkInterval (defaults to
  the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and
  feeds each entry through bootstrap.Walk; due actions dispatch via a
  direct LifecycleDelete RPC. Not-yet-due entries are silently skipped
  and picked up on a later iteration once they age past their (rescaled
  or real) threshold.
- LifecycleDelete is called with no expected_identity; the server-side
  identityMatches treats nil as "skip CAS", which is the right call
  for bootstrap (the bootstrap entry doesn't carry chunk fid /
  extended hash anyway).

The dispatcher's pkg-private toProtoActionKind is duplicated in the
shell file rather than exported, since the shape is six lines and the
reverse import would pull a proto dep into the s3lifecycle root.

* refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg

The shell command got the backfill in the previous commit but the worker
plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run
directly and missed it — same root cause: the reader-driven path only
sees events created after the rule lands, so a daily cron picking up a
freshly-PUT rule wouldn't expire any pre-rule object.

Move the looping bucket walker into scheduler.BucketBootstrapper:

- Scheduler.Run now constructs one and calls KickOffNew on every engine
  refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval
  (defaults to RefreshInterval — 5m in prod, 2s under s3tests).
- The shell command consumes the same struct instead of its own copy
  so the two paths can't drift in semantics.

* refactor(s3/lifecycle): walk-once + schedule via event injection

Previous per-bucket walker re-listed every WalkInterval forever. For a
bucket with N objects under a long rule, the worker did O(N * runtime /
walkInterval) listings even when nothing was newly due — way too much
for production-scale buckets.

New approach: walk each bucket exactly once on first sight, synthesize
one *reader.Event per existing entry, push it onto Pipeline.events.
Router.Route builds a Match with DueTime=mtime+delay; future-due matches
sit in the per-shard Schedule and fire when their DueTime arrives.
Currently-due matches fire on the very next dispatch tick.

Wiring:

- dispatcher.Pipeline lifts its events channel into a struct field
  with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no
  longer closes the channel — the dispatch goroutine exits on runCtx
  cancellation, which works the same as channel-close did.
- scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew
  spawns one walker goroutine per fresh bucket; the goroutine lists,
  synthesizes events, then exits.
- scheduler.Scheduler builds its pipelines up front and exposes a
  pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi-
  worker scheduler routes each synthesized event to the pipeline that
  owns its shard.
- Shell command's single-pipeline path passes pipeline.InjectEvent
  directly.

Synthesized events carry TsNs=0; dispatcher.advance treats that as a
no-op so the reader's persisted cursor isn't ratcheted past unprocessed
meta-log events. Identity (HeadFid + ExtendedHash) is still computed
from the real filer entry, so the server's identity-CAS catches an
overwrite between bootstrap and dispatch.

* debug(s3tests): make lifecycle worker progress visible in CI logs

The previous CI failure dumped an empty $LC_LOG even though the worker
was running. Two reasons:

1. weed shell suppresses glog by default (logtostderr / alsologtostderr
   set to false). Pass `-debug` so the bootstrapper's V(0) lines reach
   stderr instead of disappearing into /tmp/weed.*.log.
2. cleanup used `kill -9` which skips Go's stdout flush. SIGTERM first
   with a 1s grace, then SIGKILL the holdout, then read the log.

While here: bump the bootstrap walker's two informational logs to V(0)
so the diagnosis from CI doesn't require -v=1 on the worker.

* fix(s3/lifecycle/dispatcher): refresh snap on every event

Pipeline.Run captured snap at startup and only refreshed it on the
dispatch tick. With bootstrap event injection, the walker pushes events
seconds after engine.Compile sees the bucket — typically WITHIN the
same dispatch interval. Routing against the cached (empty) snap then
silently dropped every match because BucketActionKeys returned nil for
the bucket-not-yet-in-snapshot case.

Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so
the cost is negligible. The dispatch-tick branch keeps using a fresh
local read for its own loop, so its semantics are unchanged.
2026-05-08 17:29:47 -07:00
Chris Lu
8425c42858 feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355)
feat(s3/lifecycle): event router + DueTime schedule

Router consumes per-shard reader events, looks up matching ActionKeys via
the engine's BucketActionKeys index, and emits Matches with DueTime =
event_time + action.Delay. Evaluation runs at DueTime so the age gate
passes for fresh events; the dispatcher's identity-CAS catches drift.

Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles
the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor
added to engine.Snapshot.
2026-05-07 15:43:27 -07:00
Chris Lu
7f2b20d577 feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348)
* feat(s3/lifecycle): XML lifecycle config to canonical Rule

LifecycleToCanonical takes a parsed *Lifecycle and returns
[]*s3lifecycle.Rule, the flat shape the engine compiles against.
Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags +
size filters) flatten into the canonical Rule's individual fields;
single <Tag> filter populates FilterTags with one entry; <Prefix>
filter takes precedence over the rule's top-level <Prefix>.

Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on
the same XML <Rule>) populate every action field they declare.
RuleActionKinds expands the canonical rule into its compiled actions
downstream.

* feat(s3/lifecycle): engine snapshot skeleton + ActionKey type

Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's
primary identity, and adds the engine package's Snapshot type.
Snapshot is immutable after Compile (atomic-swapped on rebuild) and
holds the ActionKey-keyed routing indexes:

  - originalDelayGroups: map[time.Duration][]ActionKey
  - predicateActions:    []ActionKey
  - dateActions:         map[ActionKey]time.Time
  - actions:             map[ActionKey]*CompiledAction

CompiledAction.engineState is an atomic.Uint32 so MarkActive (called
after the durable bootstrap_complete + mode write commits) is visible
to in-flight reader passes without a recompile. The reader filters on
IsActive() before dispatching, so stale-snapshot dispatches are
prevented.

No callers yet; downstream commits add Compile, decideMode, and the
Match functions.

* feat(s3/lifecycle): decideMode + retention gate

decideMode picks the scheduling mode for one (rule, kind) compiled
action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE;
reader-driven kind whose eventLogHorizon + bootstrapLookbackMin
exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The
gate runs per (rule, kind), so a 90d ExpirationDays sibling can
degrade to scan_only while its 7d AbortMPU sibling stays active.

MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS
default (Phase 0 verified that meta-log files are written without
TtlSec by default), so the gate doesn't trip until an operator opts
in to volume-TTL pruning of /topics/.system/log/.

RuleMode is a Go-level enum here, separate from the wire-form
LifecycleState.RuleMode in the proto package; the worker maps between
them when reading/writing the durable state file.

* feat(s3/lifecycle): Compile builds the engine snapshot per-action

Compile produces a fresh Snapshot from per-bucket canonical rules.
Each input rule expands into N CompiledActions via RuleActionKinds;
mode comes from decideMode; activation requires both
bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN.

Routing indexes are populated by mode:
- SCAN_AT_DATE: always indexed in dateActions (detector schedules at
  rule.date regardless of bootstrap status; the action runs once on
  the date and is then done).
- EVENT_DRIVEN + active: indexed in originalDelayGroups (and in
  predicateActions when the rule has tag/size filters).
- SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan
  tick or operator action handle these.

snapshot_id is monotonic per process; pending writes stamp it. The
new snapshot replaces the engine's atomic pointer; in-flight reader
passes continue against their loaded snapshot.

Tests cover: single-action rule, multi-action expansion (one rule ->
three CompiledActions with three distinct delay groups), pending
bootstrap exclusion from indexes, retention gate, sibling actions
degrading independently under partial retention, ExpirationDate path,
disabled rule, MarkActive flipping IsActive(), Compile producing
monotonic snapshot ids.

* feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath

The reader feeds events through the engine's match functions to find
the active ActionKeys whose filter applies. The minimal Event shape
the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker,
IsMPUInit) keeps engine free of filer_pb dependencies; the reader
extracts these fields from the persisted *filer_pb.LogEntry payload
in Phase 3.

- MatchOriginalWrite: per-delay-group sweep entry. Filters on shape =
  EventShapeOriginalWrite, prefix, tag, size, then per-kind shape
  gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on
  IsLatest+IsDeleteMarker).
- MatchPredicateChange: single near-now sweep. Returns only the
  predicate-sensitive subset of active ActionKeys.
- MatchPath: bucket-level walker entry. Returns every active action
  whose filter matches; bootstrap iterates these per object and calls
  EvaluateAction per kind.

All filter on a.IsActive() at routing time so MarkActive flips become
visible without recompile.

* fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile

Three findings on the engine PR addressed:

1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind}
   only. Two buckets with rules whose XML is identical produce the same
   RuleHash; the second bucket's Compile would overwrite the first
   bucket's CompiledAction in snap.actions. Add Bucket to ActionKey
   so the engine's identity matches the on-disk path layout
   /etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression
   test pins it.

2. Major (immutability leak): OriginalDelayGroups, PredicateActions,
   DateActions returned the snapshot's internal maps/slices by
   reference, letting an external caller mutate routing state and
   break the documented immutability contract. Return defensive
   copies.

3. Minor (redundant condition): mode==EVENT_DRIVEN already implies
   kind != EXPIRATION_DATE because decideMode routes the date kind
   to SCAN_AT_DATE. Drop the redundant check.

Tests updated to construct ActionKey with the new Bucket field.

* fix(s3/lifecycle): drop size filters from rulePredicateSensitive

An object's size is immutable once written: any content change is a
fresh write that flows through the original-write stream, not the
predicate-change one. Tagging rules really can flip post-PUT
(operator adds/removes a tag without rewriting), so they belong; size
filters do not.

Including size filters here was adding rules to predicateActions for
no purpose — every predicate-change sweep would waste cycles
re-evaluating size predicates that physically can't have changed.

* perf(s3/lifecycle): pre-sort AllActions at Compile time

Snapshot is immutable after Compile (engineState bit-flips don't
change membership), so the (bucket, rule_hash, action_kind) ordering
is stable for the snapshot's lifetime. Build the sorted slice once
and serve every AllActions() call from it; drop the per-call
sort.Slice. The bootstrap walker is the primary caller and may
iterate this on every task entry.

* docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity

Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan>
explicitly excludes 0-byte objects, but with the int64 zero value as
the unset sentinel we can't distinguish that from omitted-and-default.
Document the limitation inline so a future deployment that needs the
distinction can switch to *int64 (or a paired set-bool) and update
the matchers / RuleHash accordingly. Not fixing now: the explicit-zero
configuration is unusual, the canonical Rule shape mirrors the same
zero-as-unset convention as s3api.Filter, and a structural fix
touches every filter-using site (evaluator, due_at, match, RuleHash).

* fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex *int

The previous int field had a zero-value collision: 0 is both "newest
non-current version" (a valid index) and "uninitialised by ObjectInfo{}
literal." A caller who built &ObjectInfo{IsLatest: false} without
explicitly setting NoncurrentIndex would have it implicitly read as
"newest non-current," and the count-based NewerNoncurrent retention
would use that bogus 0 to decide eligibility.

Switch to *int so nil is explicitly "not a non-current version /
index not yet computed." The evaluator's NoncurrentDays and
NewerNoncurrent paths conservatively return ActionNone when the
index is nil — the safety scan will revisit once the index is
supplied. This removes a class of latent footguns in test setup and
in any future code path that constructs ObjectInfo without a
versioning-aware builder.

idx() helper added in tests to keep the call sites a one-liner.

* refactor(s3/lifecycle): trim narration from engine + helpers

Drop "what" comments where well-named identifiers already say it
(IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph
"why" docs to one-liners where the design rationale is already in
the design doc. Keep WHY comments only at non-obvious load-bearing
spots: the routing-index activation predicate, the *int rationale on
NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay
horizon rule.

Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go,
min_trigger_age.go, event_log_horizon.go, engine/engine.go,
engine/compile.go, engine/match.go, engine/mode.go.

No behavior change; tests untouched and pass.

* fix(s3/lifecycle): durable PriorState.Mode wins over decideMode

PriorState.Mode was declared but never read; Compile recomputed mode
via decideMode and stored that on every CompiledAction. Effect: an
action durably persisted as SCAN_ONLY (lag fallback or operator
pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the
next engine rebuild as soon as decideMode's XML+retention predicate
said so. Defeats the durability of mode state.

Use prior.Mode when set; fall through to decideMode only for new
actions (no prior at all) and for legacy entries persisted before
Mode existed (zero value). Regression test pins both branches.

* fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key

MarkActive's documented contract was "flip visible without a
recompile," but the routing indexes (originalDelayGroups,
predicateActions) were only populated when active && mode ==
EVENT_DRIVEN at compile time. So a key compiled with
BootstrapComplete=false would never enter the indexes; a later
MarkActive flipped engineState but MatchOriginalWrite /
MatchPredicateChange iterated the indexes and never saw the key.
Only MatchPath (which walks bi.actionKeys) and DateActions worked.

Index every EVENT_DRIVEN key regardless of `active`. The runtime
IsActive() filter inside filterMatching already gates dispatch, so
inactive entries are matched-but-not-fired; flipping MarkActive
makes them routable without recompile, matching the documented
contract.

Tests updated: TestCompile_BootstrapPendingIndexedButInactive
asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable
asserts a MarkActive flip routes the next match.

* test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior

Two regression tests for the *int pointer migration: nil index
combined with NewerNoncurrent (either paired with NoncurrentDays or
standalone) must short-circuit to ActionNone rather than guess at
the version's position in the keep-N window.

* refactor(s3/lifecycle): trim follow-up narration on engine + helpers

Comments accumulated since the last sweep — the durable-Mode rationale,
the MarkActive routability note, the routing-index doc, the
NoncurrentIndex pointer rationale, and the EvaluateAction docblock.
Trimmed each to one or two terse lines; the underlying contracts live
in the design doc.

* docs(s3/lifecycle): note CompileInput one-per-bucket invariant
2026-05-07 15:00:49 -07:00