mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-28 12:41:15 +00:00
master
10 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
3f4cb6d2fb |
feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)
Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash), and
the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot is
a stub keyed on a package-level atomic that the worker startup wiring
populates.
Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.
No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.
API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.
30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split, RecoveryView
activating pre-BootstrapComplete actions, ReplayContentHash
partition-independence, PromotedHash sensitivity to promotion in either
direction, MaxEffectiveTTL aggregation.
Build + race-tests green.
* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics
Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):
1. Duplicated sort comparator between ReplayContentHash and
PromotedHash. Extract sortHashItems shared helper so the two hashes
use the same ordering by construction — if one drifted, the cursor
could see a spurious "rule changed" on a no-op snapshot rebuild.
2. Duplicated writeField/writeInt closures. Extract hashWriter struct
holding the sha256 running hash + lenbuf, with method helpers. Same
allocation profile (one Hash, one tiny stack buffer per helper); just
deduplicates ~20 lines.
3. shardID parameter on RulesForShard is unused. Per the design's open
question, every shard sees every rule today (shard filter runs at the
entry-iteration site, not view construction). Keep the parameter for
API stability — removing it now would force a breaking change when
bucket-shard ownership lands — and update the doc comment to explain
why it's reserved.
go build ./... clean; engine test suite green. |
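The shared-ordering and length-prefixed hashing idea from items 1 and 2 can be sketched roughly as below. This is an illustration only, not the engine's code: the `hashItem` field set, the `contentHash` function, and the big-endian 8-byte length prefix are assumptions made for the sketch; only the names `sortHashItems`, `hashWriter`, `writeField`, `writeInt`, and `lenbuf` come from the commit message.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"hash"
	"sort"
)

// hashItem is a hypothetical stand-in for the per-action fields the real
// hashes cover; the field set is illustrative only.
type hashItem struct {
	Bucket   string
	RuleHash string
	Kind     int64
}

// sortHashItems gives every hash the same deterministic ordering.
func sortHashItems(items []hashItem) {
	sort.Slice(items, func(i, j int) bool {
		a, b := items[i], items[j]
		if a.Bucket != b.Bucket {
			return a.Bucket < b.Bucket
		}
		if a.RuleHash != b.RuleHash {
			return a.RuleHash < b.RuleHash
		}
		return a.Kind < b.Kind
	})
}

// hashWriter holds the running SHA-256 state and a reusable length buffer.
type hashWriter struct {
	h      hash.Hash
	lenbuf [8]byte
}

func newHashWriter() *hashWriter { return &hashWriter{h: sha256.New()} }

// writeField writes a length prefix and then the bytes, so ("ab", "c") and
// ("a", "bc") feed different byte streams into the hash.
func (w *hashWriter) writeField(s string) {
	binary.BigEndian.PutUint64(w.lenbuf[:], uint64(len(s)))
	w.h.Write(w.lenbuf[:])
	w.h.Write([]byte(s))
}

// writeInt writes a fixed-width integer.
func (w *hashWriter) writeInt(v int64) {
	binary.BigEndian.PutUint64(w.lenbuf[:], uint64(v))
	w.h.Write(w.lenbuf[:])
}

func (w *hashWriter) sum() [32]byte {
	var out [32]byte
	copy(out[:], w.h.Sum(nil))
	return out
}

// contentHash (hypothetical name) sorts a copy of items and hashes each field.
func contentHash(items []hashItem) [32]byte {
	sorted := append([]hashItem(nil), items...)
	sortHashItems(sorted)
	w := newHashWriter()
	for _, it := range sorted {
		w.writeField(it.Bucket)
		w.writeField(it.RuleHash)
		w.writeInt(it.Kind)
	}
	return w.sum()
}

func main() {
	a := []hashItem{{"b1", "r1", 1}, {"b2", "r1", 2}}
	b := []hashItem{{"b2", "r1", 2}, {"b1", "r1", 1}}
	// Same items in a different input order hash identically.
	fmt.Println(contentHash(a) == contentHash(b))
}
```

Because both hashes would call the same sort helper, a no-op rebuild that happens to enumerate actions in a different order cannot change either digest.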
||
|
|
82648cca53 |
test(s3/lifecycle/engine): pin delay-group dedup across buckets (#9418)
Compile a 100-bucket × 5-rule snapshot where the five Days values include duplicates (1, 1, 7, 7, 30) and assert: - snap.actions has 500 entries — every (bucket, rule) compiles to its own ActionKey, no collapse. - snap.originalDelayGroups has exactly 3 entries — the routing index is keyed by Delay, so same-day rules across all buckets share a group. This is the property that lets the dispatcher index by delay group rather than per-rule. - Per-group key count = (rules with that day) × buckets, so every action is reachable from its group entry. |
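The counts asserted above follow from simple arithmetic, which a small sketch can reproduce. The `ActionKey` field names, the `buildIndexes` helper, and the 24h day are assumptions for illustration; the real Compile path is not shown in this log.

```go
package main

import (
	"fmt"
	"time"
)

// ActionKey mirrors the (bucket, rule, kind) identity described above; the
// field names are assumptions for this sketch.
type ActionKey struct {
	Bucket string
	Rule   string
	Kind   string
}

const day = 24 * time.Hour

// buildIndexes (hypothetical) compiles one action per (bucket, rule) and
// groups the resulting keys by delay.
func buildIndexes(buckets int, days []int) (map[ActionKey]time.Duration, map[time.Duration][]ActionKey) {
	actions := make(map[ActionKey]time.Duration)
	groups := make(map[time.Duration][]ActionKey)
	for b := 0; b < buckets; b++ {
		for r, d := range days {
			key := ActionKey{
				Bucket: fmt.Sprintf("bucket-%03d", b),
				Rule:   fmt.Sprintf("rule-%d", r),
				Kind:   "EXPIRATION_DAYS",
			}
			delay := time.Duration(d) * day
			actions[key] = delay
			groups[delay] = append(groups[delay], key)
		}
	}
	return actions, groups
}

func main() {
	actions, groups := buildIndexes(100, []int{1, 1, 7, 7, 30})
	fmt.Println(len(actions), len(groups)) // 500 3
	fmt.Println(len(groups[1*day]), len(groups[7*day]), len(groups[30*day])) // 200 200 100
}
```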
||
|
|
c7b01c72b2 |
test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters
First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.
Five new tests:
- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
versioned bucket must produce a delete marker (latest after worker
runs is a marker) AND keep the original version directly addressable
by versionId. ListObjectVersions confirms IsLatest=true on the
marker.
- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
backdate v1, run worker. v1 must be gone, v2 still current.
- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
DELETE (creates marker), backdate both, run worker. Every version
AND marker must be gone for the key.
- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
must not produce dispatches even on a backdated match. Negative
test for the engine's enabled-status gate.
- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
objects carrying the tag. Two backdated objects (one tagged, one
not) — only the tagged one is removed.
Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).
* test(s3/lifecycle): tighten noncurrent expiration diagnostics
Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:
- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
markers, with IsLatest) so the next failure shows which side the
bug is on rather than just NotFound
* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload
Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).
Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.
Helpers added: putAbortMPULifecycle, backdateUploadDir.
* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions
NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.
Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.
Helper added: putNewerNoncurrentLifecycle.
* test(s3/lifecycle): integration coverage for suspended-versioning Expiration
Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.
Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.
Helper added: putVersioningSuspended.
* test(s3/lifecycle): integration coverage for multi-bucket sweep
A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.
Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.
* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter
ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.
Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.
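The boundary behavior can be sketched as follows. The struct layout is an assumption; the strict comparisons and the zero-means-unset convention follow the descriptions in this log.

```go
package main

import "fmt"

// rule and event carry only the fields this sketch needs; the names follow
// the text above, but the struct layout is an assumption.
type rule struct {
	FilterSizeGreaterThan int64 // 0 means "not set"
	FilterSizeLessThan    int64 // 0 means "not set"
}

type event struct {
	Size int64
}

// filterAllows applies both size gates as strict inequalities.
func filterAllows(r rule, ev event) bool {
	if r.FilterSizeGreaterThan > 0 && ev.Size <= r.FilterSizeGreaterThan {
		return false
	}
	if r.FilterSizeLessThan > 0 && ev.Size >= r.FilterSizeLessThan {
		return false
	}
	return true
}

func main() {
	r := rule{FilterSizeGreaterThan: 100}
	// The boundary object (100 bytes) is rejected; the 150-byte one passes.
	fmt.Println(filterAllows(r, event{Size: 100}), filterAllows(r, event{Size: 150})) // false true
}
```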
* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup
Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:
- A later test's shard sweep would still load the prior test's
lifecycle config (the worker reads every bucket's XML from filer
state, and DeleteBucket alone doesn't drop lifecycle config
cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
ListObjectsV2 can't see, so the existing best-effort empty-then-
delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.
Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.
* test(s3/lifecycle): backdate both versions for NoncurrentDays clock
Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.
Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.
* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion
Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.
* test(s3/lifecycle): gofmt size-filter test
* test(s3/lifecycle): integration coverage for Object Lock skip
Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.
Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.
Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.
* test(s3/lifecycle): integration coverage for config update between sweeps
An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.
Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.
Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.
* test(s3/lifecycle): integration coverage for ExpirationDate (past)
Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.
Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.
* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects
Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.
Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.
* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching
Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.
Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.
* test(s3/lifecycle): integration coverage for empty bucket sweep no-op
A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).
Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.
* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate
ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.
ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.
* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker
The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.
EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.
Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.
* test(s3/lifecycle): integration coverage for multiple rules per bucket
A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.
Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.
Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.
* fix(s3/lifecycle): mark ScanAtDate actions active in Compile
Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in
|
||
|
|
b740e22e63 |
test(s3/lifecycle): bundle dispatcher + engine edge-case coverage (#9413)
* test(s3/lifecycle): bundle dispatcher + engine edge-case coverage
Two-package bundle covering uncovered branches in production code that
the existing happy-path tests don't reach. Dispatcher 58.1% → 60.2% and
engine 81.0% → 81.7% (engine lift modest because most branches were
already hit; the nil-rule defensive case is otherwise unreachable from a
Compile flow).
dispatcher (4 tests):
- FilerPersister.Load with nil Store errors with a "nil Store" message
rather than panicking at the Read call.
- FilerPersister.Save with nil Store same.
- FilerPersister.Load with a non-NotFound transport error wraps the
shard ID into the message AND keeps the underlying error recoverable
via errors.Is.
- FilerPersister.Load with successful empty []byte returns an empty
map, not a JSON-decode error — pinning that an existing-but-empty
cursor file is treated as "no entries".
- Tick initializes the retries map on first call without panic so a
freshly-constructed Dispatcher works.
- Tick with already-canceled ctx re-queues the popped Match, returns
zero, and never invokes the LifecycleDelete client — the Match must
not be lost across worker restart.
engine (4 tests):
- rulePredicateSensitive(nil) returns false rather than panicking on
the FilterTags dereference. The non-nil paths run through Compile, but
a defensive nil-rule arrival isn't reachable that way.
- rule with no FilterTags / empty FilterTags map returns false (the
check is len(FilterTags) > 0, so empty must classify as non-sensitive —
pinning catches a flipped >= comparison).
- rule with a populated FilterTags returns true.
* fix(s3/lifecycle): Tick must requeue every drained Match on shutdown
Per codex review on #9413: Tick called Schedule.Drain to pop ALL due
matches at once, then iterated. If ctx canceled mid-loop, only the
current Match was re-added — everything past that index was silently
lost across the worker restart. With N due matches, up to N-1 were
dropped.
Fix: on cancellation, re-add due[i:] (current + remaining) before
returning. Matches already dispatched (due[:i]) stay processed; the
schedule is left exactly as it would be if Drain had returned only the
dispatched prefix.
Strengthen the existing test to enqueue three due matches and assert
sched.Len()==3 after a pre-canceled Tick. Pre-fix the test would have
seen Len()==1 because only the first popped Match was re-added. |
||
|
|
ca95d33092 |
test(s3/lifecycle): bundle dispatcher + engine accessor coverage (#9410)
* test(s3/lifecycle): bundle dispatcher + engine accessor coverage
Two-package bundle covering pure helpers and snapshot read-side
accessors that the router and dispatcher reach for at runtime. None were
directly tested; regressions previously surfaced only as downstream
Tick / Match / Compile failures.
dispatcher (10 tests):
- keyOf: derives every retryKey field from the Match; equal Match
values produce equal keys (so the second dispatch hits the first's
retry counter); distinct VersionIDs and ActionKinds produce distinct
keys (so a noisy version can't starve a healthy one, and two kinds on
the same object don't share a budget).
- budget(): configured value when set; defaultRetryBudget when zero or
negative — pins the >0 guard against a flipped comparison.
- backoff(): same pattern as budget for RetryBackoff.
engine snapshot accessors (8 tests):
- OriginalDelayGroups exposes the compiled per-delay groups; rules with
multiple kinds at different cadences land in distinct entries;
scan-only actions don't leak into delay groups so the dispatcher
doesn't try to drive them event-driven.
- PredicateActions populated for tag-sensitive rules, empty for non-
tag-sensitive ones (so MatchPredicateChange doesn't route irrelevant
kinds).
- DateActions surfaces ExpirationDate verbatim for date kinds; empty
for non-date rules.
- MarkActive on an unknown key is a no-op (durable bootstrap-complete
write races a recompile that drops the rule; panic here would crash
the worker).
- MarkActive flips a fresh-no-prior-state action from inactive to
active.
- BucketActionKeys covers every kind RuleActionKinds reports.
* test(s3/lifecycle): strengthen snapshot accessor content assertions
Per gemini review on #9410: assertions previously only checked counts
and non-empty status. Verify the specific ActionKeys land where expected
so an indexing regression that produces the right number of items with
wrong kinds gets caught.
OriginalDelayGroups: each delay group's slice asserts.Contains the
specific (bucket, rule_hash, kind) ActionKey instead of just NotEmpty.
PredicateActions: assert.Contains the expected key instead of just
NotEmpty.
BucketActionKeys: every key.Bucket must equal the test bucket (catches
cross-bucket leak), and ElementsMatch pins kinds against
RuleActionKinds. |
||
|
|
0955d1aa08 |
test(s3/lifecycle): direct prefixMatches + filterAllows coverage (#9408)
Both helpers were exercised indirectly through MatchOriginalWrite / MatchPath; pinning them directly catches a regression at the helper level so a Match-test failure isn't the first signal of a broken filter. prefixMatches: empty prefix fast path; exact-prefix match; non-match rejection; path shorter than prefix. filterAllows: no-filter accepts any event; FilterSizeGreaterThan is strictly > (boundary value rejected); FilterSizeLessThan is strictly <; zero-size thresholds mean "not set" (must let any size through — a regression treating 0 as a real threshold would reject everything); required tag present accepts; missing key, empty tags map, wrong value, and missing-among-multiple all reject; size + tag filters are AND'd so either failing rejects. |
||
|
|
1aa55f5bf9 |
test(s3/lifecycle): direct decideMode + RuleMode.String coverage (#9405)
Compile tests cover decideMode indirectly; these direct tests pin every branch so a regression in the classifier itself can't slip behind a more elaborate Compile failure. Pinned: nil rule and Disabled status both → Disabled; ExpirationDate → ScanAtDate without consulting retention; metaLogRetention=0 means unbounded so any horizon → EventDriven; horizon within retention → EventDriven; horizon exceeding retention → ScanOnly; bootstrapLookback adds to horizon (not retention) so a near-threshold case is still gated; zero horizon (rule field unset) skips the gate. RuleMode.String must render the documented names for every variant; an unknown value collapses to "unspecified" rather than empty or panic. |
||
|
|
05d31a04b6 |
fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite
The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.
Three changes that work together:
- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
-runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
shell command runs the full pipeline (reader + scheduler + dispatcher) for the
duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
out for now — versioned-bucket expiration via the worker still needs its own
pass.
Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (
|
||
|
|
8425c42858 |
feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355)
feat(s3/lifecycle): event router + DueTime schedule Router consumes per-shard reader events, looks up matching ActionKeys via the engine's BucketActionKeys index, and emits Matches with DueTime = event_time + action.Delay. Evaluation runs at DueTime so the age gate passes for fresh events; the dispatcher's identity-CAS catches drift. Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor added to engine.Snapshot. |
||
|
|
7f2b20d577 |
feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348)
* feat(s3/lifecycle): XML lifecycle config to canonical Rule
LifecycleToCanonical takes a parsed *Lifecycle and returns
[]*s3lifecycle.Rule, the flat shape the engine compiles against.
Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags +
size filters) flatten into the canonical Rule's individual fields;
single <Tag> filter populates FilterTags with one entry; <Prefix>
filter takes precedence over the rule's top-level <Prefix>.
Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on
the same XML <Rule>) populate every action field they declare.
RuleActionKinds expands the canonical rule into its compiled actions
downstream.
* feat(s3/lifecycle): engine snapshot skeleton + ActionKey type
Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's
primary identity, and adds the engine package's Snapshot type.
Snapshot is immutable after Compile (atomic-swapped on rebuild) and
holds the ActionKey-keyed routing indexes:
- originalDelayGroups: map[time.Duration][]ActionKey
- predicateActions: []ActionKey
- dateActions: map[ActionKey]time.Time
- actions: map[ActionKey]*CompiledAction
CompiledAction.engineState is an atomic.Uint32 so MarkActive (called
after the durable bootstrap_complete + mode write commits) is visible
to in-flight reader passes without a recompile. The reader filters on
IsActive() before dispatching, so stale-snapshot dispatches are
prevented.
No callers yet; downstream commits add Compile, decideMode, and the
Match functions.
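The atomic activation flag described above might look roughly like this. The state constants and the `dispatchable` helper are assumptions for the sketch; `engineState`, `MarkActive`, and `IsActive` are the names given in the commit message.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// State values are assumptions; the log only says engineState is an atomic.Uint32.
const (
	stateInactive uint32 = iota
	stateActive
)

type CompiledAction struct {
	engineState atomic.Uint32 // zero value is stateInactive
}

func (a *CompiledAction) MarkActive()    { a.engineState.Store(stateActive) }
func (a *CompiledAction) IsActive() bool { return a.engineState.Load() == stateActive }

// dispatchable (hypothetical) filters on IsActive at read time, so a
// MarkActive that lands after the snapshot was built is still observed.
func dispatchable(actions []*CompiledAction) int {
	n := 0
	for _, a := range actions {
		if a.IsActive() {
			n++
		}
	}
	return n
}

func main() {
	actions := []*CompiledAction{{}, {}}
	fmt.Println(dispatchable(actions)) // 0
	actions[0].MarkActive()
	fmt.Println(dispatchable(actions)) // 1
}
```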
* feat(s3/lifecycle): decideMode + retention gate
decideMode picks the scheduling mode for one (rule, kind) compiled
action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE;
reader-driven kind whose eventLogHorizon + bootstrapLookbackMin
exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The
gate runs per (rule, kind), so a 90d ExpirationDays sibling can
degrade to scan_only while its 7d AbortMPU sibling stays active.
MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS
default (Phase 0 verified that meta-log files are written without
TtlSec by default), so the gate doesn't trip until an operator opts
in to volume-TTL pruning of /topics/.system/log/.
RuleMode is a Go-level enum here, separate from the wire-form
LifecycleState.RuleMode in the proto package; the worker maps between
them when reading/writing the durable state file.
* feat(s3/lifecycle): Compile builds the engine snapshot per-action
Compile produces a fresh Snapshot from per-bucket canonical rules.
Each input rule expands into N CompiledActions via RuleActionKinds;
mode comes from decideMode; activation requires both
bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN.
Routing indexes are populated by mode:
- SCAN_AT_DATE: always indexed in dateActions (detector schedules at
rule.date regardless of bootstrap status; the action runs once on
the date and is then done).
- EVENT_DRIVEN + active: indexed in originalDelayGroups (and in
predicateActions when the rule has tag/size filters).
- SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan
tick or operator action handle these.
snapshot_id is monotonic per process; pending writes stamp it. The
new snapshot replaces the engine's atomic pointer; in-flight reader
passes continue against their loaded snapshot.
Tests cover: single-action rule, multi-action expansion (one rule ->
three CompiledActions with three distinct delay groups), pending
bootstrap exclusion from indexes, retention gate, sibling actions
degrading independently under partial retention, ExpirationDate path,
disabled rule, MarkActive flipping IsActive(), Compile producing
monotonic snapshot ids.
* feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath
The reader feeds events through the engine's match functions to find
the active ActionKeys whose filter applies. The minimal Event shape
the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker,
IsMPUInit) keeps engine free of filer_pb dependencies; the reader
extracts these fields from the persisted *filer_pb.LogEntry payload
in Phase 3.
- MatchOriginalWrite: per-delay-group sweep entry. Filters on shape =
EventShapeOriginalWrite, prefix, tag, size, then per-kind shape
gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on
IsLatest+IsDeleteMarker).
- MatchPredicateChange: single near-now sweep. Returns only the
predicate-sensitive subset of active ActionKeys.
- MatchPath: bucket-level walker entry. Returns every active action
whose filter matches; bootstrap iterates these per object and calls
EvaluateAction per kind.
All filter on a.IsActive() at routing time so MarkActive flips become
visible without recompile.
* fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile
Three findings on the engine PR addressed:
1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind}
only. Two buckets with rules whose XML is identical produce the same
RuleHash; the second bucket's Compile would overwrite the first
bucket's CompiledAction in snap.actions. Add Bucket to ActionKey
so the engine's identity matches the on-disk path layout
/etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression
test pins it.
2. Major (immutability leak): OriginalDelayGroups, PredicateActions,
DateActions returned the snapshot's internal maps/slices by
reference, letting an external caller mutate routing state and
break the documented immutability contract. Return defensive
copies.
3. Minor (redundant condition): mode==EVENT_DRIVEN already implies
kind != EXPIRATION_DATE because decideMode routes the date kind
to SCAN_AT_DATE. Drop the redundant check.
Tests updated to construct ActionKey with the new Bucket field.
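The cross-bucket collision in finding 1 can be reproduced with a plain Go map. This is a sketch under assumptions: the [32]byte hash width, the string-typed ActionKind, and the compileOld / compileScoped helpers are illustrative, not the engine's real types.

```go
package main

import "fmt"

// oldKey is the pre-fix identity: no bucket component.
type oldKey struct {
	RuleHash   [32]byte
	ActionKind string
}

// ActionKey is the post-fix identity, scoped by bucket.
type ActionKey struct {
	Bucket     string
	RuleHash   [32]byte
	ActionKind string
}

// compileOld registers one action per bucket under the unscoped key.
func compileOld(buckets []string, h [32]byte, kind string) map[oldKey]string {
	m := make(map[oldKey]string)
	for _, b := range buckets {
		m[oldKey{RuleHash: h, ActionKind: kind}] = b // later bucket overwrites
	}
	return m
}

// compileScoped registers one action per bucket under the scoped key.
func compileScoped(buckets []string, h [32]byte, kind string) map[ActionKey]string {
	m := make(map[ActionKey]string)
	for _, b := range buckets {
		m[ActionKey{Bucket: b, RuleHash: h, ActionKind: kind}] = b
	}
	return m
}

func main() {
	h := [32]byte{0xab} // identical rule XML yields an identical hash
	buckets := []string{"logs-a", "logs-b"}
	fmt.Println(len(compileOld(buckets, h, "ABORT_MPU")))    // 1
	fmt.Println(len(compileScoped(buckets, h, "ABORT_MPU"))) // 2
}
```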
* fix(s3/lifecycle): drop size filters from rulePredicateSensitive
An object's size is immutable once written: any content change is a
fresh write that flows through the original-write stream, not the
predicate-change one. Tagging rules really can flip post-PUT
(operator adds/removes a tag without rewriting), so they belong; size
filters do not.
Including size filters here was adding rules to predicateActions for
no purpose — every predicate-change sweep would waste cycles
re-evaluating size predicates that physically can't have changed.
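As a rough illustration of the resulting predicate: only filters that can change without a rewrite make a rule predicate-sensitive. The Rule fields below (FilterTags, FilterSizeLessThan) are assumed names; only FilterSizeGreaterThan and rulePredicateSensitive are mentioned in this log.

```go
package main

import "fmt"

// Rule is a hypothetical, trimmed-down canonical rule shape.
type Rule struct {
	Prefix                string
	FilterTags            map[string]string
	FilterSizeGreaterThan int64
	FilterSizeLessThan    int64
}

// rulePredicateSensitive reports whether a metadata-only change can flip
// the rule's filter. Tags can change post-PUT; size cannot, so size
// filters are deliberately ignored here.
func rulePredicateSensitive(r Rule) bool {
	return len(r.FilterTags) > 0
}

func main() {
	sizeOnly := Rule{Prefix: "logs/", FilterSizeGreaterThan: 1 << 20}
	tagged := Rule{Prefix: "logs/", FilterTags: map[string]string{"tier": "cold"}}
	fmt.Println(rulePredicateSensitive(sizeOnly)) // false
	fmt.Println(rulePredicateSensitive(tagged))   // true
}
```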
* perf(s3/lifecycle): pre-sort AllActions at Compile time
Snapshot is immutable after Compile (engineState bit-flips don't
change membership), so the (bucket, rule_hash, action_kind) ordering
is stable for the snapshot's lifetime. Build the sorted slice once
and serve every AllActions() call from it; drop the per-call
sort.Slice. The bootstrap walker is the primary caller and may
iterate this on every task entry.
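The sort-once pattern might look like the sketch below. The Snapshot layout and string-typed key fields are assumptions. Returning a copy from AllActions is also an assumption, chosen here to stay consistent with the defensive-copy fix earlier in this log.

```go
package main

import (
	"fmt"
	"sort"
)

type ActionKey struct {
	Bucket, RuleHash, ActionKind string
}

type Snapshot struct {
	sorted []ActionKey // built once in Compile, never mutated afterwards
}

// less orders by (bucket, rule_hash, action_kind).
func less(a, b ActionKey) bool {
	if a.Bucket != b.Bucket {
		return a.Bucket < b.Bucket
	}
	if a.RuleHash != b.RuleHash {
		return a.RuleHash < b.RuleHash
	}
	return a.ActionKind < b.ActionKind
}

// Compile pays the O(n log n) sort exactly once per snapshot.
func Compile(keys []ActionKey) *Snapshot {
	s := &Snapshot{sorted: append([]ActionKey(nil), keys...)}
	sort.Slice(s.sorted, func(i, j int) bool { return less(s.sorted[i], s.sorted[j]) })
	return s
}

// AllActions serves the pre-sorted order without re-sorting.
func (s *Snapshot) AllActions() []ActionKey {
	return append([]ActionKey(nil), s.sorted...)
}

func main() {
	s := Compile([]ActionKey{{"b", "h2", "X"}, {"a", "h9", "X"}, {"b", "h1", "X"}})
	fmt.Println(s.AllActions())
}
```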
* docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity
Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan>
explicitly excludes 0-byte objects, but with the int64 zero value as
the unset sentinel we can't distinguish that from omitted-and-default.
Document the limitation inline so a future deployment that needs the
distinction can switch to *int64 (or a paired set-bool) and update
the matchers / RuleHash accordingly. Not fixing now: the explicit-zero
configuration is unusual, the canonical Rule shape mirrors the same
zero-as-unset convention as s3api.Filter, and a structural fix
touches every filter-using site (evaluator, due_at, match, RuleHash).
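To make the ambiguity concrete, the sketch below contrasts the zero-as-unset convention with the *int64 alternative the note suggests. Both filter types and their Matches methods are hypothetical stand-ins, not the engine's matchers.

```go
package main

import "fmt"

// Filter uses the int64 zero value as the "unset" sentinel.
type Filter struct {
	SizeGreaterThan int64
}

// PtrFilter distinguishes "omitted" (nil) from "explicit 0".
type PtrFilter struct {
	SizeGreaterThan *int64
}

func (f Filter) Matches(size int64) bool {
	if f.SizeGreaterThan == 0 {
		return true // unset OR explicit 0: indistinguishable here
	}
	return size > f.SizeGreaterThan
}

func (f PtrFilter) Matches(size int64) bool {
	if f.SizeGreaterThan == nil {
		return true // genuinely omitted
	}
	return size > *f.SizeGreaterThan
}

func main() {
	zero := int64(0)
	// A 0-byte object under an explicit <ObjectSizeGreaterThan>0:
	fmt.Println(Filter{SizeGreaterThan: 0}.Matches(0))        // true (limitation)
	fmt.Println(PtrFilter{SizeGreaterThan: &zero}.Matches(0)) // false (spec behavior)
}
```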
* fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex *int
The previous int field had a zero-value collision: 0 is both "newest
non-current version" (a valid index) and "uninitialised by ObjectInfo{}
literal." A caller who built &ObjectInfo{IsLatest: false} without
explicitly setting NoncurrentIndex would have it implicitly read as
"newest non-current," and the count-based NewerNoncurrent retention
would use that bogus 0 to decide eligibility.
Switch to *int so nil is explicitly "not a non-current version /
index not yet computed." The evaluator's NoncurrentDays and
NewerNoncurrent paths conservatively return ActionNone when the
index is nil — the safety scan will revisit once the index is
supplied. This removes a class of latent footguns in test setup and
in any future code path that constructs ObjectInfo without a
versioning-aware builder.
idx() helper added in tests to keep the call sites a one-liner.
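A sketch of the nil-means-unknown behavior and the test helper. The idx signature, the ActionDeleteVersion constant, and evalNewerNoncurrent are assumptions; only ObjectInfo, NoncurrentIndex, and ActionNone are named in this log.

```go
package main

import "fmt"

type Action int

const (
	ActionNone Action = iota
	ActionDeleteVersion // hypothetical name for the "expire this version" result
)

type ObjectInfo struct {
	IsLatest        bool
	NoncurrentIndex *int // nil: not a non-current version, or index not yet computed
}

// idx keeps test call sites a one-liner: ObjectInfo{NoncurrentIndex: idx(2)}.
func idx(i int) *int { return &i }

// evalNewerNoncurrent keeps the `keep` newest non-current versions
// (index 0 is the newest) and conservatively no-ops on a nil index.
func evalNewerNoncurrent(o ObjectInfo, keep int) Action {
	if o.IsLatest || o.NoncurrentIndex == nil {
		return ActionNone
	}
	if *o.NoncurrentIndex >= keep {
		return ActionDeleteVersion
	}
	return ActionNone
}

func main() {
	fmt.Println(evalNewerNoncurrent(ObjectInfo{}, 1) == ActionNone)                                 // true
	fmt.Println(evalNewerNoncurrent(ObjectInfo{NoncurrentIndex: idx(0)}, 1) == ActionNone)          // true
	fmt.Println(evalNewerNoncurrent(ObjectInfo{NoncurrentIndex: idx(1)}, 1) == ActionDeleteVersion) // true
}
```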
* refactor(s3/lifecycle): trim narration from engine + helpers
Drop "what" comments where well-named identifiers already say it
(IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph
"why" docs to one-liners where the design rationale is already in
the design doc. Keep WHY comments only at non-obvious load-bearing
spots: the routing-index activation predicate, the *int rationale on
NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay
horizon rule.
Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go,
min_trigger_age.go, event_log_horizon.go, engine/engine.go,
engine/compile.go, engine/match.go, engine/mode.go.
No behavior change; tests untouched and pass.
* fix(s3/lifecycle): durable PriorState.Mode wins over decideMode
PriorState.Mode was declared but never read; Compile recomputed mode
via decideMode and stored that on every CompiledAction. Effect: an
action durably persisted as SCAN_ONLY (lag fallback or operator
pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the
next engine rebuild as soon as decideMode's XML+retention predicate
said so. This defeats the durability of mode state.
Use prior.Mode when set; fall through to decideMode only for new
actions (no prior at all) and for legacy entries persisted before
Mode existed (zero value). Regression test pins both branches.
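The precedence rule reduces to a small selection function. In the sketch below, resolveMode and ModeUnset are assumed names, and the already-computed decideMode result is passed in as a plain argument to keep the example self-contained.

```go
package main

import "fmt"

type Mode int

const (
	ModeUnset Mode = iota // zero value: legacy entry persisted before Mode existed
	ModeEventDriven
	ModeScanOnly
	ModeDisabled
)

type PriorState struct {
	Mode Mode
}

// resolveMode lets a durably persisted mode win; the freshly decided
// mode applies only to new actions (nil prior) and legacy zero values.
func resolveMode(prior *PriorState, decided Mode) Mode {
	if prior != nil && prior.Mode != ModeUnset {
		return prior.Mode
	}
	return decided
}

func main() {
	fmt.Println(resolveMode(&PriorState{Mode: ModeScanOnly}, ModeEventDriven) == ModeScanOnly) // true
	fmt.Println(resolveMode(nil, ModeEventDriven) == ModeEventDriven)                          // true
	fmt.Println(resolveMode(&PriorState{}, ModeEventDriven) == ModeEventDriven)                // true
}
```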
* fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key
MarkActive's documented contract was "flip visible without a
recompile," but the routing indexes (originalDelayGroups,
predicateActions) were only populated when active && mode ==
EVENT_DRIVEN at compile time. So a key compiled with
BootstrapComplete=false would never enter the indexes; a later
MarkActive flipped engineState but MatchOriginalWrite /
MatchPredicateChange iterated the indexes and never saw the key.
Only MatchPath (which walks bi.actionKeys) and DateActions worked.
Index every EVENT_DRIVEN key regardless of `active`. The runtime
IsActive() filter inside filterMatching already gates dispatch, so
inactive entries are matched-but-not-fired; flipping MarkActive
makes them routable without recompile, matching the documented
contract.
Tests updated: TestCompile_BootstrapPendingIndexedButInactive
asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable
asserts a MarkActive flip routes the next match.
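The indexed-but-inactive shape can be sketched as follows. The types are simplified assumptions: the real engine keys MarkActive by ActionKey and keeps separate routing indexes, whereas this sketch uses a single slice and a method on the action itself.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type CompiledAction struct {
	Name   string
	active atomic.Bool
}

func (a *CompiledAction) IsActive() bool { return a.active.Load() }
func (a *CompiledAction) MarkActive()    { a.active.Store(true) }

type Snapshot struct {
	index []*CompiledAction // every EVENT_DRIVEN action, active or not
}

// Compile indexes each action regardless of its initial active state.
func Compile(names []string, bootstrapComplete bool) *Snapshot {
	s := &Snapshot{}
	for _, n := range names {
		a := &CompiledAction{Name: n}
		a.active.Store(bootstrapComplete)
		s.index = append(s.index, a)
	}
	return s
}

// Match gates dispatch at routing time, so a later MarkActive is
// visible without rebuilding the index.
func (s *Snapshot) Match() []string {
	var out []string
	for _, a := range s.index {
		if a.IsActive() {
			out = append(out, a.Name)
		}
	}
	return out
}

func main() {
	s := Compile([]string{"expire-logs"}, false)
	fmt.Println(len(s.Match())) // 0
	s.index[0].MarkActive()
	fmt.Println(s.Match()) // [expire-logs]
}
```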
* test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior
Two regression tests for the *int pointer migration: nil index
combined with NewerNoncurrent (either paired with NoncurrentDays or
standalone) must short-circuit to ActionNone rather than guess at
the version's position in the keep-N window.
* refactor(s3/lifecycle): trim follow-up narration on engine + helpers
Comments accumulated since the last sweep — the durable-Mode rationale,
the MarkActive routability note, the routing-index doc, the
NoncurrentIndex pointer rationale, and the EvaluateAction docblock.
Trimmed each to one or two terse lines; the underlying contracts live
in the design doc.
* docs(s3/lifecycle): note CompileInput one-per-bucket invariant
|