seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-23 02:01:32 +00:00

Author	SHA1	Message	Date
Chris Lu	b4289abb0a	admin: convert filer address to gRPC form before dispatch (#9523) The master returns each registered filer in pb.ServerAddress dual-port form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's plugin context builder forwarded that string verbatim as filer_grpc_address, so workers calling grpc.DialContext on it failed every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port". Run each entry through pb.ServerAddress.ToGrpcAddress before populating ClusterContext.FilerGrpcAddresses. The lifecycle integration test now pins filer.port.grpc to a value that breaks the FILER_PORT+10000 assumption, and a new dispatch test drives the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and asserts the dispatched job both reaches the filer and deletes the backdated object.	2026-05-17 11:33:54 -07:00
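
The address conversion described in b4289abb0a can be sketched roughly as follows. This is a minimal illustration and not the SeaweedFS code: `toGrpcAddress` is a hypothetical stand-in for pb.ServerAddress.ToGrpcAddress, and the httpPort+10000 fallback for single-port input is an assumption taken from the "FILER_PORT+10000" wording in the message.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// toGrpcAddress is a hypothetical helper, not the SeaweedFS implementation.
// It accepts "host:httpPort.grpcPort" (dual-port form) or "host:httpPort"
// and returns a "host:grpcPort" string that a gRPC dialer can resolve.
func toGrpcAddress(addr string) (string, error) {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return "", fmt.Errorf("address %q has no port", addr)
	}
	host, ports := addr[:colon], addr[colon+1:]

	// Dual-port form: everything after the dot is the explicit gRPC port.
	if dot := strings.Index(ports, "."); dot >= 0 {
		grpcPort := ports[dot+1:]
		if _, err := strconv.Atoi(grpcPort); err != nil {
			return "", fmt.Errorf("address %q has a non-numeric gRPC port", addr)
		}
		return host + ":" + grpcPort, nil
	}

	// Single-port form: assume the conventional httpPort+10000 offset.
	httpPort, err := strconv.Atoi(ports)
	if err != nil {
		return "", fmt.Errorf("address %q has a non-numeric port", addr)
	}
	return host + ":" + strconv.Itoa(httpPort+10000), nil
}

func main() {
	for _, addr := range []string{"10.0.0.1:8888.18888", "10.0.0.1:8888.19333", "10.0.0.1:8888"} {
		grpcAddr, err := toGrpcAddress(addr)
		fmt.Println(addr, "->", grpcAddr, err)
	}
}
```

The second input uses a gRPC port that does not equal httpPort+10000, which is the case the updated integration test is said to pin.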
Chris Lu	b1d59b04a8	fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477 ) * fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466: 1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output. 2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs. * fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path) runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. 
A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because: - ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty. - The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function. Move the walker invocation INTO the empty-replay branch so walker- only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests. * fix(s3/lifecycle): cold-start walker covers pre-existing objects runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151). 
* fix(s3/lifecycle): restore walker tag and null-version state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): parallelize shell shard sweeps Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard Two CI bugs surfaced after PR #9466 deleted the streaming worker: 1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx. 2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes. * fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log. 
* fix(s3/lifecycle): trust persisted cursor; never bump past pending events The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 00:19:05 -07:00
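
The last three fixes in b1d59b04a8 (per-pass timeout, cursor save with a fresh context, and trusting the persisted cursor) can be sketched together as below. All names here (`passTimeout`, `subscribeFromNs`, `isCleanPassEnd`, `saveCursor`) are hypothetical and chosen for illustration; only the cadence+5s cap, the 5s save timeout, and the runNow - maxTTL cold-start floor come from the commit message.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// passTimeout caps one pass at cadence+5s when a refresh cadence is set.
// A one-shot run (cadence == 0) keeps the full runtime.
func passTimeout(cadence, runtime time.Duration) time.Duration {
	if cadence > 0 {
		return cadence + 5*time.Second
	}
	return runtime
}

// subscribeFromNs picks the meta-log start position for a shard.
// Cold start (no persisted cursor) floors to runNow - maxTTL; steady state
// uses the persisted cursor verbatim so pending events are not skipped.
func subscribeFromNs(found bool, cursorTsNs, runNowNs int64, maxTTL time.Duration) int64 {
	if !found {
		return runNowNs - int64(maxTTL)
	}
	return cursorTsNs
}

// isCleanPassEnd reports whether a drain error is just the pass context
// ending, which is the normal exit for an idle subscription.
func isCleanPassEnd(err error) bool {
	return errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled)
}

// saveCursor persists with a context derived from context.Background, so an
// already-expired pass context cannot silently drop the cursor advance.
func saveCursor(persist func(ctx context.Context, tsNs int64) error, tsNs int64) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return persist(ctx, tsNs)
}

func main() {
	outer, cancelOuter := context.WithTimeout(context.Background(), time.Minute)
	defer cancelOuter()

	// Shortened here so the example finishes quickly.
	passCtx, cancelPass := context.WithTimeout(outer, 10*time.Millisecond)
	defer cancelPass()

	<-passCtx.Done() // stands in for an idle drain that only exits via ctx
	fmt.Println("clean end of pass:", isCleanPassEnd(passCtx.Err()))

	// The persist callback sees a live context even though passCtx is done.
	err := saveCursor(func(ctx context.Context, tsNs int64) error { return ctx.Err() }, 42)
	fmt.Println("save error:", err)
}
```

Because the pass context is derived from the outer one, the overall -runtime bound still applies even when cadence+5s would exceed it.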
Chris Lu	46bb70d93e	feat(s3): stamp noncurrent_since on versioned demotions (#9431 ) * feat(s3): stamp noncurrent_since on versioned demotions A version's noncurrent TTL clock starts when the next version is written, not at its own mtime. Today the lifecycle engine derives that moment from the next-newer sibling's mtime — a heuristic that drifts if the sibling is later modified and is unavailable when the demoting event sits outside meta-log retention. Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at the two places where a PUT flips the latest pointer: updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning. Timestamp source is time.Now().UnixNano() captured once per demotion — the documented Phase 1 fallback until the filer write API surfaces its own TsNs. Engine reads the stamp on both the bootstrap walker path and the event-driven router; missing/zero falls back to the legacy sibling-mtime derivation, so pre-stamp entries keep working. Prerequisite for the daily-replay lifecycle worker (Phase 2+). * fix(s3): address CI failure and PR review feedback - Backdating tests must move both clocks: the lifecycle integration tests backdate version mtimes to simulate aging, but my earlier commit made the engine prefer the explicit demotion stamp over sibling mtime, so a real-now stamp dominated a backdated mtime and the rule never fired. Update backdateVersionedMtime to also rewrite Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it. This is a test simplification — production stamps record when the successor was written, not the demoted version's own mtime — but the resulting clock is correctly old enough. - Refactor stamp parsing into one shared helper. Per gemini-code-assist: the parsing logic for ExtNoncurrentSinceNsKey was duplicated in router/router.go and scheduler/bootstrap.go. Move it to a new weed/s3api/s3lifecycle/noncurrent_since.go as exported SuccessorFromEntryStamp; both call sites now go through it. 
- Make the parser ordering test deterministic. Per coderabbitai: time.Now().UnixNano() drops the monotonic clock component, so two back-to-back calls can decrease if the wall clock steps backward — the prior test was exercising OS clock behavior rather than the parser. Replace with fixed nanosecond values. - Close a suspended-versioning race. Per coderabbitai: the prior putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning after putToFiler returned, i.e. after the object write lock released. A concurrent PUT could promote a newer latest version, which we'd then wipe — leaving the older "null" object incorrectly current. Move the cleanup into the afterCreate callback so the null write and the .versions pointer clear (including the new demotion stamp) run atomically under the same lock. Best-effort logging is preserved. * fix(s3/lifecycle): clear noncurrent_since stamp on test backdate Backdating a version's mtime in tests is not a coherent claim about when it became noncurrent — production stamps record the successor's PUT time, which the test doesn't manipulate. The prior commit rewrote the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent that creates an inconsistent state: v3's stamp says "demoted 30 days ago" while v4's mtime (the supposed demoter) is real-now. With both NewerNoncurrentVersions and NoncurrentDays in the same rule, the NoncurrentDays floor passes against the backdated stamp and the rank-based check then deletes v3 via the meta-log historical replay that misranks against current state. Clearing the stamp instead lets the lifecycle engine fall back to the sibling-mtime derivation the tests were originally written against: the legacy code path is preserved end-to-end while the new explicit- stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go and the bootstrap-walker integration in scheduler/bootstrap_test.go. 
The deeper interaction — historical meta-log replay ranking against current state inside routePointerTransitionExpand — is pre-existing and is no longer masked by the freshly-PUT successor's mtime once the stamp is read. Tracked separately; not blocking this PR. * fix(s3): stamp noncurrent_since before the .versions/ pointer flip The pointer-flip on the .versions/ directory emits a meta-log event that the lifecycle router consumes via routePointerTransition. The router then calls LookupVersion on the demoted version's id. With the prior ordering — pointer flip first, stamp second — the router could read the demoted entry before markVersionNoncurrent landed and fall back to the legacy sibling-mtime derivation. Versioned COPY is the clean break: the new latest version keeps the source object's mtime instead of recording the moment v_old was demoted, so the fallback's successor clock can be arbitrarily wrong. Reorder both updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning so the stamp is written first; the pointer flip then emits an event into a state where the stamp is already present. Failure of the stamp write remains non-fatal — lifecycle still falls back to the legacy derivation in that case, with the same caveats as before the PR but no race window.	2026-05-11 13:41:33 -07:00
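
The stamp-then-fallback read described in 46bb70d93e might look like the sketch below. It is illustrative only: `successorNs` is a hypothetical name (the commit's shared helper is SuccessorFromEntryStamp, whose signature is not shown here), and storing the stamp as a decimal string is an assumption. The extended-attribute key is the one named in the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// Key named in the commit message; the value encoding below is assumed.
const noncurrentSinceKey = "Seaweed-X-Amz-Noncurrent-Since-Ns"

// successorNs returns the moment (unix nanoseconds) a version became
// noncurrent. It prefers the explicit stamp and falls back to the
// next-newer sibling's mtime when the stamp is missing, zero, or malformed.
func successorNs(extended map[string][]byte, siblingMtimeNs int64) int64 {
	raw, ok := extended[noncurrentSinceKey]
	if !ok {
		return siblingMtimeNs // pre-stamp entry: legacy derivation
	}
	ns, err := strconv.ParseInt(string(raw), 10, 64)
	if err != nil || ns <= 0 {
		return siblingMtimeNs
	}
	return ns
}

func main() {
	stamped := map[string][]byte{noncurrentSinceKey: []byte("1700000000000000000")}
	fmt.Println(successorNs(stamped, 5)) // stamp wins
	fmt.Println(successorNs(nil, 5))     // no stamp: sibling mtime
}
```

Clearing the stamp, as the test-backdating fix does, drops an entry back onto the second branch.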
Chris Lu	c7b01c72b2	test(s3/lifecycle): integration coverage for versioning + filters (#9415 ) * test(s3/lifecycle): integration coverage for versioning + filters First integration-test bundle building on the existing single-test backdating harness. Each scenario follows the same shape: create bucket, set lifecycle, PUT object, backdate mtime via filer UpdateEntry, run the shell command for one shard sweep, assert S3-side state. Five new tests: - TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a versioned bucket must produce a delete marker (latest after worker runs is a marker) AND keep the original version directly addressable by versionId. ListObjectVersions confirms IsLatest=true on the marker. - TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent), backdate v1, run worker. v1 must be gone, v2 still current. - TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent + expired-delete-marker) cleans up a sole-survivor marker. PUT v1, DELETE (creates marker), backdate both, run worker. Every version AND marker must be gone for the key. - TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled must not produce dispatches even on a backdated match. Negative test for the engine's enabled-status gate. - TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches objects carrying the tag. Two backdated objects (one tagged, one not) — only the tagged one is removed. Helpers extracted to keep each test focused: putVersioningEnabled, putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle, backdateVersionedMtime (ages a specific .versions/v_<id> entry), runLifecycleShard (one-shot shell invocation with FATAL guard). 
* test(s3/lifecycle): tighten noncurrent expiration diagnostics Local run showed TestLifecycleNoncurrentVersionExpiration failing with a bare 404 on HEAD(latest), not enough to tell whether v2 was deleted, the bare-key pointer was removed, or a delete marker was synthesized. Strengthen the test to: - HEAD by versionId=v2 first, so we pin "v2 file still on disk" separately from "the latest pointer resolves to v2" - on HEAD(latest) failure, log ListObjectVersions output (versions + markers, with IsLatest) so the next failure shows which side the bug is on rather than just NotFound * test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload Exercises the lifecycleAbortMPU handler path that the prefix-based expiration tests can't reach — routing keys off of .uploads/<id>/ directory events, not regular object events, and the dispatcher uses a different RPC path (rm on the .uploads/<id>/ folder). Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1, CreateMultipartUpload, UploadPart (so the directory carries the right shape), backdate the .uploads/<uploadID>/ directory entry 30 days, run the worker. The upload must drop out of ListMultipartUploads. Helpers added: putAbortMPULifecycle, backdateUploadDir. * test(s3/lifecycle): integration coverage for NewerNoncurrentVersions NewerNoncurrentVersions=N keeps the N most recent noncurrent versions and expires the rest. Distinct from per-version NoncurrentDays — depends on per-version rank, not just per-version age — and routes through routePointerTransition's "needs full expansion" path. Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3 noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1 floor, run the worker. Expect v1+v2 expired (older noncurrent), v3 (newest noncurrent within keep=1) and v4 (current) preserved. Helper added: putNewerNoncurrentLifecycle. 
* test(s3/lifecycle): integration coverage for suspended-versioning Expiration Suspended versioning takes a distinct code path in lifecycleDispatch: the VersioningSuspended branch first deletes the null version (via deleteSpecificObjectVersion(versionId="null")) and then writes a fresh delete marker on top. Other branches (Enabled → only writes a marker; Off → straight rm) miss this two-step. Setup: enable versioning, PUT v1 (real versionId), suspend versioning, PUT again (creates the null version, demotes v1 to noncurrent), set the Expiration rule, backdate the null at the bare path. Expect: latest is now a fresh delete marker, the "null" version is gone from ListObjectVersions, and v1 (noncurrent under Enabled) still addressable directly — suspended Expiration must only touch the null, not other versions. Helper added: putVersioningSuspended. * test(s3/lifecycle): integration coverage for multi-bucket sweep A single shell-driven shard sweep must process every bucket carrying lifecycle config, not just the first one alphabetically. Pinned because the scheduler iterates the buckets directory and a regression that returns early after the first match would silently disable lifecycle for every later bucket. Two buckets, each with their own prefix-expiration rule and a backdated object. Both must be expired after the same sweep. * test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter ObjectSizeGreaterThan is a strict > gate (filterAllows uses ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the boundary: an object whose size equals the threshold must remain; only an object strictly larger expires. Catches a > vs >= flip. Two backdated objects on the same prefix, sizes 100 and 150 with threshold=100 — boundary survives, larger expires. * test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup Tests share one weed mini server. 
Two pollution modes were producing order-dependent failures: - A later test's shard sweep would still load the prior test's lifecycle config (the worker reads every bucket's XML from filer state, and DeleteBucket alone doesn't drop lifecycle config cleanly on this codebase). - Versioned-bucket tests left versions + delete markers behind that ListObjectsV2 can't see, so the existing best-effort empty-then- delete didn't actually empty those buckets. - The AbortMPU test intentionally leaves an in-flight upload; without an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty. Cleanup now runs DeleteBucketLifecycle, ListObjectVersions → DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload, then DeleteBucket. Best-effort throughout so a half-torn-down bucket doesn't fail the cleanup chain. * test(s3/lifecycle): backdate both versions for NoncurrentDays clock Per codex review: NoncurrentDays is clocked from the SUCCESSOR version's mtime (when the displaced version became noncurrent), not from the displaced version's own mtime. Backdating only v1 left the clock (v2's mtime) at "now" and the rule never fired — the test was wrong, not the production path. Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold relative to v2, the noncurrent rule fires, and v2 stays current. * test(s3/lifecycle): assert specific NotFound on multi-bucket deletion Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any HeadObject error as "deleted", which lets a transport failure or dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404 specifically via a small isS3NotFound helper so the assertion actually proves deletion happened, not just that the call broke. * test(s3/lifecycle): gofmt size-filter test * test(s3/lifecycle): integration coverage for Object Lock skip Object Lock retention must override the lifecycle rule. 
The handler's enforceObjectLockProtections check (s3api_internal_lifecycle.go:47) returns an error when retention is active; the dispatcher then classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays. No existing integration test reaches that outcome. Setup: bucket created with ObjectLockEnabledForBucket=true, expiration rule on prefix "lock/", two backdated objects under the same prefix — one with GOVERNANCE retention until 1h from now, one without. After the worker runs, the unlocked object expires (positive control); the locked one survives. Custom cleanup uses BypassGovernanceRetention so the test can drop the locked version when the test finishes — otherwise the retention window keeps the bucket from being deleted. * test(s3/lifecycle): integration coverage for config update between sweeps An operator changes the lifecycle rule between two shell-driven sweeps. The second sweep must respect the NEW rule, not a cached copy of the old one. Each runLifecycleShard invocation spawns a fresh weed shell subprocess, so cached engine state from a previous sweep doesn't persist — but a regression that caches rules across PutBucketLifecycleConfiguration calls within the S3 server itself would still surface here. Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker → firstKey expires. Update rule to prefix="second/", PUT + backdate secondKey AND a new key under the OLD prefix ("first/post-update.txt"). Sweep 2 must expire only the second-prefix object; the post-update old- prefix one must survive — config replacement, not merge. * test(s3/lifecycle): integration coverage for ExpirationDate (past) Rules with Expiration{Date: <past>} route through ScanAtDate in the engine (decideMode's ActionKindExpirationDate case) — a separate compile + dispatch branch from the EventDriven delay-group path the Days-based tests exercise. Past date + in-prefix object → must expire. Out-of-prefix object → must remain. 
Object also backdated as defense-in-depth so the assertion doesn't depend on whether the dispatcher consults MinTriggerAge for date kinds. * test(s3/lifecycle): integration coverage for bootstrap walk on existing objects Production scenario: operator enables lifecycle on a bucket that already holds objects from before the policy. The worker must discover them via the bootstrap walk (BucketBootstrapper) — there were no meta-log events to observe because the objects predate the rule. Without the bootstrap path, only NEW writes would ever match. Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix survivor, backdate all, THEN set the Expiration rule, run the worker. Every in-prefix pre-existing object must be expired; the out-of-prefix one must remain. * test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching Operator UX: after DeleteBucketLifecycle, the worker must observe the removal on the next sweep and stop expiring objects under the now-gone rule. A regression that caches old configs across PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep silently dropping objects. Setup: positive control (rule active, backdated obj expires) → DeleteBucketLifecycle → PUT + backdate a fresh object → second sweep. The fresh object must remain. * test(s3/lifecycle): integration coverage for empty bucket sweep no-op A bucket carrying lifecycle config but no objects must produce a successful sweep — no hangs, no errors, no dispatches. Pinned because the bootstrap walker iterates bucket directories, and an empty directory is a corner of that traversal that's easy to break (slice-bounds bug on the first listing returning zero entries). Asserts: worker logs "loaded lifecycle for" and "shards 0-15 complete", no FATAL output, bucket still exists after the sweep. 
* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate ObjectLock: enabling Object Lock on a bucket implicitly enables versioning, so PUT objects land at .versions/v_<id>, not at the bare key. The test was calling backdateMtime (bare path) and failing in the helper with "filer: no entry is found". Switch to backdateVersionedMtime with the versionId returned by PutObject. ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard shell command yet — the bootstrap walker explicitly skips actions in ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date- triggered bootstrap" but no such bootstrap exists in the scheduler or shell). Skip with a t.Skip + explanation so the test activates the moment the date-triggered path lands. * fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker The walker explicitly skipped ModeScanAtDate actions on the comment "SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such bootstrap exists in the scheduler or shell layer. The result: rules with Expiration{Date: ...} compiled correctly, populated the snapshot's dateActions map, and were never dispatched. ExpirationDate is silently a no-op in production. EvaluateAction already handles ActionKindExpirationDate correctly (rejects when now.Before(rule.ExpirationDate), otherwise emits ActionDeleteObject). The walker just needed to fall through instead of skipping. Pre-date walks become no-ops via EvaluateAction's date check; post-date walks expire eligible objects. Un-skip TestLifecycleExpirationDateInThePast — it now exercises the fixed path end-to-end. * test(s3/lifecycle): integration coverage for multiple rules per bucket A single bucket carries two independent Expiration rules with disjoint prefix filters and different Days thresholds. Each rule must fire only on its prefix; objects outside both prefixes must survive. 
Pinned because Compile builds one CompiledAction per rule per kind all sharing the same bucket index — a bug that lets one rule's prefix or threshold leak into another (e.g. last-write-wins on a shared map) would silently expire wrong objects. Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/ Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin, data/keep.bin. After the worker runs, logs/ + tmp/ are gone; data/ — outside both rule prefixes — survives. * fix(s3/lifecycle): mark ScanAtDate actions active in Compile Two layers were silently filtering ScanAtDate actions out of routing: the walker's mode skip (fixed in `e785f59d6`) and Compile only marking ModeEventDriven actions active. MatchPath / MatchOriginalWrite both require IsActive() to emit a key, so a ScanAtDate action that's never marked active never reaches a dispatch path even after the walker falls through. ScanAtDate's only dispatch path is the bootstrap walk's MatchPath call — there's no bootstrap-completion rendezvous to wait on. Make the active flag include ModeScanAtDate alongside the EventDriven+BootstrapComplete combination. ExpirationDate-based rules now actually fire end-to-end. The TestLifecycleExpirationDateInThePast integration test exercises this. * fix(s3/lifecycle): route date kinds via ComputeDueAt ExpirationDate has MinTriggerAge=0, so router computed dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's now.Before(rule.ExpirationDate) check returned ActionNone and the date rule never fired through the event-driven path. ComputeDueAt already knows the per-kind shape — rule.ExpirationDate for date kinds, ModTime+Days for the rest — so use it as the single source of truth for dueTime in Route's main loop. * test(s3/lifecycle): pin bootstrap walker date dispatch The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6 behavior that the regular walker skipped ExpirationDate. 
That walker was rewired to fire date rules whose date has passed (the SCAN_AT_DATE bootstrap was never wired); update the test to match. Split into two: post-date entries dispatch, pre-date entries don't. * test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup constructs a combined noncurrent + expired-marker rule inline, which the helper doesn't cover. The blank-assignment workaround was just hiding dead code; remove both. * test(s3/lifecycle): tighten HeadObject termination check to typed not-found Generic err != nil also passes on transport/auth/timeouts, letting the test go green without proving the lifecycle action actually fired. Switch the three Eventuallyf HeadObject predicates to isS3NotFound, matching the pattern already in the multi-bucket and expiration-date tests. * test(s3/lifecycle): guard ListObjectVersions diagnostic against nil When ListObjectVersions errors, listOut is nil and the diagnostic log path panics on listOut.Versions before the real assertion fires. Branch on (listErr != nil || listOut == nil) so the failure log is robust whatever ListObjectVersions returned.	2026-05-10 09:30:50 -07:00
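
The NewerNoncurrentVersions scenario covered in c7b01c72b2 (keep the N most recent noncurrent versions, expire the rest once they pass the NoncurrentDays floor) can be sketched as a rank-plus-age filter. The type and function names are hypothetical, and the sketch assumes the caller already supplies noncurrent versions ordered newest first with the current version excluded; it does not reflect how the engine's routePointerTransition path is implemented.

```go
package main

import "fmt"

// noncurrentVersion is a hypothetical, simplified view of one demoted version.
type noncurrentVersion struct {
	ID             string
	NoncurrentDays int // whole days since the version was demoted
}

// expiredNoncurrent returns the IDs to expire. versions must be ordered
// newest first and must not include the current version. The newest `keep`
// entries are retained by rank; the rest expire only once they have been
// noncurrent for at least minDays.
func expiredNoncurrent(versions []noncurrentVersion, keep, minDays int) []string {
	var expired []string
	for rank, v := range versions {
		if rank < keep {
			continue // within the keep-N window
		}
		if v.NoncurrentDays < minDays {
			continue // not old enough yet
		}
		expired = append(expired, v.ID)
	}
	return expired
}

func main() {
	// v4 is current and therefore absent; v1-v3 are backdated noncurrent versions.
	versions := []noncurrentVersion{{"v3", 30}, {"v2", 30}, {"v1", 30}}
	fmt.Println(expiredNoncurrent(versions, 1, 1)) // prints [v2 v1]
}
```

With keep=1 this matches the expectation in the test description: v1 and v2 expire, while v3 (newest noncurrent) and v4 (current) are preserved.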
Chris Lu	85abf3ca88	feat(shell): s3.lifecycle.run-shard + integration test (#9361 ) * feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume. Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence. Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket. * test(s3/lifecycle): integration test driving the run-shard shell command Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains. The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end. Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function. 
* fix(shell/s3.lifecycle.run-shard): address review findings - Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run. - Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell. - Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short. - Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong. * feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline. - reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing. - dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out. - shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount. Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop. 
* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error). * fix(s3/lifecycle): schedule from ModTime not eventTime A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch. * fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline. - weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves. - weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals). * test(s3/lifecycle): pass -runtime 10s to run-shard Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards. 
* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it. * fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted. * fix(s3/lifecycle): identity HeadFid via GetFileIdString Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY. * fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly. * fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done. * chore(test/s3/lifecycle): drop emoji from Makefile output * docs(test/s3/lifecycle): correct '-shards 0-15' wording * fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99. 
* fix(s3/lifecycle): bound drain + final-save with their own timeouts Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning. * fix(test/s3/lifecycle): drop self-killing pkill in stop-server The pkill pattern \"weed mini -dir=...\" is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own. * docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var	2026-05-08 09:59:10 -07:00

5 Commits