seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-31 14:06:20 +00:00

Author	SHA1	Message	Date
Chris Lu	2a4923e7e8	ObjectTransaction: filer-side forwarding via route_key (#9659 ) A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.	2026-05-24 14:21:06 -07:00
Chris Lu	1f0c366583	s3: route metadata-only self-copy off the distributed lock (#9638 ) A non-versioned metadata-only self-copy (CopyObject with source == destination and the REPLACE directive) is a read-modify-write of one entry, which is why it held the distributed lock. It now routes to the owner as a serialized PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements, delete the dropped keys) onto a fresh read of the entry under its per-path lock, so a concurrent change to non-managed keys (legal hold, retention, version id) is preserved instead of clobbered, and bumps mtime. PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended self-copies create a new version (already routed via the copy finalize) and the no-owner bootstrap keep the lock.	2026-05-24 12:32:57 -07:00
Chris Lu	fa7056dc6f	s3: route object-lock version-specific deletes off the distributed lock (#9657 ) A version-specific DELETE (real version or the null version, including object-lock WORM-checked ones and governance-bypass) now runs as one routed transaction on the object's owner instead of holding the distributed lock. For a real version: recompute the .versions pointer excluding the version (repoint-before-delete, so a crash leaves a recoverable orphan rather than a dangling pointer), then delete the version file, under the object's per-path lock. The null version is the regular object entry, deleted directly (no pointer). Object-lock buckets gate the delete on the version's WORM guards evaluated on the owner: legal hold (always) + retention (while not elapsed). Governance bypass scopes the retention guard to COMPLIANCE mode, so the filer allows a governance-mode delete while still denying compliance and legal hold — the gateway never reads the version. Three primitives make this expressible: - ObjectTransaction.condition_key: evaluate the condition against a named entry (the version) while the lock stays on lock_key (the object). - Recompute.exclude_name: omit a child from the scan, to repoint before delete. - WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a mode, expressing governance bypass without a gateway-side read.	2026-05-24 11:41:08 -07:00
Chris Lu	eeda7181aa	s3: route multipart-upload completion off the distributed lock (#9632 ) completeMultipartUpload routes its writes to the object's owner filer when an owner is known, off the distributed lock. Idempotent replay is handled gateway-side in prepareMultipartCompletionState (it returns the existing result when the object already carries this UploadId), so the lock is not needed to dedupe retries; with no owner yet, the lock remains as the bootstrap path. Versioned completion flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via routedMkFile (a routed PUT) so the write serializes with concurrent writes to the same key on the owner's per-path lock. The version file itself is a unique path and stays a plain mkFile.	2026-05-24 11:07:39 -07:00
Chris Lu	4b9d46b5ad	s3: route versioned COPY and delete-marker off the DLM (#9633 ) s3: route versioned/suspended delete markers and versioned COPY off the lock createDeleteMarker flips the .versions pointer via routedVersionedFinalize (RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or Suspended DeleteObject takes its pointer flip off the distributed lock; the delete marker file is written first and the owner re-derives the pointer. DeleteObjectHandler routes a versioned/suspended delete with no specific version straight to the owner, off the lock. A specific-version delete and object-lock buckets keep the lock (the former needs a recompute-after-delete handled separately; the latter needs gateway-side enforcement). CopyObject into a versioned bucket finalizes the new version through the same routed pointer flip.	2026-05-24 07:22:27 -07:00
Chris Lu	5bac8b9281	s3: route object-lock object writes off the distributed lock (#9635 ) routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT (which creates a new version, never overwriting a locked one) and a non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the owner filer like any other write. routedObjectOwner still excludes object-lock: an unversioned object-lock delete enforces WORM under the lock, so it stays there rather than routing past the check. Version-specific deletes likewise stay on the lock — routing them needs the WORM check (on the version entry) and the latest-pointer recompute (on the object) under one transaction, which the current single condition target cannot express.	2026-05-24 07:20:44 -07:00
Chris Lu	db954b5503	s3: route versioned PutObject finalize off the DLM (#9631 ) s3: route versioned PutObject finalize off the distributed lock A versioned write's finalize (flip the .versions pointer to the newest version, demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction on the object's owner filer, under its per-path lock, instead of the unserialized updateLatestVersionInDirectory. The version file is written first; the owner re-derives the pointer by scanning the directory. RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's size and mtime on the pointer, and demote_key / demote_value to stamp the displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves. Falls back to updateLatestVersionInDirectory when no owner is known yet.	2026-05-24 03:10:30 -07:00
Chris Lu	32aa70ab59	s3: serialize bucket config writes with field-level filer patches (#9655 ) PutBucketVersioning and PutBucketEncryption ran concurrently each did a whole-entry read-modify-write of the bucket entry, so one could overwrite the other's field with a stale copy. Each config write is now a field-level PATCH_EXTENDED (extended attributes) or set_content (the metadata blob) ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh read under its per-path lock. Disjoint fields no longer clobber each other.	2026-05-24 02:30:26 -07:00
Chris Lu	f9bc6adf98	s3: route single-entry object writes to the owner filer, off the DLM (#9629 ) s3: route non-versioned object PUT and DELETE off the distributed lock A non-versioned, non-object-lock object write now goes straight to the key's owner filer as a single-mutation ObjectTransaction, which serializes it with the owner's per-path lock and evaluates the precondition, instead of taking a cluster-wide lock. PUT and DELETE use the object's full path as the lock key, so a concurrent create and delete of the same key serialize against each other. The fast path is taken only when the precondition reduces to clauses the filer can evaluate (existence and a single strong-ETag match); time-based conditions, ETag lists, weak ETags, post-create hooks, and an unknown owner fall back to the lock. A routed mutation error other than a failed precondition also falls back, so the lock path stays the authority for the cases it alone covers. PrimaryForKey returns "" until the ring view arrives, keeping writes on the lock until routing is known.	2026-05-24 02:10:32 -07:00
Chris Lu	f037fc4dce	s3: dial the object lock's primary filer directly (#9626 ) * s3: dial the object lock's primary filer directly The S3 object write lock builds a fresh short-lived lock per write, each starting at the seed filer. When the seed isn't the key's hash-ring primary the filer forwards the request to the primary, and in multi-cluster setups that forward crosses clusters on every write. Give the lock client a view of the filer lock ring, fed by the master's LockRingUpdate broadcasts the gateway already receives, so it dials the primary directly. The view tracks filer membership by version; a stale view stays correct because the filer still forwards as a fallback. Also send the initial ring snapshot to S3 clients, not just filers. * s3: subscribe to lock-ring updates before starting the master loop The master delivers the initial LockRingUpdate once, on connect. Registering the callback after KeepConnectedToMaster started left a window where that first update could arrive before the handler was set and be dropped, delaying the ring view until the next membership change. Build the lock client and register the callback in the masters block before launching the loop; the filers block reuses that client (or creates a plain one when no masters are configured). * lock_manager: build the hash ring in a deterministic server order rebuildRing ranged over the server set (a map), whose iteration order is randomized per process. On a vnode hash collision the last writer into vnodeToServer wins, so two nodes holding the same server set could resolve the collision to different servers and disagree on the primary for keys near that slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement would route the same key to different filers and defeat per-path serialization. Iterate the servers in sorted order so the ring is identical on every node with the same set, regardless of discovery order. * lock_manager: skip redundant ring rebuilds, trim comments SetRing now ignores a non-zero version at or below the current one once a ring exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the ring. * s3: hold the lock-ring client on the server for route-by-key Store the object-write lock client on S3ApiServer so handlers can resolve a key's owner filer via PrimaryForKey.	2026-05-24 00:40:43 -07:00
Aleksey	917a87928c	fix(s3api/list): cancel ListEntries stream in hasChildren (#9617 ) * fix(s3api/list): cancel ListEntries stream in hasChildren * fix(s3api): use filer_pb.List in hasChildren filer_pb.List already wraps the ListEntries stream in a cancellable context, so the single-entry probe needs no separate helper or manual context plumbing to avoid the leaked gRPC stream goroutine. * fix(s3api): propagate request context into hasChildren Thread r.Context() through listFilerEntries and hasChildren so the implicit-directory probe cancels when the client disconnects, instead of running on context.Background(). --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-21 15:48:47 -07:00
Chris Lu	fbdcec1cba	fix(s3): list empty directories as directory markers (#9615 ) * fix(s3): list empty directories as directory markers A real but empty directory created out of band (mount, mkdir, filer API) carries no MIME, so it was hidden from S3 listings. hadoop-aws getFileStatus probes LIST prefix=dir/ &delimiter=/ and reads an empty result as a missing path, which breaks Spark's eventLog.dir when it points at an empty directory. Surface such directories as directory markers, matching directories created via PutObject with a trailing "/". Emptiness comes from the recursion result, and the marker MIME is set only on the in-memory listing entry, so empty directories stay eligible for empty-folder cleanup. * fix(s3): only surface empty directory markers for explicit dir probes Restrict the empty-directory marker to a trailing-slash prefix probe (prefix=dir/), the pattern hadoop-aws getFileStatus uses. Plain listings are left as before, so an empty directory left behind by deleted objects (e.g. after lifecycle expiration) is no longer shown as a phantom key.	2026-05-21 14:05:16 -07:00
Chris Lu	d82b3a8d6a	refactor(s3): drop unused source path in copy ETag check ETagEntry derives the tag from chunks/Md5/remote-etag, never the entry path, so the conditional-copy check no longer builds a bogus FullPath.	2026-05-21 09:51:50 -07:00
Chris Lu	83b7ea5e7b	fix(s3): keep server-side copy data in the bucket collection (#9607 ) * fix(s3): keep server-side copy data in the bucket collection UploadPartCopy and SSE-C CopyObject assigned destination volumes against r.URL.Path, the S3 request URI. The filer derives a bucket's collection only when the assign path sits under its buckets folder, so an S3 URI routed copied bytes to the default collection instead of the destination bucket's. Assign against the destination's real filer path. * refactor(s3): centralize copy-part path and thread dstPath into SSE-C copy Extract copyPartLocation so the fast path and writeEmptyCopyPart share one definition of the .uploads/<id>/<n>_copy.part location. Pass the destination filer path into copyChunksWithSSEC instead of re-deriving it from the request, and thread it through key rotation so re-encrypt copies also assign in the destination bucket's collection.	2026-05-21 09:35:42 -07:00
Mmx233	9b9fdb5b76	fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577 ) * fix(s3): sync IAM policies to advanced IAM Manager policy engine * test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync * fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload Sync the policies already loaded from the credential store into the IAM Manager's engine from SetIAMIntegration itself, instead of re-running a full LoadS3ApiConfigurationFromCredentialManager after setup. This covers both startup orderings without a second filer round-trip or racing the async loader goroutine: if the load won, the policies are in memory to push; if SetIAMIntegration won, the load's own sync runs afterward. Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock so the per-request auth RLock path isn't blocked by the policy recompile. * fix(s3): serialize IAM manager policy resync to avoid stale snapshots SyncRuntimePolicies replaces the manager's full policy set, so applying a policy view captured before a later mutation can resurrect a deleted policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy, SetIAMIntegration, and the credential-manager load) through a single resyncIAMManagerPolicies that serializes on a dedicated mutex and reads iam.policies fresh at apply time, so the live map always wins regardless of interleaving. The load now installs the config into iam.policies before resyncing, closing the window where the manager held policies the map didn't yet have. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-05-21 00:39:42 -07:00
Chris Lu	c00aa90990	fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581 ) Authentication records the identity with r.WithContext, which returns a request copy. Handlers that log their own audit entry (PUT, DELETE, tagging) see it, but GET/HEAD object and IAM operations rely on track()'s fallback entry, which is built from the original request the auth copy never reached - so requester came out empty. Install a mutable identity holder on the request before authentication and have SetIdentityNameInContext record into it. The holder is shared by pointer across every request copy, so the fallback entry recovers the authenticated requester. The per-request context value still takes precedence, so nothing changes for handlers that see the auth copy.	2026-05-20 10:13:33 -07:00
Chris Lu	cc5ef1b741	feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572 ) * feat(s3): add TagUser, UntagUser, ListUserTags IAM actions Adds AWS IAM-compatible user tag operations on the embedded IAM endpoint. Tags persist in the Identity proto as a repeated UserTag field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits are enforced. Pagination is stubbed (IsTruncated=false) since the 50-tag cap means all tags fit in a single response. * review: validate UntagUser TagKeys entries parseTagKeysParams now rejects empty keys and keys past MaxUserTagKeyLength; UntagUser additionally requires at least one TagKeys.member.N entry to match AWS validation behavior. * review: pre-allocate user-tag merge and filter slices mergeUserTags now allocates the combined existing+incoming capacity up front; UntagUser builds the filtered slice via make with the full ident.Tags capacity instead of ident.Tags[:0:0], which forced a reallocation on every append. * review: cover duplicate-in-request and invalid TagKeys cases Regression tests assert TagUser rejects two members with the same key in one request, and UntagUser rejects missing/empty/oversized TagKeys entries.	2026-05-19 17:35:44 -07:00
Chris Lu	37b6a14b0d	feat(s3): add four bucket configuration handlers (#9570 ) * feat(s3): add four bucket configuration handlers - GetBucketPolicyStatus: computes IsPublic from the existing bucket policy - PutBucketRequestPayment: companion writer to the existing GET; accepts only BucketOwner - GetBucketAccelerateConfiguration: returns <Status>Suspended</Status> - GetBucketLogging: returns an empty BucketLoggingStatus Lets AWS SDK probes succeed instead of returning MethodNotAllowed. * review: route GetBucketPolicyStatus through checkBucket Mirrors the existence/auth gating used by other bucket handlers and drops the bespoke filer_pb lookup so NoSuchBucket precedence is consistent across the API surface. * review: cap PutBucketRequestPayment body with MaxBytesReader The body is unmarshalled as RequestPaymentConfiguration, which is a handful of bytes; reject excessively large payloads up front and defer Close immediately after wrapping. * review: gate static getters on checkBucket GetBucketAccelerateConfiguration and GetBucketLogging now run the standard bucket existence check before returning the static Suspended / empty-status response so a missing bucket cannot appear to have valid configuration. * review: share cache helper across misc tests; check io.ReadAll error Accelerate and Logging tests now run through newMiscTestServer like the others so the checkBucket guard sees a cached bucket; the ReadAll error is explicitly checked.	2026-05-19 17:35:08 -07:00
Chris Lu	cee2bf697c	feat(s3): stub bucket configuration list endpoints (#9571 ) * feat(s3): stub bucket configuration list endpoints Adds Get and List handlers for Analytics, Inventory, IntelligentTiering, and Metrics bucket configurations. List returns an empty result with IsTruncated=false; single-get returns NoSuchConfiguration so SDK error parsing remains predictable. * review: gate stubs on bucket existence All eight stub handlers now call checkBucket via stubBucketGuard so NoSuchBucket takes precedence over NoSuchConfiguration / empty-list responses, matching AWS S3 precedence. Tests provide a cached bucket so the guard sees it as present.	2026-05-19 17:34:51 -07:00
Chris Lu	285025eb73	s3api: support group inline policies + Condition enforcement (#9569 ) * test(s3api): cover IAM inline policy aws:SourceIp + group inline gap Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload → VerifyActionPermission with a synthetic 127.0.0.1 request and assert that the policy's IpAddress condition flips the outcome. The user-policy cases pass on master (hydrateRuntimePolicies already routes inline docs through the policy engine, so Condition blocks are honored end- to-end). The group-policy case fails: PutGroupPolicy still returns NotImplemented, so a group inline doc never lands in the engine. Integration counterparts live under test/s3/iam/ and exercise the same paths against a live SeaweedFS S3+IAM endpoint. * s3api: support group inline policies + Condition enforcement PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to return NotImplemented in embedded IAM mode, so anything attached to a group as an inline doc — including aws:SourceIp or any other Condition — was simply unreachable. Wire the four endpoints to the credential-store methods that were already in place (memory, postgres, filer_etc all implement GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies now also walks LoadGroupInlinePolicies, registers each doc in the IAM policy engine under __inline_group_policy__/<group>/<policy>, and appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are added to the ReloadConfiguration trigger list in DoActions. Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too. Without it iam.groups never repopulated on a memory-store reload, so group policy evaluation silently no-op'd whether the policy was inline or attached. The existing tests didn't notice because no test reloaded through cm after creating a group. The NotImplemented unit test is inverted to drive the new round-trip. * s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy DoActions already triggers ReloadConfiguration for both actions via the explicit reload list, so calling refreshIAMConfiguration inline runs the load twice per request. Per PR review. * s3api: scope group-policy resource names per test; tighten deny polling - Integration test resource names get a per-test suffix so retried or parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists. - Deny-path Eventually loops gate on AccessDenied via a typed helper rather than any non-nil error; transient setup errors no longer end the wait prematurely. - ListGroupPolicies returns ServiceFailure when the credential manager is nil, matching Put/Get/DeleteGroupPolicy. * test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs CI runners with happy-eyeballs resolve `localhost` to ::1 first, in which case a 127.0.0.0/8-only allow would silently never match and the deny-driven enforcement test would hang for the allow case. Add ::1/128 to every loopback-matching policy so the allow path works regardless of which loopback family the SDK lands on.	2026-05-19 16:03:45 -07:00
Chris Lu	f72983c1fd	fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566 ) * fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" The S3 Tables REST endpoints share top-level paths with the regular S3 API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for GetTable). They are registered first on the same router as the bucket subrouter, so a path-style request such as GET /buckets?list-type=2 on a bucket actually named "buckets" matched ListTableBuckets and returned JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with "Unexpected character '{' (code 123) in prolog". Disambiguate by requiring the AWS V4 credential scope to name the s3tables service on the colliding routes. Regular S3 SDKs sign with service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is present in both the Authorization header and the X-Amz-Credential query parameter for presigned URLs, so the matcher works for both flavors. ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.) already cannot collide because colons are not valid in bucket names, so they are left untouched. * fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal The Iceberg catalog integration tests send unsigned PUT /buckets with Content-Type: application/x-amz-json-1.1 to create table buckets. With only the credential-scope check, those requests fell through to the regular S3 CreateBucket handler and the suite went red on this branch. Extend the matcher so a request is recognized as S3 Tables when either: - its AWS V4 credential scope names SERVICE=s3tables; or - it carries the canonical AWS JSON RPC 1.1 content type and is unsigned (a request explicitly signed for SERVICE=s3 still wins). The regular S3 SDKs do not send application/x-amz-json-1.1, so the signal is safe for the colliding paths (/buckets, /get-table). Also add an AWS SDK V2 for Go integration test under test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer against a bucket literally named "buckets" and "get-table" — the SDK errors before the test asserts if the server returns the wrong body shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml, mirroring the etag/acl workflow. * s3api: extend service matcher to all S3 Tables routes; simplify scope check - Apply serviceMatcher to every S3 Tables route, not just the bare-path ones. ARN-bearing paths could otherwise be hit by an S3 object key that starts with arn:aws:s3tables:..., inside a bucket named "buckets", "namespaces", "tables", or "tag". One matcher everywhere closes both collision classes. - Replace strings.Split + index lookup with strings.Contains for the credential-scope check. The scope shape is fixed at AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components, and access keys are alphanumeric — so /s3tables/ matches iff SERVICE is exactly s3tables. Existing unit cases (including the access-key-substring case) still pass. - Read the GetObject body in the SDK v2 routing test with io.ReadAll; the single Read could return short and make the equality check flaky. * s3api: drop content-type fallback; sign s3 tables harness traffic instead The content-type fallback in isS3TablesSignedRequest let an anonymous regular-S3 request whose body type is application/x-amz-json-1.1 hit an S3 Tables route when the path-style object key happened to be shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets" with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher back to the AWS V4 credential scope so only requests signed for SERVICE=s3tables match the S3 Tables routes. Update the Iceberg catalog test harness — the only caller still sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The mini instance runs in default-allow mode, so the signature itself is not verified; only the credential scope matters for the route match. Drop the stale unit cases for the JSON-RPC content-type signal and the routing test that exercised unsigned harness traffic.	2026-05-19 14:24:25 -07:00
Chris Lu	d57de6dc20	fix(s3): keep anonymous access working with EnableIam default (fixes #9557 ) (#9567 ) fix(s3): keep anonymous access working with EnableIam default `docker run seaweedfs` (and `weed mini` with no config) start with EnableIam=true but no IAM config file and no identities. The advanced-IAM init path was failing in 4.25 because of the missing STS signing key, which masked a latent bug: SetIAMIntegration unconditionally flipped isAuthEnabled to true, and isEnabled() also treated a non-nil iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the STS fallback started succeeding, the integration got installed end to end, and every anonymous S3 request bounced as AccessDenied. Separate the two concerns: SetIAMIntegration just plumbs in the OIDC / embedded-IAM machinery, and a new EnableAuthEnforcement opts in to enforcement. The startup path calls it only when -s3.iam.config is actually provided, so operators with explicit IAM configs still get auth (preserves #7726). isEnabled() now reads isAuthEnabled only.	2026-05-19 13:03:30 -07:00
Chris Lu	c61d227613	s3api: verify source permission on CopyObject and UploadPartCopy (#9555 ) * s3api: verify source permission on CopyObject and UploadPartCopy The Auth middleware only authorized the destination because routes key on the request URL. The source from X-Amz-Copy-Source was never evaluated, so an STS session token scoped to one prefix could copy from any other prefix in the same bucket. Add AuthorizeCopySource on IdentityAccessManagement to run the full bucket-policy + IAM/identity flow against the source, using a synthetic GetObject request so action resolution lands on s3:GetObject (or s3:GetObjectVersion when a source versionId is supplied). Both CopyObjectHandler and CopyObjectPartHandler now invoke it before reading the source. * s3api: preserve presigned-URL session token on copy-source check Presigned CopyObject / UploadPartCopy requests carry the STS session token in the query string (X-Amz-Security-Token), not in a header. Rebuilding the synthetic source URL from scratch dropped that token, so the source authorization would fall through to non-STS paths and miss session policy enforcement. Forward X-Amz-Security-Token from the original query (alongside versionId), still excluding unrelated params like uploadId/partNumber that would steer ResolveS3Action away from s3:GetObject.	2026-05-18 21:35:53 -07:00
Chris Lu	58c3fa802c	fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540 ) When s3.domainName is set, all bucket-prefix routes were gated on a matching Host header. Requests that arrive via an IP, an unlisted hostname, or a reverse proxy that rewrites Host hit no router and bounce back as 405/404 (and 503 once a proxy maps the upstream error). Register the path-style catch-all unconditionally, after the host-specific routers, so it only fires when no Host matcher applies.	2026-05-18 19:44:19 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Konstantin Lebedev	7d1b16fbcd	fix: ListBucketsHandler for pathStyleDomains (#9510 )	2026-05-15 13:12:55 -07:00
Chris Lu	e9bcb8f4ad	docs(s3/lifecycle): refresh DESIGN.md as-built (#9491 ) * docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages DESIGN.md was written as a phased implementation plan ("Phase 2 will ship X, Phase 4 will ship Y"). All phases are now merged, plus the post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially changed the worker model (single subscription, walker throttle, observability gauges). Rewrite the doc in present tense describing what's actually there. Net changes vs the prior plan-style doc: - Algorithm pseudo-code reflects the single-subscription fan-out plus walkedThisPass within-pass guard. - Walker invocation table replaces the implicit "two distinct calls" prose with three call sites (recovery / steady-state / empty-replay) and their throttle gates. - New section on the subscription model (one Reader, ShardPredicate, fan-out by ev.ShardID). - New section on cursor.LastWalkedNs and the WalkerInterval throttle. - Observability section: gauges, heartbeat tokens, what each means. - "Implementation history" table maps phases to merged PRs. - "Future work" lists the four optimizations we deferred (long-lived subscription, bucket-coordinated walker, per-bucket lag metric, filer meta-log retention). Drop the "Phase N — ..." narrative from the bottom; the PR history table is the durable artifact now. Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for the operator-facing docs. README explains the sync workflow with the external seaweedfs.wiki.git repo. Five pages: - Home.md — landing page, supported rule shapes, what the worker does - Operator-Guide.md — config knobs, when to change each, walker interval recommendations by cluster size - Monitoring.md — Prometheus metric reference + heartbeat token table + suggested PromQL alerts - Troubleshooting.md — stuck cursor, walker stuck, failure outcomes, cursor schema for manual inspection - Architecture.md — high-level overview for newcomers; sits between Home.md (operator) and DESIGN.md (developer) * docs(s3/lifecycle): address PR review feedback on docs Coderabbit + gemini findings on #9491: - Monitoring.md: clarify the "matches all dispatched" phrasing; note that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value (shouldn't appear in healthy systems); filter PromQL alerts to ignore zero-valued gauges so fresh-install heartbeats don't trip. - Operator-Guide.md, Troubleshooting.md: clarify weed shell -master format as host:http_port.grpc_port (SeaweedFS ServerAddress). - Troubleshooting.md: pause the s3_lifecycle job in the admin UI before manually editing a cursor file, otherwise the worker's save races with the operator's edit. - Architecture.md, Home.md, Operator-Guide.md, Monitoring.md, Troubleshooting.md, DESIGN.md: add language tags (`text`) to fenced code blocks for markdownlint MD040 compliance. - DESIGN.md: standardize on the S3 spec rule names (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`, `AbortIncompleteMultipartUpload`) and add a one-line note mapping them to the engine's `ActionKind` constants. - README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the sync workflow so the `cp` commands' repo-root-relative paths work whether the operator's shell is at the repo root or at docs/wiki/s3-lifecycle/. - Home.md: was lagging the wiki-repo merged version (had the older pre-merge content). Re-sync from the wiki repo so source matches. docs(s3/lifecycle): remove wiki pages from PR The wiki pages belong in seaweedfs.wiki.git, not the main repo. The source-of-truth concern that motivated adding them here is real but the cost — every code-review touchpoint requires reviewers to load operator-facing pages too — outweighs it. The wiki pages are already pushed locally (~/dev/seaweedfs.wiki); they'll publish on the operator-side workflow. This PR remains scoped to DESIGN.md (the developer-facing reference that does belong with the code). * docs(s3/lifecycle): drop Implementation history section git log is the durable record of what shipped when; the prose table duplicates it and goes stale faster than commit metadata. * docs(s3/lifecycle): soften 'exactly once per run' in Goal The prior phrasing overstated the guarantee versus the failure model documented later in the same file. Reword to: 'process due objects each pass; retryable/blocked outcomes get retried from the cursor on later runs.' Surfaces the head-of-line-blocking semantics up front so the rest of the doc reads consistently. Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those pages live in the wiki repo, not the main repo.	2026-05-13 17:06:14 -07:00
Chris Lu	d5e54f217d	feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486 ) Operator visibility was the last item on the daily-replay must-have list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing ever set it — leftover from the streaming worker that got deleted. Wire it up and add a parallel one for the walker so a single PromQL query answers "is this thing working?": - `cursor_min_ts_ns{shard}` set after each cursor save. Operators read `now - cursor_min_ts_ns` as the per-shard replay lag. - `daily_run_last_walked_ns{shard}` new — set in parallel so operators can confirm WalkerInterval is actually being honored. A stuck value means the scheduler isn't invoking the worker, the throttle is too long, or the walker is failing. - saveCursorAndPublish wraps every Save call site in runShard so the gauges and the persisted state stay aligned (gauges only advance on successful saves). - Enhance the `daily_run: status=... duration=...` heartbeat with `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep. Existing tokens stay positional-stable; new ones append at the end. Marker `cold` distinguishes "not started" from "0s caught up." Tests pin the summary line: cold-start state, max-across-shards selection, and partial-fill (some shards drained, others walked). Stacked on #9485.	2026-05-13 14:18:35 -07:00
Chris Lu	c6582228b8	feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484 ) * feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval The steady-state and empty-replay walker fired on every dailyrun.Run invocation, which is fine when Run is called at the bucket-walk cadence the operator intends (e.g., once per hour or once per day), but catastrophic when a fast driver like the s3tests CI workflow or the admin worker scheduler invokes Run at multi-second cadence — each tick ran a full subtree scan per shard, crushing the filer. Decouple walker cadence from Run() invocation cadence: persist LastWalkedNs in the per-shard cursor and fire the steady-state / empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval. Cold-start and recovery walker fires (RecoveryView) stay unconditional since those are bounded events that must run when their trigger condition (no cursor, hash mismatch) is met. Recovery walker fires also update LastWalkedNs so the subsequent steady-state pass doesn't double-walk. cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the in-repo integration tests and s3tests fast driver continue to work unchanged. Production deployments should set this to the walk cost budget (typically 1h-24h depending on cluster size). Cursor file is back-compat: last_walked_ns is omitempty, so cursor files written before this change decode as LastWalkedNs=0, which walkerDue treats as "never walked steady-state" → walker fires next pass to establish the anchor (same path a cold-start cursor takes). No version bump. Operator surface for WalkerInterval is the dailyrun.Config struct; plumbing through worker.tasks.s3_lifecycle.Config and the admin schema is a follow-up. * fix(s3/lifecycle): suppress walker double-fire within a single pass Two gemini-code-assist findings: 1. walkerDue with interval=0 returned true even when lastWalkedNs == runNow.UnixNano() — the cold-start / recovery branch already fired the walker this pass, and the steady-state fall-through fired it again. RecoveryView is a superset of every per-shard partition, so the second walk added zero coverage and burned a full subtree scan. Add a within-pass guard at the front of walkerDue: if the cursor's LastWalkedNs equals runNow's UnixNano, the walker already ran this pass — skip. 2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue instead of the local lastWalkedNs variable the rest of runShard threads through. Trivially equal at this point in the function, but the inconsistency would mask a future bug if any code above the branch ever sets lastWalkedNs. Test updates: TestWalkerDue gains the within-pass guard case plus a companion "earlier same pass still fires" sanity check. TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration: cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly once, not twice. * fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard Two coderabbit findings: 1. validate() now rejects negative cfg.WalkerInterval. A typo like -1h previously fell through walkerDue's `interval <= 0` branch and silently re-enabled "walk every pass" — the exact behavior the throttle was added to prevent. The admin-config parser already clamps negative input to zero, but callers using dailyrun.Config directly (tests, embedders) now get a loud error instead. 2. Within-pass double-fire suppression moves out of walkerDue and into runShard's walkedThisPass local flag. walkerDue's equality check (lastWalkedNs == runNow.UnixNano) was correct in production (each pass freezes runNow at time.Now().UTC, no collisions) but fragile in tests that inject the same runNow across distinct passes — the test would see false suppression. Separating the concerns also makes walkerDue answer one question (persisted-state throttle) and runShard another (within-pass call-site dedup). walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval pins the new validation. TestWalkerDue's within-pass cases move out (the function is pure throttle now); TestRunShard_ColdStartDoesNot DoubleWalk still pins the integration behavior end-to-end.	2026-05-13 14:09:13 -07:00
Chris Lu	79859fc21d	feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468 ) * feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd Three diagnostic additions on top of #9460, all aimed at making the next production incident faster to triage than the one we just spent hours on. 1. [versioning-heal] grep prefix on every heal-related log line, with a small fixed event vocabulary (produced / surfaced / healed / enqueue / drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed / teardown_failed / queue_full). One grep gives operators a single event stream across the produce-to-drain lifecycle. 2. Escalate the "scanned N>0 entries but no valid latest" case in updateLatestVersionAfterDeletion from V(1) Infof to a Warning that names the orphan entries it saw. This is the listing-after-rm inconsistency signature that pinned down 259064a8's failure — it should not be invisible at default log levels. 3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]` that walks .versions/ directories under a prefix and reports the stranded population. With -heal it clears the latest-version pointer in place on stranded directories so subsequent reads return a clean NoSuchKey instead of replaying the 10-retry self-heal loop. * fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry Address PR review: 1. s3.versions.audit walked only the first 1024-entry page of each .versions/ directory, false-positiving "stranded" on large dirs. Loop until the page returns < 1024 entries, advancing startName. 2. clean and orphan-only categories double-counted when a directory had no pointer and at least one orphan: incremented both. Make them mutually exclusive so report totals sum to versionsDirs. 3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep, non-interruptible by ctx. A server shutdown / client disconnect would wait out the budget per in-flight delete. Thread ctx through deleteSpecificObjectVersion -> repointLatestBeforeDeletion / updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context(); gRPC lifecycle handlers pass the stream ctx. New test pins the behavior: cancelling ctx mid-backoff returns ctx.Err() in <500ms instead of blocking ~6.3s. * fix(s3/versioning): clearStale outcome + escape grep-able log fields Two coderabbit follow-ups: 1. Successful pointer clear should suppress `produced`. updateLatestVersionAfterDeletion's transient-rm fallback called clearStaleLatestVersionPointer best-effort, then unconditionally returned retryErr. The caller (deleteSpecificObjectVersion) saw the error and emitted `event=produced` + enqueued the reconciler, even though clearStaleLatestVersionPointer had just driven the pointer to consistency and the next reader would get NoSuchKey via the clean-miss path. Make clearStaleLatestVersionPointer return cleared bool; on success the caller returns nil so neither produced nor the reconciler enqueue fires. Concurrent-writer aborts, re-scan errors, and CAS mismatches still report false so genuinely stranded state keeps surfacing. 2. Escape user-controlled fields in heal log lines. versioningHealInfof / Warningf / Errorf interpolated raw bucket / key / filename / err text into a single-space-separated line. An S3 key (or error string from gRPC) containing whitespace, newlines, or `event=...` could split one event into multiple tokens and spoof fake fields downstream. Sanitize each arg in the helper: safe values pass through; anything with whitespace, quotes, control chars, or backslashes is replaced with its strconv.Quote form. No caller changes — the format strings remain unchanged. Tests pin both behaviors: sanitization table covers the field boundary cases; an end-to-end shape test confirms a key containing `event=spoof` stays inside a single quoted token.	2026-05-13 10:48:58 -07:00
Chris Lu	f5a4bfb514	fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460 ) * fix(s3/versioning): repair dangling latest-version pointer after partial delete deleteSpecificObjectVersion did two non-atomic filer ops: rm the version blob, then update the .versions/ pointer. Step 2 failures were silently logged and the client got 204 OK, so any transient blip (filer timeout, process restart between RPCs, lock contention) left the .versions/ directory naming a missing file. Subsequent GETs paid the 10-retry self-heal cost and returned NoSuchKey — surfacing as "Storage not found" to Veeam, which is what triggered this investigation. Three changes: 1. Pre-roll the pointer for the singleton / multi-version-deleting-latest cases. The pointer is repointed (multi) or cleared (singleton) before the blob rm. A failure between leaves a recoverable orphan blob — pointer is consistent, GETs succeed or correctly miss without entering the stale-pointer self-heal path. 2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion with bounded retries (~6.3s worst case). When retries are exhausted the function now returns a non-nil error instead of swallowing it. The caller logs at Error level and queues the path for the reconciler. 3. Background reconciler drains stranded .versions/ pointer-to-missing states off the hot path. Bounded in-memory queue with capped retries; read-path heal remains as a last-resort safety net. * fix(s3/versioning): address review on #9460 Four fixes addressing review on PR #9460. All four are correctness; no behavioural change for the happy path. 1. repointLatestBeforeDeletion: discriminate NotFound from transient errors when re-fetching the .versions/ entry. Previously any error returned rolled=true,nil — a transient filer hiccup at that point would cause the caller to skip the post-delete reconciliation AND proceed with the blob rm, producing exactly the dangling pointer state the PR aims to prevent. NotFound stays "vacuously consistent" (directory already gone); other errors surface so the caller aborts before removing the blob. 2. Move the singleton .versions/ teardown out of repointLatestBeforeDeletion (where it ran BEFORE the blob rm and always failed with "non-empty folder") into deleteSpecificObjectVersion AFTER the blob rm. Adds a wasSingleton return value so the caller knows when to run the teardown. Without this, every singleton-version delete in a versioned bucket leaked an empty .versions/ directory. 3. Wrap the list, getEntry, and mkFile calls inside repointLatestBeforeDeletion with retryFilerOp so the pre-roll has the same transient-failure resilience as the post-roll path. Without retries, a single transient blip causes the caller to fall back to the legacy non-atomic flow even when the filer recovers immediately. 4. healVersionsPointer in the reconciler: same NotFound-vs-transient discrimination on both the .versions/ getEntry and the latest-file presence probe. Previously a transient filer error would silently evict the candidate from the queue as "healed", leaving the real stranded state until a client read happened to surface it. Also fixes the gemini-flagged consistency nit: the queued-for-reconciler error log now uses normalizedObject instead of object so it matches the queue entry's key. * fix(s3/versioning): short-circuit terminal errors in retryFilerOp Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound, gRPC NotFound, context.Canceled, and context.DeadlineExceeded. retryFilerOp now bails immediately on a terminal error and returns it unwrapped, so callers like repointLatestBeforeDeletion.getEntry and updateLatestVersionAfterDeletion.rm see the raw NotFound instead of paying the ~6.3 s retry-budget delay AND parsing it out of an "exhausted N retries" wrapper. errors.Is and status.Code already walk the %w chain so today's call sites still work, but the delay was real on the hot DELETE path whenever a key was genuinely absent. Test added covering all five terminal-error shapes — each must run the wrapped fn exactly once and return in under 50 ms.	2026-05-13 10:14:27 -07:00
Chris Lu	3f1eaf9724	fix(s3/audit): emit audit log for successful GET/HEAD (#9467 ) * fix(s3/audit): emit audit log for successful GET/HEAD Successful GET/HEAD object requests never produced a fluent audit entry because those handlers write the response directly (streaming for GET, WriteHeader for HEAD) and never reach a PostLog call site. The wiki advertises GET as an audited verb, so the asymmetry surprises operators who rely on the log for read-access auditing. Move the safety net into the track() middleware: tag each request with an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it, and emit a single fallback entry after the handler returns when nothing fired. The recorder's status flows into the fallback so the audit row still reflects 200/206 vs 404 etc. No double logging for handlers that already emit (write helpers, error paths, bulk delete). Refs #9463 * fix(s3/audit): defensive nil checks on audit-tracking helpers Address PR review: guard against nil request and nil atomic.Bool stored under the audit-tracking key. The conditions are unreachable today (the key is private and we only ever store new(atomic.Bool)), but the checks are free and keep the helpers safe if a future caller misbehaves. test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469) test(s3/audit): cover track() fallback wiring + cleanup Adds two unit tests in weed/s3api/stats_test.go that exercise the audit-tracking flag set up by track(): one verifies the fallback path fires when a handler writes the response directly (the GET/HEAD object regression in #9463), the other verifies the flag is set when a handler emits PostLog itself so the fallback is skipped. To make the wiring observable without standing up fluent, PostLog now marks the audit flag before short-circuiting on a nil Logger; production behavior is unchanged (no logger, no posting) but the flag stays consistent. Also drops two stale comments in s3api_object_handlers.go that still referenced proxyToFiler — that helper was removed when GET/HEAD started streaming from volume servers directly. Stacks on #9467.	2026-05-13 09:24:59 -07:00
Chris Lu	d5372f9eb7	feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471 ) Phase 4b shipped the walker without plugging it into the cluster rate.Limiter that processMatches honors. A walker hitting a large bucket on the recovery branch could burst LifecycleDelete RPCs past the cluster_deletes_per_second cap that streaming-replay respects. WalkerDispatcher now takes a *rate.Limiter and waits on it before each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds just like processMatches does. The handler passes the same limiter to both paths so replay + walk share one budget; nil disables throttling (unchanged default). Tests pin: the limiter actually delays a dispatch when the burst token is drained, and a ctx cancellation in Limiter.Wait surfaces as an error without sending the RPC.	2026-05-13 09:24:50 -07:00
Chris Lu	37e505b8fd	refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481 ) * refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards) even though the same gRPC stream could serve every shard in a worker process. Replace with one SubscribeMetadata stream covering all shards in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and a fan-out goroutine routes events to per-shard channels by ev.ShardID. drainShardEvents now reads from a passed-in channel; shards whose persisted cursor is fresher than the global floor (runNow - maxTTL) filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader when the first ev.TsNs > runNow arrives — meta-log order means the rest of the stream is past the pass boundary too. cfg.Workers no longer gates shard concurrency: with the shared subscription, every shard goroutine must be live to drain its channel, or the fan-out stalls. The field is retained for back-compat and ignored. Dispatch throttling still goes through cfg.Limiter. Filer load: 16x -> 1x SubscribeMetadata streams per pass. * fix(s3/lifecycle): shared subscription floor is min(per-shard cursor) The shared subscription used runNow - maxTTL as its starting TsNs, but that's the cold-start floor. For shards whose persisted cursor sits below the floor — exactly the case a rule with TTL == maxTTL produces, where a pending event's PUT TsNs ends up at runNow - maxTTL — events that the per-shard drain still needs are filtered out before the Reader even forwards them. Same regression I fixed in `6796ab6db` for the per-shard subscription; now applied at the shared level. computeGlobalStartTsNs loads every shard's cursor and picks the minimum, falling back to the cold-start floor only for shards with no persisted cursor.	2026-05-13 02:13:11 -07:00
Chris Lu	b1d59b04a8	fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477 ) * fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466: 1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output. 2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs. * fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path) runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because: - ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty. - The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function. Move the walker invocation INTO the empty-replay branch so walker- only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests. * fix(s3/lifecycle): cold-start walker covers pre-existing objects runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151). * fix(s3/lifecycle): restore walker tag and null-version state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): parallelize shell shard sweeps Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard Two CI bugs surfaced after PR #9466 deleted the streaming worker: 1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx. 2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes. * fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log. * fix(s3/lifecycle): trust persisted cursor; never bump past pending events The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 00:19:05 -07:00
Chris Lu	5004b4e542	feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466 ) * feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) Phase 5a (PR #9465) retired the algorithm flag and made daily_replay the only execution path. The streaming-side code (scheduler.Scheduler, scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher, dispatcher.FilerPersister, and their tests) has had no in-tree caller since then. This PR deletes it. Net change: ~4800 lines removed, ~130 added (the scheduler/configload tests' helper file the deleted bootstrap_test.go used to host). Removed: - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test, scheduler,scheduler_test,pipeline_fanout_test, refresh_default,refresh_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test, dispatcher_helpers_test,edge_cases_test,multi_shard_test, pipeline,pipeline_test,pipeline_helpers_test,toproto_test, dispatch_ticks_default,dispatch_ticks_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go (FilerPersister deleted; FilerStore tests don't need their own file) - weed/shell/command_s3_lifecycle_run_shard{,_test}.go (debug-only shell command that only ever wrapped the streaming pipeline; the production worker now exercises the same path every daily run) Trimmed: - dispatcher/filer_persister.go down to FilerStore + NewFilerStoreClient — the small interface daily_replay's cursor persister (dailyrun.FilerCursorPersister) plugs into. Kept (still consumed by daily_replay): - scheduler/configload.{go,_test.go} (LoadCompileInputs, AllActivePriorStates) - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister, FilerSiblingLister) - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient) scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream, dirEntry, fileEntry — helpers the configload tests used to share with the deleted bootstrap_test.go. Updates the handler-package doc strings and one reader-package comment that still named the streaming pipeline. * fix(s3/lifecycle): hold lock through tree read in test filer client gemini caught an inconsistency in scheduler/testhelpers_test.go: LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was releasing the lock before reading c.tree. The map is effectively static during tests so there's no actual race today, but matching the convention keeps the helper safe if a future test mutates the tree mid-run.	2026-05-12 12:54:52 -07:00
Chris Lu	2f682303fb	fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464 ) Follow-up to PR #9459 (merged before this fix landed). The walker dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and "NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay drain (processMatches in run.go via #9462) use "RPC_ERROR" for the same condition. Aligning so a single Prometheus query covers all three delete paths. Folds nil-response under RPC_ERROR rather than a separate label — operationally it's the same class of failure (server returned no usable response).	2026-05-12 12:38:52 -07:00
Chris Lu	495632730c	feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462 ) * feat(s3/lifecycle): daily-replay observability metrics + per-run summary log Operators have no Prometheus signal today for the daily_replay path beyond the cluster-rate-limiter wait histogram. Phase 6 adds the three baseline questions: how long does a shard take, how many events did it scan, and what did dispatch produce. - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard): wall-clock per shard. p95 climbing toward MaxRuntime means the shard is brushing its budget. - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log events drainShardEvents processed. Pairs with the duration so a spike in events-per-shard correlates with a slow shard. - S3LifecycleDispatchCounter (existing, reused): processMatches now increments this with the outcome label, so streaming and daily_replay paths share one outcome view. Transport errors are counted under outcome="TRANSPORT_ERROR". dailyrun.Run logs a per-run summary at V(0): status / shards / errors / duration. The summary is the at-a-glance line operators read in /var/log to confirm a run completed. Test pins the dispatch-counter increment with a unique bucket/kind/outcome triple so a refactor that drops the instrumentation call surfaces as a test failure. * fix(s3/lifecycle): align dispatch error label + clean test labels Two PR-9462 review fixes from gemini: 1. processMatches' transport-failure label was "TRANSPORT_ERROR"; streaming's dispatcher uses "RPC_ERROR" for the same condition (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so the same Prometheus query covers both delete paths. 2. The dispatch-counter assertion test now deletes its label row on exit so the in-process Prometheus registry doesn't accumulate per-test state across the suite.	2026-05-12 12:15:20 -07:00
Chris Lu	f954781169	feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459 ) * feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config Adds a Config.RetentionWindow field that runShard threads into engine.PromotedHash. Zero (the default) falls back to maxTTL, which matches Phase 4a behavior — PromotedHash stays empty and the partition-flip recovery trigger stays dormant. Pure plumbing. The handler still passes zero so nothing changes at runtime. The walker work (Phase 4b proper) sets a real retention from the meta-log boundary and the partition-flip trigger starts firing. * feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient so the same LifecycleDelete RPC drives both the meta-log replay path and the walker. No CAS witness — the server's identityMatches treats nil ExpectedIdentity as a bootstrap call and rebuilds the witness from the live entry, which is the right contract for a full-tree walk. Adds VersionID to bootstrap.Entry so versioned-bucket walks address the right version. MPU init uses DestKey for ObjectPath (matching the prefix-match contract); rejecting empty DestKey keeps malformed init records out of the dispatch path. Not wired yet — runShard still doesn't invoke the walker. Follow-up commits add the ListFunc adapter and the recovery-branch wiring. * feat(s3/lifecycle): wire Walker hook into runShard's recovery branch Adds a Config.Walker callback that fires on rule-content edit / partition flip BEFORE the cursor rewinds, so already-due objects across the rewritten rule set get caught instead of waiting on meta-log replay alone. The callback receives engine.RecoveryView(snap) and the per-shard ID; nil disables it (Phase 4a behavior preserved). Decoupling the wiring from the implementation: the handler-side WalkerFunc that drives bootstrap.Walk via the filer is the follow-up commit, and tests can stub the callback without standing up the full filer/client/lister harness. Tests pin: walker fires exactly once on hash mismatch, walker error propagates and leaves the cursor unchanged, nil Walker is a no-op. * feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard Adds dailyrun.WalkBuckets — the composable driver the handler-side WalkerFunc will call. Iterates a bucket list, wraps the supplied bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey for MPU init), and runs bootstrap.Walk per bucket using the supplied Dispatcher. First bucket error wins; remaining buckets log and run to completion so one filer flake doesn't kill the shard. Composable rather than monolithic so callers and tests can swap parts: production uses a filer-backed ListFunc + WalkerDispatcher; tests use bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the next commit. Tests pin: shard filter routes only matching entries, MPU shard uses DestKey not the .uploads/<id> path, single-bucket error propagates while other buckets still run, ctx cancellation short-circuits between buckets, nil guards on view/list/dispatch. * feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that streams entries under <bucketsPath>/<bucket> by paginated SeaweedList. Recurses into regular directories; .versions/ and .uploads/ are skipped at this stage so they don't surface as raw children — the sibling expansion (versioned NoncurrentDays state, MPU init dispatch) lands in the next commit. listAll and isVersionsDir are ported from scheduler/bootstrap.go's same-named helpers. Phase 5 deletes the scheduler copies along with the streaming path. Tests pin: flat listing, recursion through nested directories, .versions/ and .uploads/ skipped, kill-resume via the start path contract, nil-client error, attribute propagation (mtime / size / IsLatest default). * feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc Adds the .versions/<key>/ expansion to the daily-run's filer-backed ListFunc. Each call emits one bootstrap.Entry per sibling (real version files + the bare null version, when found) with the same sibling state the streaming bootstrap injects via reader.Event: - Path = logical key (not the .versions/<file> physical path), so bootstrap.Walk's MatchPath uses the user's intended path. - VersionID per sibling (version_id or "null"). - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back to explicit-null-bare, falling back to newest-by-mtime. - NoncurrentIndex rank computed against the latest's position. - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the previous-newer sibling's mtime (legacy derivation). - IsDeleteMarker from ExtDeleteMarkerKey. - NumVersions = len(siblings). Two-pass walk so .versions/ dirs run before regular files; the bare null-version path is recorded in skipBare so pass 2 doesn't emit it twice. expandVersionsDir and lookupNullVersion are ported from scheduler/bootstrap.go. Sort order, latest resolution, and successor derivation must agree with that path verbatim so streaming and walker reach the same verdict on the same objects. Phase 5 deletes the scheduler copy. MPU init (.uploads/<id>) remains skipped — the dedicated commit emits it with IsMPUInit and DestKey. Tests pin: pointer-wins latest resolution, no-pointer newest-sibling fallback, explicit-null-is-latest with skipBare suppression of the bare emission, coincidentally-named .versions folder recursing as a regular subdir, delete-marker propagation. * feat(s3/lifecycle): emit MPU init records from FilerListFunc Last gap in the filer-backed ListFunc. A directory at .uploads/<id> carrying ExtMultipartObjectKey is the MPU init record; emit one bootstrap.Entry with IsMPUInit=true and DestKey set to the user's intended path. The walker's MatchPath uses DestKey for prefix matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's ObjectPath. .uploads/<id> directories without the extended key are mid-write before metadata landed and stay skipped. isMPUInitDir is upgraded from the path-shape-only stub to the full shape + extended-attr check that mirrors router.mpuInitInfo and scheduler/bootstrap.go's same-named helper. Tests pin: valid init record emits with the right DestKey, missing ExtMultipartObjectKey skips the directory. * feat(s3/lifecycle): wire walker into executeDailyReplay Activates the recovery-branch walker. The handler composes the three Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets — into a dailyrun.WalkerFunc and passes it via Config.Walker. The bucket list is derived from the compiled inputs so it matches the engine snapshot exactly. Effect on master behavior: when a worker observes a RuleSetHash or PromotedHash mismatch on its persisted cursor (rule content edited / partition flip), runShard now walks the live filer tree under the RecoveryView before rewinding the cursor. Already-due objects across the rewritten rule set fire immediately instead of waiting on the sliding meta-log replay. Still scoped to replay-eligible action kinds because checkSnapshotForUnsupported continues to reject walker-bound rules (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and scan_only-promoted rules at the top of Run. The follow-up commit relaxes the gate once the steady-state walker over RulesForShard's walk view is wired so those rules fire every day, not just on rule edits. * feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate Adds the second walker invocation in runShard. After the recovery check passes, runShard derives the walk view via snap.RulesForShard (using the same retentionWindow PromotedHash used, so the partition is consistent) and runs the walker over it. The view holds walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) plus any replay-eligible rules promoted to walk by retention shortage; an empty view skips the call so non-versioned, replay-only deployments don't pay an O(N) bucket walk per run. With the walker now servicing every rule kind, checkSnapshotForUnsupported and its UnsupportedRuleError type are obsolete. router.Route gates replay on Mode == ModeEventDriven, so walker-bound and scan_only rules are silently dropped by replay and picked up by the walker instead — no double-dispatch. Drop the gate, delete replayability.go + replayability_test.go, and remove the handler's redundant IsUnsupportedRule branch. * fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment Two PR-review fixes on 9459: 1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return — add a defensive nil-response check so the walk halts cleanly instead. Spotted by coderabbit. 2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash "stays empty" in fallback mode, which gemini correctly pointed out is only true once rules are active. During bootstrap (rules compiled but IsActive=false) MaxEffectiveTTL is 0 while PromotedHash counts every non-disabled rule, so promoted becomes non-empty and the next post-activation run hits the recovery branch. That's the intended bootstrap walk — rewrite the comment to explain it rather than misstate the invariant. Test: pins nil-response → error path on WalkerDispatcher. * fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion Reviewer caught a structural bug in expandVersionsDir's latest resolution: when ExtLatestVersionIdKey was set but no scanned sibling carried that id (stale pointer), the code left latestPos at the default 0 without ever entering the no-pointer fallback. Today the two paths yield the same value (newest sibling wins), but the implicit fall-through makes the intent unclear and would break silently if the no-pointer branch ever did anything more than latestPos=0. Track a pointerResolved flag explicitly so the no-pointer branch (including the explicit-null-bare check) re-runs on a stale pointer. Behavior unchanged today. Test pins: stale pointer + two real versions falls back to newest-sibling (vnew, not vold). * feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher Mirrors the Phase 6 instrumentation already on the replay side (processMatches) onto the walker's Delete dispatch. Every walker dispatch now bumps S3LifecycleDispatchCounter with the resolved outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so streaming, daily_replay's replay drain, and daily_replay's walker share a single per-(bucket, kind, outcome) counter view. Lands together with the rest of Phase 4b — no new metric, just an extra observation site for the existing one.	2026-05-12 11:39:15 -07:00
Chris Lu	644664bbee	feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457 ) * feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) Replace the local replay-content-hash / max-effective-TTL helpers in dailyrun with the engine package's canonical versions (ReplayContentHash, MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface. Adds PromotedHash to the cursor's recovery triggers: a partition flip (rule moving between replay and walk because retention shifted) now fires the rule-change branch alongside RuleSetHash mismatch. The retentionWindow is set to MaxEffectiveTTL today, which keeps the promoted set empty and the trigger dormant; Phase 4b will plumb the real meta-log retention boundary so true scan_only promotions are detected. Cursor schema is unchanged — PromotedHash was already persisted as the zero hash in Phase 2. * docs(s3/lifecycle): note the one-time cursor rewind on hash format change gemini-code-assist flagged that swapping localReplayContentHash for engine.ReplayContentHash changes the persisted RuleSetHash byte layout (sort order + tagged-field encoding). Phase-2 cursors mismatch on first post-upgrade run and drop into the rule-change branch. Going with option 3 (document the intentional one-time rewind). The rewind is bounded to runNow - maxTTL (not time-zero), self-healing on the next save, and daily_replay is off by default so the affected population is limited to early adopters of the algorithm flag. A migration shim or a hash-compat layer would carry the legacy encoder forever for one bounded re-scan; not worth it. Comment in runShard makes the trade explicit so a future reader doesn't hunt for the "why does my cursor rewind once after upgrade" mystery. * chore(s3/lifecycle): trim verbose comments in dailyrun Cut multi-paragraph headers and narration that just described what the code does. Kept the small WHY notes (per-match skip vs per-rule, the one-time post-upgrade cursor rewind, scan_only rejection rationale). Same behavior, ~150 fewer lines of comment. * fix(s3/lifecycle): persist PromotedHash on the successful runShard save The comment-trim pass dropped the field alongside a "stays empty in Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b turns promoted into a real value — and a save that writes zero would make the next run falsely detect drift and rewind. Spotted by gemini-code-assist on PR 9457. Other save paths (recovery, drain-error) already persisted it; the success path is the only one that was missing it. Now consistent.	2026-05-11 21:18:19 -07:00
Chris Lu	884b0bcbfd	feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456 ) * feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) Admin computes a per-worker share of cluster_deletes_per_second at ExecuteJob time and ships it to the worker via ClusterContext.Metadata. The worker reads the share, constructs a golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the streaming path; until then streaming ignores the cap. Why allocate at admin: the cluster cap is a single knob operators care about. Dividing it locally per worker would either need out-of-band coordination or accept N× the configured budget. Admin is the only party that knows how many execute-capable workers there are, so it owns the math. Admin side (weed/admin/plugin): - Registry.CountCapableExecutors(jobType) returns the number of non-stale workers with CanExecute=true. - New file cluster_rate_limit.go: decorateClusterContextForJob clones the input ClusterContext and injects two metadata keys for s3_lifecycle. cloneClusterContext duplicates Metadata so per-job decoration doesn't race shared base state. - executeJobWithExecutor calls the decorator after loading the admin config; other job types pass through unchanged. Worker side (weed/worker/tasks/s3_lifecycle): - New cluster_rate_limit.go declares the constants both sides agree on (admin-config field names, metadata keys). Plain strings on the admin side keep weed/admin/plugin free of a dependency on the s3_lifecycle worker package; the two sets of constants are pinned to identical values and a mismatch would silently disable rate limiting. - handler.go executeDailyReplay reads ClusterContext.Metadata, builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}. Missing/empty/non-positive values → no limiter (legacy unlimited behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a bucket that never refills. - Admin form gains two fields under "Scope": cluster_deletes_per_second (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate). Metric: - New S3LifecycleDispatchLimiterWaitSeconds histogram observes how long each Limiter.Wait blocks before a LifecycleDelete RPC. Operators tune the cap by reading p95 — near-zero means the cap isn't binding, a long tail at 1/rate means it is. Tests: - weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering pass-through for non-allocator job types, rps=0 / no-executors skip, even sharing, burst sharing, burst=0 omit (worker default kicks in), burst floor of 1, no mutation of input metadata, nil input. - weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases covering nil/empty/missing metadata, non-positive/invalid rate, positive rate builds correctly, burst missing defaults to 2× rate, tiny rate clamps burst to ≥1. Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the parents; this branch stacks on Phase 2 since it consumes dailyrun.Config{Limiter} which lands there. * fix(s3/lifecycle): divide cluster budget by active workers, not all capable gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1 (handler.go:189) — it's a singleton job, only one worker is ever active. Dividing the cluster_deletes_per_second budget by the count of capable executors gave the single active worker just 1/N of the configured cap. Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor is now min(executors, maxJobsPerDetection), clamped to >=1. For s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a hypothetical parallel-dispatch job (maxJobs>1) the budget divides across the running-set. Tests swap the SharedEvenly case for two pinned scenarios: - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1 - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4 * feat(s3/lifecycle): drop Worker Count knob from admin config form The "Worker Count" admin field controlled in-process pipeline goroutines across the 16-shard space — per-worker tuning, not a cluster-wide scope concern. Operators looking at the form alongside Cluster Delete Rate reasonably misread it as the number of workers in the cluster. Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the plumbing through dailyrun.Config.Workers stays so a future need can re-introduce it as a worker-local knob (or just bump the constant). handler_test.go pins that "workers" must NOT appear in the form so the removal doesn't silently regress.	2026-05-11 19:17:06 -07:00
Chris Lu	3f4cb6d2fb	feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447 ) * feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) Adds the engine-side API the new daily-replay worker reaches for: per-view snapshot construction (RulesForShard, RecoveryView), the two cursor hashes that gate recovery (ReplayContentHash, PromotedHash), and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot is a stub keyed on a package-level atomic that the worker startup wiring populates. Views return new Snapshot instances holding cloned CompiledAction values so per-clone active/Mode never leak across partitions. Replay clones force Mode=ModeEventDriven to rehabilitate any persistent ModeScanOnly carried over from PriorState; walk and recovery clones preserve Mode as-is. Disabled actions are excluded from all views. No production caller is wired here — Phase 4's walker/dailyrun integration is the follow-up. dailyrun's local helpers (localReplayContentHash, localMaxEffectiveTTL) become one-line redirects to these exports. API surface: - CurrentSnapshot() Snapshot — stub until Phase 4 wiring. - SetCurrentEngine(Engine) — Phase 4 wiring entry point. - Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk Snapshot) - RecoveryView(s Snapshot) Snapshot — force-active over the full set. - ReplayContentHash(s Snapshot) [32]byte — partition-independent. - PromotedHash(s Snapshot, retentionWindow) [32]byte — partition-flip. - MaxEffectiveTTL(s Snapshot) time.Duration — over active replay only. 30 unit tests covering clone isolation, Mode rewrite, partition membership including the multi-action-kind XML rule split, RecoveryView activating pre-BootstrapComplete actions, ReplayContentHash partition-independence, PromotedHash sensitivity to promotion in either direction, MaxEffectiveTTL aggregation. Build + race-tests green. * refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics Addresses PR #9447 review feedback. Three medium-priority items from gemini, all code-quality refinements (no behavior change): 1. Duplicated sort comparator between ReplayContentHash and PromotedHash. Extract sortHashItems shared helper so the two hashes use the same ordering by construction — if one drifted, the cursor could see a spurious "rule changed" on a no-op snapshot rebuild. 2. Duplicated writeField/writeInt closures. Extract hashWriter struct holding the sha256 running hash + lenbuf, with method helpers. Same allocation profile (one Hash, one tiny stack buffer per helper); just deduplicates ~20 lines. 3. shardID parameter on RulesForShard is unused. Per the design's open question, every shard sees every rule today (shard filter runs at the entry-iteration site, not view construction). Keep the parameter for API stability — removing it now would force a breaking change when bucket-shard ownership lands — and update the doc comment to explain why it's reserved. go build ./... clean; engine test suite green.	2026-05-11 18:07:54 -07:00
Chris Lu	122ca7c020	feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446 ) * docs(s3lifecycle): design for daily-replay worker Captures the algorithm and dev plan iterated on in PR #9431 and the discussion leading up to it: per-shard daily meta-log replay, walker as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent plus a recovery branch over engine.RecoveryView(snap), explicit retention-window input to RulesForShard, two cursor hashes (ReplayContentHash + PromotedHash) that together detect every invalidation case. Implementation phases are sequenced so each can ship independently — Phase 1 (noncurrent_since stamp) just landed. * feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) New weed/s3api/s3lifecycle/dailyrun package implementing the bounded daily meta-log scan from the design doc. One pass per Execute per shard: load cursor, scan events forward, route each through router.Route, dispatch any due Match, advance the cursor on success. Halt-on-failure keeps the cursor at the last fully-processed event so tomorrow resumes from the same point — head-of-line blocking is the deliberate failure signal. Replay-only in this phase. Phase 4 wires the walker for ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules. Until then a typed UnsupportedRuleError refuses runs on those buckets: operators see the rejection in the activity log rather than silently losing rules. Behavior: - Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in Phase 2; Phase 4 turns it on. - Rule-change branch rewinds cursor to now - max_ttl when the replay-content hash mismatches. Cold start uses the same floor. - Transport errors retry 3x with exponential backoff capped at 5s; server outcomes (RETRY_LATER / BLOCKED) halt the run without retry. - Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules exist, only the hash gates a future addition. Worker shape: - New admin config field "algorithm" with enum streaming\|daily_replay, default streaming. Existing deployments are unaffected. - handler.Execute branches on the flag: streaming routes through the current scheduler.Scheduler, daily_replay routes through dailyrun.Run. - dispatcher.NewFilerSiblingLister exported so both paths share the same .versions/ + null-bare lookup. Engine integration: - Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase 4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace them with one-line redirects; the local versions hash the same fields so the cursor stays valid across the swap. Tests cover cursor persistence, unsupported-rule rejection, hash stability under rule reordering, hash sensitivity to TTL edits, max-TTL aggregation, dispatch retry budget, and request shape including the identity-CAS witness. Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers and future phases share the same spec. * feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob The streaming dispatcher hasn't shipped to users yet, so there's no backward-compat surface to preserve. Flip the algorithm default from streaming to daily_replay so the new path is the standard from day one. Streaming stays as an explicit opt-in escape hatch during the Phase 4 walker rollout; Phase 5 deletes both the flag and the streaming code. Buckets whose lifecycle rules require walker-bound dispatch (ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only) will fail the daily_replay run with the existing UnsupportedRuleError until Phase 4 walker integration ships. Operators hitting that case can set algorithm=streaming until the follow-up lands. Updates the test for the default value and renames the unknown-value-fallback case to reflect the new default. * fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}. That's only safe when each event produces at most one match per ActionKey with a single deterministic due-time formula — ExpirationDays and AbortMPU fit that shape because due_time = ev.TsNs + r.days is monotonic in event TsNs. But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed in Phase 2 since it compiles to ActionKindNoncurrentDays) routes through routePointerTransitionExpand, which emits matches for every noncurrent sibling — each with its own SuccessorModTime taken from the demoting event for that specific sibling. A single event can therefore produce two matches for the same ActionKey on different objects with wildly different DueTimes. With the old code, a not-yet-due sibling encountered first would set done[ActionKey] = true and then the next sibling — even though its DueTime had already passed — would be skipped. Future events for the same rule would also be suppressed for the rest of the run. Objects that should have been deleted weren't. Fix: drop the early-stop optimization. Process every match independently. A future-DueTime match is now silently skipped without affecting any later match. The performance hit is small (Phase 2 is a single bounded daily pass, and the rate limiter is the real throughput governor); the correctness gain is non-negotiable. Also fixes the inverted comment in processMatches that described the old check as "due_time is past now" when it actually checked DueTime.After(now) (i.e., NOT yet due). Adds four targeted tests: - not-yet-due match first in slice does not suppress two later due matches for the same rule; - reversed slice ordering produces identical dispatch; - BLOCKED outcome halts the loop before later due matches are sent; - empty match slice is a no-op. Phase 4's walker-and-recovery integration can revisit a per-(rule, object) memoization if profiling argues for it. * fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency Addresses PR #9446 review feedback. Eight distinct fixes: 1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the persisted cursor to lastOK = TsNs of the last event processed, including events whose matches were skipped as not-yet-due. Those skipped matches would never be re-scanned, so objects under long-TTL rules would never expire. Track a "stuck" flag in drainShardEvents: the first event with a skipped (future-DueTime) match stops cursorAdvanceTo from rising, but the loop keeps processing later events to dispatch any that ARE due. The persisted cursor sits at the last fully-processed event so tomorrow's run re-scans from the skipped event onward and the future-due matches get re-evaluated when they age in. processMatches now returns (skippedAny, halted, err) so the drain loop can tell apart "event fully drained" from "event had pending future-due matches." 2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the ActionKind. A replay-eligible kind with Mode != ModeEventDriven (e.g. ModeScanOnly via retention promotion) passed the check but then got silently ignored by router.Route, which gates dispatch on Mode == ModeEventDriven. Reject loudly with the typed error so admin sees the rejection in the activity log. 3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard goroutines regardless of cfg.Workers. Add a Workers field to dailyrun.Config and gate the goroutine fan-out on a semaphore of that size; the handler now passes cfg.Workers through. 4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per shard. Mid-run Compile would let shards process different rule sets. Capture snap at the top of Run, pass it down to every shard. 5. FROZEN runNow (coderabbit). drainShardEvents and processMatches accepted a `now func() time.Time` and called it multiple times. DueTime comparisons would slip as the run wore on. Capture runNow once at the top of Run and thread it through as a time.Time value. 6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case broke out of the loop and returned nil, marking interrupted runs as successful. Return ctx.Err() instead so the caller propagates the interrupt; cursorAdvanceTo carries whatever progress was made. 7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently accepted empty files, mismatched shard_ids, and hash slices shorter than 32 bytes (copy() would zero-pad). Each now returns a typed error so the run halts and an operator investigates rather than silently re-scanning from time zero or persisting a zero-padded hash that masks corruption forever. 8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted" guard in runShard was unreachable because drainShardEvents initialized lastOK := startTsNs and only ever raised it. Removed along with the new cursor-advancement semantics that handle the "no events processed" case implicitly. Plus markdown lint: DESIGN.md fenced code blocks now carry a `text` language identifier to satisfy MD040. Skipped from the review: - gemini's "maxTTL == 0 incorrectly skips immediate expirations": actions with Days <= 0 don't compile to a CompiledAction (see weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new empty-replay sentinel uses `rsh == [32]byte{}` for clarity per gemini's suggested form, but the behavior is equivalent. Tests added/updated: - TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when all matches are past their DueTime. - TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins the new Mode check. - TestFilerCursorPersister_EmptyFileReturnsError, _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin the new validation rules. - Existing process-matches tests reshaped for the (skippedAny, halted, err) return tuple. Full build clean. Dailyrun + worker test packages green.	2026-05-11 18:07:17 -07:00
Chris Lu	46bb70d93e	feat(s3): stamp noncurrent_since on versioned demotions (#9431 ) * feat(s3): stamp noncurrent_since on versioned demotions A version's noncurrent TTL clock starts when the next version is written, not at its own mtime. Today the lifecycle engine derives that moment from the next-newer sibling's mtime — a heuristic that drifts if the sibling is later modified and is unavailable when the demoting event sits outside meta-log retention. Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at the two places where a PUT flips the latest pointer: updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning. Timestamp source is time.Now().UnixNano() captured once per demotion — the documented Phase 1 fallback until the filer write API surfaces its own TsNs. Engine reads the stamp on both the bootstrap walker path and the event-driven router; missing/zero falls back to the legacy sibling-mtime derivation, so pre-stamp entries keep working. Prerequisite for the daily-replay lifecycle worker (Phase 2+). * fix(s3): address CI failure and PR review feedback - Backdating tests must move both clocks: the lifecycle integration tests backdate version mtimes to simulate aging, but my earlier commit made the engine prefer the explicit demotion stamp over sibling mtime, so a real-now stamp dominated a backdated mtime and the rule never fired. Update backdateVersionedMtime to also rewrite Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it. This is a test simplification — production stamps record when the successor was written, not the demoted version's own mtime — but the resulting clock is correctly old enough. - Refactor stamp parsing into one shared helper. Per gemini-code-assist: the parsing logic for ExtNoncurrentSinceNsKey was duplicated in router/router.go and scheduler/bootstrap.go. Move it to a new weed/s3api/s3lifecycle/noncurrent_since.go as exported SuccessorFromEntryStamp; both call sites now go through it. - Make the parser ordering test deterministic. Per coderabbitai: time.Now().UnixNano() drops the monotonic clock component, so two back-to-back calls can decrease if the wall clock steps backward — the prior test was exercising OS clock behavior rather than the parser. Replace with fixed nanosecond values. - Close a suspended-versioning race. Per coderabbitai: the prior putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning after putToFiler returned, i.e. after the object write lock released. A concurrent PUT could promote a newer latest version, which we'd then wipe — leaving the older "null" object incorrectly current. Move the cleanup into the afterCreate callback so the null write and the .versions pointer clear (including the new demotion stamp) run atomically under the same lock. Best-effort logging is preserved. * fix(s3/lifecycle): clear noncurrent_since stamp on test backdate Backdating a version's mtime in tests is not a coherent claim about when it became noncurrent — production stamps record the successor's PUT time, which the test doesn't manipulate. The prior commit rewrote the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent that creates an inconsistent state: v3's stamp says "demoted 30 days ago" while v4's mtime (the supposed demoter) is real-now. With both NewerNoncurrentVersions and NoncurrentDays in the same rule, the NoncurrentDays floor passes against the backdated stamp and the rank-based check then deletes v3 via the meta-log historical replay that misranks against current state. Clearing the stamp instead lets the lifecycle engine fall back to the sibling-mtime derivation the tests were originally written against: the legacy code path is preserved end-to-end while the new explicit- stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go and the bootstrap-walker integration in scheduler/bootstrap_test.go. The deeper interaction — historical meta-log replay ranking against current state inside routePointerTransitionExpand — is pre-existing and is no longer masked by the freshly-PUT successor's mtime once the stamp is read. Tracked separately; not blocking this PR. * fix(s3): stamp noncurrent_since before the .versions/ pointer flip The pointer-flip on the .versions/ directory emits a meta-log event that the lifecycle router consumes via routePointerTransition. The router then calls LookupVersion on the demoted version's id. With the prior ordering — pointer flip first, stamp second — the router could read the demoted entry before markVersionNoncurrent landed and fall back to the legacy sibling-mtime derivation. Versioned COPY is the clean break: the new latest version keeps the source object's mtime instead of recording the moment v_old was demoted, so the fallback's successor clock can be arbitrarily wrong. Reorder both updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning so the stamp is written first; the pointer flip then emits an event into a state where the stamp is already present. Failure of the stamp write remains non-fatal — lifecycle still falls back to the legacy derivation in that case, with the same caveats as before the PR but no race window.	2026-05-11 13:41:33 -07:00
Chris Lu	9a70bbfcc6	feat(s3api): full-chunk gzip pass-through skips volume-side decompress (#9427 ) Building on the io.Pipe streaming chunk copy: when a copy operation covers an entire source chunk (the common case for Harbor's part-size = chunk-size assemble pattern), ask the source volume for compressed bytes via Accept-Encoding: gzip and forward them to the destination as-is. This trades a Range fetch (where the volume decompresses the chunk internally to satisfy the byte range) for a full-chunk fetch that returns whatever wire bytes the chunk is stored as. For gzipped chunks the source volume avoids the decompression entirely; we never allocate a chunk-sized decompress buffer. Implementation: build the source GET directly instead of going through ReadUrlAsStream, because that helper auto-decompresses gzip responses (which would defeat the point). Trust the response's Content-Encoding header over caller hints — for partial ranges the volume always returns raw bytes regardless of how the chunk is stored, so labeling those as gzip would corrupt subsequent reads. End-to-end repro impact (512 MiB src, 6 parallel UploadPartCopy): + #9420/#9421/#9422 : 2236 MiB + io.Pipe streaming : 1521 MiB + this commit : 1149 MiB (round 2 RSS, perfectly flat) Round 3 now completes (was hitting volume-full before, since chunks took up uncompressed space on disk; we now store the gzipped chunks the volume gives us, which fit in the test's 8 GiB volume budget). Heap inuse_space (after force GC): before all: ~1.5 GiB this PR: 266 MiB Volume-side bytes.Buffer.ReadFrom inuse: before: 611 MiB streaming: 571 MiB this PR: 297 MiB (now in destination-volume parseUpload's size-hint decompression — separate optimization opportunity for a hint header)	2026-05-10 14:55:59 -07:00
Chris Lu	4a04594826	feat(s3api): stream chunk copy via io.Pipe to cut peak working set (#9424 ) * fix: cap pool retention so chunk-copy buffers don't hoard memory Two pool-retention sites kept the runaway-RSS pattern in #6541 visible even after #9420 and #9421: * weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into sync.Pool regardless of how big it had grown. After a 64 MiB chunk upload through volume.PostHandler -> needle.ParseUpload, the pool hoarded a 64 MiB byte array per cached entry for the rest of the process's lifetime. Cap retention at 4 MiB; oversized buffers are dropped so GC can reclaim the backing array. * weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer unset, so operation.upload_content fell back to the package-global valyala/bytebufferpool. That pool also retains high-water buffers forever, and concurrent UploadPartCopy filled it with one chunk-sized buffer per concurrent upload. Provide a fresh per-call bytes.Buffer pre-sized to chunk + multipart framing; it's GC'd as soon as the upload returns. Tests: - weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized buffers don't round-trip), the inverse (right-sized buffers do), and nil-safety. - weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption and pin that BytesBuffer is always non-nil and pre-sized, and that each call gets a distinct buffer. * feat(s3api): stream chunk copy via io.Pipe to cut peak working set Final piece for #6541. The buffered chunk-copy path holds two chunk-sized buffers per copy in flight (download buffer + multipart- encoded upload buffer). Under concurrent UploadPartCopy that put a floor on RSS at concurrency × 2 × chunk_size — about 768 MiB for the 6-way / 64 MiB Harbor-style assemble repro, even after the previous pool/retention fixes. Replace the buffered path with an io.Pipe between the source GET and the destination POST: ReadUrlAsStream pumps data into the pipe via a multipart.Writer, the http.Client reads from the pipe end and POSTs the body. In-flight per copy is now ~32 KiB (pipe hand-off + http buffers), regardless of chunk size. The streaming path is gated by canStreamCopyChunk: only used when no in-transit transformation is needed (no per-chunk CipherKey, no SSE). SSE-C / SSE-KMS / SSE-S3 paths still go through the buffered path, which already handles re-encryption correctly. Benchmarks (Apple M4, httptest source/dest, B/op = bytes per copy): Buffered 1 MiB: 6.0 MB B/op, 443 MB/s Streamed 1 MiB: 374 KB B/op, 727 MB/s Buffered 8 MiB: 56 MB B/op, 559 MB/s Streamed 8 MiB: 379 KB B/op, 1138 MB/s Buffered 64 MiB: 455 MB B/op, 718 MB/s Streamed 64 MiB: 304 KB B/op, 1387 MB/s End-to-end repro (512 MiB src, 6 parallel UploadPartCopy): pre-#9420 RSS round 2: 3134 MiB + #9420/#9421/#9422 : 2236 MiB + this PR : 1521 MiB heap inuse_space : 350 MiB (was 1422 / 1187 MiB) HeapSys (MemStats) : 1.74 GiB (was 2.49 GiB) * review: surface shouldRetry, add int32 guard, drop redundant drains Address review on PR 9424: * coderabbit (HIGH, line 122): ReadUrlAsStream can set shouldRetry=true with readErr=nil. Before this fix, that fell through to mw.Close() and the destination POST succeeded against a possibly-truncated multipart body. Mirror downloadChunkData's explicit check and surface shouldRetry as a producer error so the dst POST aborts. * gemini (line 98): chunk size is int64 but ReadUrlAsStream takes int. Reject sizes above MaxInt32 up front so the int(size) cast can't truncate negative on 32-bit platforms — same guard downloadChunkData uses. * gemini (line 151): util_http.CloseResponse already drains the body (io.Copy(io.Discard, ...) inside the helper) before closing, so the manual io.Copy drains we added are redundant. Drop them. * review: cancel source GET when destination POST fails Address coderabbit review (line 165 / second pass on PR 9424): when the POST leg fails or returns an error status, closing pipeReader only fails the producer's writes. ReadUrlAsStream's own read loop runs under the parent ctx, so it keeps draining the source body in the background until EOF — wasting source-volume bandwidth and CPU on a copy that's already failed. Wrap streamCopyChunkRange in a child context cancelled on return. ReadUrlAsStream checks ctx.Done() per 256 KiB tick, so the in-flight read aborts on the next iteration once the function returns. The POST also moves to streamCtx so the in-flight request can be cancelled the same way if the producer fails first. Defer-cancel runs after both legs return, so the success path still sends EOF cleanly through pipeWriter.Close before cancellation.	2026-05-10 14:29:39 -07:00
Chris Lu	d8bbc1d855	fix: cap pool retention so chunk-copy buffers don't hoard memory (#9422 ) Two pool-retention sites kept the runaway-RSS pattern in #6541 visible even after #9420 and #9421: * weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into sync.Pool regardless of how big it had grown. After a 64 MiB chunk upload through volume.PostHandler -> needle.ParseUpload, the pool hoarded a 64 MiB byte array per cached entry for the rest of the process's lifetime. Cap retention at 4 MiB; oversized buffers are dropped so GC can reclaim the backing array. * weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer unset, so operation.upload_content fell back to the package-global valyala/bytebufferpool. That pool also retains high-water buffers forever, and concurrent UploadPartCopy filled it with one chunk-sized buffer per concurrent upload. Provide a fresh per-call bytes.Buffer pre-sized to chunk + multipart framing; it's GC'd as soon as the upload returns. Tests: - weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized buffers don't round-trip), the inverse (right-sized buffers do), and nil-safety. - weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption and pin that BytesBuffer is always non-nil and pre-sized, and that each call gets a distinct buffer.	2026-05-10 13:34:25 -07:00
Chris Lu	926a8e9351	fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup (#9420 ) * fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup downloadChunkData accumulated the streamed chunk into a nil []byte via `chunkData = append(chunkData, data...)`. ReadUrlAsStream pumps in 256 KiB ticks, so a 64 MiB chunk grew the slice geometrically (256K → 512K → 1M → ... → 64M), allocating ~2x the chunk size for every transferred byte. Combined with the 4-way per-request concurrency and any number of in-flight UploadPartCopy calls (Harbor multipart assemble), this is what produces the runaway-RSS pattern reported in #6541. Pre-size the receive buffer to the known sizeInt so the callback fills in place. Add a regression test that downloads a 16 MiB chunk through httptest and asserts TotalAlloc stays under 1.5x the chunk size — the pre-fix code allocates ~5x and trips the bound. Local repro (weed 4.23, 6 parallel UploadPartCopy on a 512 MiB source): before: baseline 96 MiB → peak 3124 MiB, never reclaimed pprof: 650 MiB inuse in bytes.growSlice + 461 MiB in downloadChunkData.func1 * test(s3api): assert downloaded chunk content matches payload Address PR review feedback: the allocation-bound check alone would still pass if a future regression silently truncated or corrupted the chunk. Compare the returned bytes against the source payload (after the TotalAlloc measurement window so bytes.Equal doesn't pollute it).	2026-05-10 12:08:06 -07:00
Chris Lu	82648cca53	test(s3/lifecycle/engine): pin delay-group dedup across buckets (#9418 ) Compile a 100-bucket × 5-rule snapshot where the five Days values include duplicates (1, 1, 7, 7, 30) and assert: - snap.actions has 500 entries — every (bucket, rule) compiles to its own ActionKey, no collapse. - snap.originalDelayGroups has exactly 3 entries — the routing index is keyed by Delay, so same-day rules across all buckets share a group. This is the property that lets the dispatcher index by delay group rather than per-rule. - Per-group key count = (rules with that day) × buckets, so every action is reachable from its group entry.	2026-05-10 10:36:54 -07:00
Chris Lu	1b1d4aa814	refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate (#9417 ) * test(s3/lifecycle): integration coverage for versioning + filters First integration-test bundle building on the existing single-test backdating harness. Each scenario follows the same shape: create bucket, set lifecycle, PUT object, backdate mtime via filer UpdateEntry, run the shell command for one shard sweep, assert S3-side state. Five new tests: - TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a versioned bucket must produce a delete marker (latest after worker runs is a marker) AND keep the original version directly addressable by versionId. ListObjectVersions confirms IsLatest=true on the marker. - TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent), backdate v1, run worker. v1 must be gone, v2 still current. - TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent + expired-delete-marker) cleans up a sole-survivor marker. PUT v1, DELETE (creates marker), backdate both, run worker. Every version AND marker must be gone for the key. - TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled must not produce dispatches even on a backdated match. Negative test for the engine's enabled-status gate. - TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches objects carrying the tag. Two backdated objects (one tagged, one not) — only the tagged one is removed. Helpers extracted to keep each test focused: putVersioningEnabled, putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle, backdateVersionedMtime (ages a specific .versions/v_<id> entry), runLifecycleShard (one-shot shell invocation with FATAL guard). * test(s3/lifecycle): tighten noncurrent expiration diagnostics Local run showed TestLifecycleNoncurrentVersionExpiration failing with a bare 404 on HEAD(latest), not enough to tell whether v2 was deleted, the bare-key pointer was removed, or a delete marker was synthesized. Strengthen the test to: - HEAD by versionId=v2 first, so we pin "v2 file still on disk" separately from "the latest pointer resolves to v2" - on HEAD(latest) failure, log ListObjectVersions output (versions + markers, with IsLatest) so the next failure shows which side the bug is on rather than just NotFound * test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload Exercises the lifecycleAbortMPU handler path that the prefix-based expiration tests can't reach — routing keys off of .uploads/<id>/ directory events, not regular object events, and the dispatcher uses a different RPC path (rm on the .uploads/<id>/ folder). Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1, CreateMultipartUpload, UploadPart (so the directory carries the right shape), backdate the .uploads/<uploadID>/ directory entry 30 days, run the worker. The upload must drop out of ListMultipartUploads. Helpers added: putAbortMPULifecycle, backdateUploadDir. * test(s3/lifecycle): integration coverage for NewerNoncurrentVersions NewerNoncurrentVersions=N keeps the N most recent noncurrent versions and expires the rest. Distinct from per-version NoncurrentDays — depends on per-version rank, not just per-version age — and routes through routePointerTransition's "needs full expansion" path. Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3 noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1 floor, run the worker. Expect v1+v2 expired (older noncurrent), v3 (newest noncurrent within keep=1) and v4 (current) preserved. Helper added: putNewerNoncurrentLifecycle. * test(s3/lifecycle): integration coverage for suspended-versioning Expiration Suspended versioning takes a distinct code path in lifecycleDispatch: the VersioningSuspended branch first deletes the null version (via deleteSpecificObjectVersion(versionId="null")) and then writes a fresh delete marker on top. Other branches (Enabled → only writes a marker; Off → straight rm) miss this two-step. Setup: enable versioning, PUT v1 (real versionId), suspend versioning, PUT again (creates the null version, demotes v1 to noncurrent), set the Expiration rule, backdate the null at the bare path. Expect: latest is now a fresh delete marker, the "null" version is gone from ListObjectVersions, and v1 (noncurrent under Enabled) still addressable directly — suspended Expiration must only touch the null, not other versions. Helper added: putVersioningSuspended. * test(s3/lifecycle): integration coverage for multi-bucket sweep A single shell-driven shard sweep must process every bucket carrying lifecycle config, not just the first one alphabetically. Pinned because the scheduler iterates the buckets directory and a regression that returns early after the first match would silently disable lifecycle for every later bucket. Two buckets, each with their own prefix-expiration rule and a backdated object. Both must be expired after the same sweep. * test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter ObjectSizeGreaterThan is a strict > gate (filterAllows uses ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the boundary: an object whose size equals the threshold must remain; only an object strictly larger expires. Catches a > vs >= flip. Two backdated objects on the same prefix, sizes 100 and 150 with threshold=100 — boundary survives, larger expires. * test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup Tests share one weed mini server. Two pollution modes were producing order-dependent failures: - A later test's shard sweep would still load the prior test's lifecycle config (the worker reads every bucket's XML from filer state, and DeleteBucket alone doesn't drop lifecycle config cleanly on this codebase). - Versioned-bucket tests left versions + delete markers behind that ListObjectsV2 can't see, so the existing best-effort empty-then- delete didn't actually empty those buckets. - The AbortMPU test intentionally leaves an in-flight upload; without an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty. Cleanup now runs DeleteBucketLifecycle, ListObjectVersions → DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload, then DeleteBucket. Best-effort throughout so a half-torn-down bucket doesn't fail the cleanup chain. * test(s3/lifecycle): backdate both versions for NoncurrentDays clock Per codex review: NoncurrentDays is clocked from the SUCCESSOR version's mtime (when the displaced version became noncurrent), not from the displaced version's own mtime. Backdating only v1 left the clock (v2's mtime) at "now" and the rule never fired — the test was wrong, not the production path. Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold relative to v2, the noncurrent rule fires, and v2 stays current. * test(s3/lifecycle): assert specific NotFound on multi-bucket deletion Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any HeadObject error as "deleted", which lets a transport failure or dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404 specifically via a small isS3NotFound helper so the assertion actually proves deletion happened, not just that the call broke. * test(s3/lifecycle): gofmt size-filter test * test(s3/lifecycle): integration coverage for Object Lock skip Object Lock retention must override the lifecycle rule. The handler's enforceObjectLockProtections check (s3api_internal_lifecycle.go:47) returns an error when retention is active; the dispatcher then classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays. No existing integration test reaches that outcome. Setup: bucket created with ObjectLockEnabledForBucket=true, expiration rule on prefix "lock/", two backdated objects under the same prefix — one with GOVERNANCE retention until 1h from now, one without. After the worker runs, the unlocked object expires (positive control); the locked one survives. Custom cleanup uses BypassGovernanceRetention so the test can drop the locked version when the test finishes — otherwise the retention window keeps the bucket from being deleted. * test(s3/lifecycle): integration coverage for config update between sweeps An operator changes the lifecycle rule between two shell-driven sweeps. The second sweep must respect the NEW rule, not a cached copy of the old one. Each runLifecycleShard invocation spawns a fresh weed shell subprocess, so cached engine state from a previous sweep doesn't persist — but a regression that caches rules across PutBucketLifecycleConfiguration calls within the S3 server itself would still surface here. Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker → firstKey expires. Update rule to prefix="second/", PUT + backdate secondKey AND a new key under the OLD prefix ("first/post-update.txt"). Sweep 2 must expire only the second-prefix object; the post-update old- prefix one must survive — config replacement, not merge. * test(s3/lifecycle): integration coverage for ExpirationDate (past) Rules with Expiration{Date: <past>} route through ScanAtDate in the engine (decideMode's ActionKindExpirationDate case) — a separate compile + dispatch branch from the EventDriven delay-group path the Days-based tests exercise. Past date + in-prefix object → must expire. Out-of-prefix object → must remain. Object also backdated as defense-in-depth so the assertion doesn't depend on whether the dispatcher consults MinTriggerAge for date kinds. * test(s3/lifecycle): integration coverage for bootstrap walk on existing objects Production scenario: operator enables lifecycle on a bucket that already holds objects from before the policy. The worker must discover them via the bootstrap walk (BucketBootstrapper) — there were no meta-log events to observe because the objects predate the rule. Without the bootstrap path, only NEW writes would ever match. Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix survivor, backdate all, THEN set the Expiration rule, run the worker. Every in-prefix pre-existing object must be expired; the out-of-prefix one must remain. * test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching Operator UX: after DeleteBucketLifecycle, the worker must observe the removal on the next sweep and stop expiring objects under the now-gone rule. A regression that caches old configs across PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep silently dropping objects. Setup: positive control (rule active, backdated obj expires) → DeleteBucketLifecycle → PUT + backdate a fresh object → second sweep. The fresh object must remain. * test(s3/lifecycle): integration coverage for empty bucket sweep no-op A bucket carrying lifecycle config but no objects must produce a successful sweep — no hangs, no errors, no dispatches. Pinned because the bootstrap walker iterates bucket directories, and an empty directory is a corner of that traversal that's easy to break (slice-bounds bug on the first listing returning zero entries). Asserts: worker logs "loaded lifecycle for" and "shards 0-15 complete", no FATAL output, bucket still exists after the sweep. * test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate ObjectLock: enabling Object Lock on a bucket implicitly enables versioning, so PUT objects land at .versions/v_<id>, not at the bare key. The test was calling backdateMtime (bare path) and failing in the helper with "filer: no entry is found". Switch to backdateVersionedMtime with the versionId returned by PutObject. ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard shell command yet — the bootstrap walker explicitly skips actions in ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date- triggered bootstrap" but no such bootstrap exists in the scheduler or shell). Skip with a t.Skip + explanation so the test activates the moment the date-triggered path lands. * fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker The walker explicitly skipped ModeScanAtDate actions on the comment "SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such bootstrap exists in the scheduler or shell layer. The result: rules with Expiration{Date: ...} compiled correctly, populated the snapshot's dateActions map, and were never dispatched. ExpirationDate is silently a no-op in production. EvaluateAction already handles ActionKindExpirationDate correctly (rejects when now.Before(rule.ExpirationDate), otherwise emits ActionDeleteObject). The walker just needed to fall through instead of skipping. Pre-date walks become no-ops via EvaluateAction's date check; post-date walks expire eligible objects. Un-skip TestLifecycleExpirationDateInThePast — it now exercises the fixed path end-to-end. * test(s3/lifecycle): integration coverage for multiple rules per bucket A single bucket carries two independent Expiration rules with disjoint prefix filters and different Days thresholds. Each rule must fire only on its prefix; objects outside both prefixes must survive. Pinned because Compile builds one CompiledAction per rule per kind all sharing the same bucket index — a bug that lets one rule's prefix or threshold leak into another (e.g. last-write-wins on a shared map) would silently expire wrong objects. Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/ Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin, data/keep.bin. After the worker runs, logs/ + tmp/ are gone; data/ — outside both rule prefixes — survives. * fix(s3/lifecycle): mark ScanAtDate actions active in Compile Two layers were silently filtering ScanAtDate actions out of routing: the walker's mode skip (fixed in `e785f59d6`) and Compile only marking ModeEventDriven actions active. MatchPath / MatchOriginalWrite both require IsActive() to emit a key, so a ScanAtDate action that's never marked active never reaches a dispatch path even after the walker falls through. ScanAtDate's only dispatch path is the bootstrap walk's MatchPath call — there's no bootstrap-completion rendezvous to wait on. Make the active flag include ModeScanAtDate alongside the EventDriven+BootstrapComplete combination. ExpirationDate-based rules now actually fire end-to-end. The TestLifecycleExpirationDateInThePast integration test exercises this. * fix(s3/lifecycle): route date kinds via ComputeDueAt ExpirationDate has MinTriggerAge=0, so router computed dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's now.Before(rule.ExpirationDate) check returned ActionNone and the date rule never fired through the event-driven path. ComputeDueAt already knows the per-kind shape — rule.ExpirationDate for date kinds, ModTime+Days for the rest — so use it as the single source of truth for dueTime in Route's main loop. * test(s3/lifecycle): pin bootstrap walker date dispatch The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6 behavior that the regular walker skipped ExpirationDate. That walker was rewired to fire date rules whose date has passed (the SCAN_AT_DATE bootstrap was never wired); update the test to match. Split into two: post-date entries dispatch, pre-date entries don't. * test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup constructs a combined noncurrent + expired-marker rule inline, which the helper doesn't cover. The blank-assignment workaround was just hiding dead code; remove both. * test(s3/lifecycle): tighten HeadObject termination check to typed not-found Generic err != nil also passes on transport/auth/timeouts, letting the test go green without proving the lifecycle action actually fired. Switch the three Eventuallyf HeadObject predicates to isS3NotFound, matching the pattern already in the multi-bucket and expiration-date tests. * test(s3/lifecycle): guard ListObjectVersions diagnostic against nil When ListObjectVersions errors, listOut is nil and the diagnostic log path panics on listOut.Versions before the real assertion fires. Branch on (listErr != nil \|\| listOut == nil) so the failure log is robust whatever ListObjectVersions returned. * refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate The metadata-only delete decision (entry.Attributes.TtlSec > 0) was inlined in lifecycleDispatch with no direct test. Lift it into a named predicate with the rationale comment moved onto the function and pin the four edge cases: nil entry, nil attributes, TtlSec=0, TtlSec>0, plus a defensive check that TtlSec<0 doesn't flip the path on.	2026-05-10 09:39:05 -07:00

1 2 3 4 5 ...

1184 Commits