1184 Commits

Author SHA1 Message Date
Chris Lu
2a4923e7e8 ObjectTransaction: filer-side forwarding via route_key (#9659)
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.
2026-05-24 14:21:06 -07:00
Chris Lu
1f0c366583 s3: route metadata-only self-copy off the distributed lock (#9638)
A non-versioned metadata-only self-copy (CopyObject with source == destination
and the REPLACE directive) is a read-modify-write of one entry, which is why it
held the distributed lock. It now routes to the owner as a serialized
PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements,
delete the dropped keys) onto a fresh read of the entry under its per-path lock,
so a concurrent change to non-managed keys (legal hold, retention, version id) is
preserved instead of clobbered, and bumps mtime.

PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended
self-copies create a new version (already routed via the copy finalize) and the
no-owner bootstrap keep the lock.
2026-05-24 12:32:57 -07:00
Chris Lu
fa7056dc6f s3: route object-lock version-specific deletes off the distributed lock (#9657)
A version-specific DELETE (real version or the null version, including
object-lock WORM-checked ones and governance-bypass) now runs as one routed
transaction on the object's owner instead of holding the distributed lock.

For a real version: recompute the .versions pointer excluding the version
(repoint-before-delete, so a crash leaves a recoverable orphan rather than a
dangling pointer), then delete the version file, under the object's per-path lock.
The null version is the regular object entry, deleted directly (no pointer).

Object-lock buckets gate the delete on the version's WORM guards evaluated on the
owner: legal hold (always) + retention (while not elapsed). Governance bypass
scopes the retention guard to COMPLIANCE mode, so the filer allows a
governance-mode delete while still denying compliance and legal hold — the
gateway never reads the version.
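
The guard evaluation described above can be sketched as follows. This is a minimal illustration, not the filer's code: the versionLock type, the deleteAllowed function, and the nanosecond timestamps are hypothetical names chosen for the example.

```go
package main

import "fmt"

// versionLock is a hypothetical view of a version's object-lock metadata.
type versionLock struct {
	legalHold     bool
	mode          string // "GOVERNANCE", "COMPLIANCE", or "" when no retention is set
	retainUntilNs int64
}

// deleteAllowed applies the two WORM guards in the order the text gives:
// legal hold always blocks, retention blocks while it has not elapsed, and a
// governance bypass narrows the retention guard to COMPLIANCE mode only.
func deleteAllowed(v versionLock, bypassGovernance bool, nowNs int64) bool {
	if v.legalHold {
		return false
	}
	if v.mode == "" || nowNs >= v.retainUntilNs {
		return true // no retention, or retention has elapsed
	}
	if bypassGovernance && v.mode != "COMPLIANCE" {
		return true
	}
	return false
}

func main() {
	gov := versionLock{mode: "GOVERNANCE", retainUntilNs: 100}
	fmt.Println(deleteAllowed(gov, false, 50), deleteAllowed(gov, true, 50))
}
```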

Three primitives make this expressible:
- ObjectTransaction.condition_key: evaluate the condition against a named entry
  (the version) while the lock stays on lock_key (the object).
- Recompute.exclude_name: omit a child from the scan, to repoint before delete.
- WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a
  mode, expressing governance bypass without a gateway-side read.
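
The repoint-before-delete ordering can be sketched like this. It is an illustration under assumptions: versionEntry, recomputeLatest, and deleteVersion are hypothetical names, and picking the latest version by mtime is assumed for the example rather than taken from the commit.

```go
package main

import "fmt"

// versionEntry is a hypothetical child of the .versions directory.
type versionEntry struct {
	name    string
	mtimeNs int64
}

// recomputeLatest scans the children, skips excludeName, and returns the
// newest remaining version. The bool is false when nothing remains.
func recomputeLatest(children []versionEntry, excludeName string) (string, bool) {
	best, found := versionEntry{}, false
	for _, c := range children {
		if c.name == excludeName {
			continue
		}
		if !found || c.mtimeNs > best.mtimeNs {
			best, found = c, true
		}
	}
	return best.name, found
}

// deleteVersion repoints first and deletes second, so a crash between the two
// steps leaves an orphaned version file rather than a dangling pointer.
func deleteVersion(children []versionEntry, name string) (remaining []versionEntry, latest string, hasLatest bool) {
	latest, hasLatest = recomputeLatest(children, name) // step 1: repoint
	for _, c := range children {                        // step 2: delete
		if c.name != name {
			remaining = append(remaining, c)
		}
	}
	return remaining, latest, hasLatest
}

func main() {
	children := []versionEntry{{"v1", 10}, {"v2", 20}, {"v3", 30}}
	remaining, latest, ok := deleteVersion(children, "v3")
	fmt.Println(len(remaining), latest, ok)
}
```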
2026-05-24 11:41:08 -07:00
Chris Lu
eeda7181aa s3: route multipart-upload completion off the distributed lock (#9632)
completeMultipartUpload routes its writes to the object's owner filer when an
owner is known, off the distributed lock. Idempotent replay is handled
gateway-side in prepareMultipartCompletionState (it returns the existing result
when the object already carries this UploadId), so the lock is not needed to
dedupe retries; with no owner yet, the lock remains as the bootstrap path.

Versioned completion flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via
routedMkFile (a routed PUT) so the write serializes with concurrent writes to
the same key on the owner's per-path lock. The version file itself is a unique
path and stays a plain mkFile.
2026-05-24 11:07:39 -07:00
Chris Lu
4b9d46b5ad s3: route versioned COPY and delete-marker off the DLM (#9633)
s3: route versioned/suspended delete markers and versioned COPY off the lock

createDeleteMarker flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or
Suspended DeleteObject takes its pointer flip off the distributed lock; the
delete marker file is written first and the owner re-derives the pointer.

DeleteObjectHandler routes a versioned/suspended delete with no specific version
straight to the owner, off the lock. A specific-version delete and object-lock
buckets keep the lock (the former needs a recompute-after-delete handled
separately; the latter needs gateway-side enforcement).

CopyObject into a versioned bucket finalizes the new version through the same
routed pointer flip.
2026-05-24 07:22:27 -07:00
Chris Lu
5bac8b9281 s3: route object-lock object writes off the distributed lock (#9635)
routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT
(which creates a new version, never overwriting a locked one) and a
non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the
owner filer like any other write.

routedObjectOwner still excludes object-lock: an unversioned object-lock delete
enforces WORM under the lock, so it stays there rather than routing past the
check. Version-specific deletes likewise stay on the lock — routing them needs
the WORM check (on the version entry) and the latest-pointer recompute (on the
object) under one transaction, which the current single condition target cannot
express.
2026-05-24 07:20:44 -07:00
Chris Lu
db954b5503 s3: route versioned PutObject finalize off the DLM (#9631)
s3: route versioned PutObject finalize off the distributed lock

A versioned write's finalize (flip the .versions pointer to the newest version,
demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction
on the object's owner filer, under its per-path lock, instead of the unserialized
updateLatestVersionInDirectory. The version file is written first; the owner
re-derives the pointer by scanning the directory.

RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's
size and mtime on the pointer, and demote_key / demote_value to stamp the
displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves.

Falls back to updateLatestVersionInDirectory when no owner is known yet.
2026-05-24 03:10:30 -07:00
Chris Lu
32aa70ab59 s3: serialize bucket config writes with field-level filer patches (#9655)
When PutBucketVersioning and PutBucketEncryption ran concurrently, each did a
whole-entry read-modify-write of the bucket entry, so one could overwrite the
other's field with a stale copy. Each config write is now a field-level
PATCH_EXTENDED (extended attributes) or set_content (the metadata blob)
ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh
read under its per-path lock. Disjoint fields no longer clobber each other.
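
A minimal sketch of why a field-level patch avoids the clobbering, assuming a hypothetical in-memory bucketEntry whose mutex stands in for the owner's per-path lock. None of these names come from the filer code.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketEntry stands in for the bucket's filer entry; only the extended
// attributes are modeled here.
type bucketEntry struct {
	mu       sync.Mutex // plays the role of the owner's per-path lock
	extended map[string]string
}

func newBucketEntry() *bucketEntry {
	return &bucketEntry{extended: make(map[string]string)}
}

// patchExtended merges a patch onto the current state under the lock: keys in
// set are written, keys in del are removed, and all other keys are untouched.
func (e *bucketEntry) patchExtended(set map[string]string, del []string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	for k, v := range set {
		e.extended[k] = v
	}
	for _, k := range del {
		delete(e.extended, k)
	}
}

// get returns one attribute and whether it is present.
func (e *bucketEntry) get(key string) (string, bool) {
	e.mu.Lock()
	defer e.mu.Unlock()
	v, ok := e.extended[key]
	return v, ok
}

func main() {
	entry := newBucketEntry()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		entry.patchExtended(map[string]string{"versioning": "Enabled"}, nil)
	}()
	go func() {
		defer wg.Done()
		entry.patchExtended(map[string]string{"encryption": "AES256"}, nil)
	}()
	wg.Wait()
	v, _ := entry.get("versioning")
	enc, _ := entry.get("encryption")
	fmt.Println(v, enc)
}
```

Because each writer sends only its own field, neither carries a stale copy of the other's field to write back.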
2026-05-24 02:30:26 -07:00
Chris Lu
f9bc6adf98 s3: route single-entry object writes to the owner filer, off the DLM (#9629)
s3: route non-versioned object PUT and DELETE off the distributed lock

A non-versioned, non-object-lock object write now goes straight to the key's
owner filer as a single-mutation ObjectTransaction, which serializes it with the
owner's per-path lock and evaluates the precondition, instead of taking a
cluster-wide lock. PUT and DELETE use the object's full path as the lock key, so
a concurrent create and delete of the same key serialize against each other.

The fast path is taken only when the precondition reduces to clauses the filer
can evaluate (existence and a single strong-ETag match); time-based conditions,
ETag lists, weak ETags, post-create hooks, and an unknown owner fall back to the
lock. A routed mutation error other than a failed precondition also falls back,
so the lock path stays the authority for the cases it alone covers.

PrimaryForKey returns "" until the ring view arrives, keeping writes on the lock
until routing is known.
2026-05-24 02:10:32 -07:00
Chris Lu
f037fc4dce s3: dial the object lock's primary filer directly (#9626)
* s3: dial the object lock's primary filer directly

The S3 object write lock builds a fresh short-lived lock per write, each
starting at the seed filer. When the seed isn't the key's hash-ring primary
the filer forwards the request to the primary, and in multi-cluster setups
that forward crosses clusters on every write.

Give the lock client a view of the filer lock ring, fed by the master's
LockRingUpdate broadcasts the gateway already receives, so it dials the
primary directly. The view tracks filer membership by version; a stale view
stays correct because the filer still forwards as a fallback.

Also send the initial ring snapshot to S3 clients, not just filers.

* s3: subscribe to lock-ring updates before starting the master loop

The master delivers the initial LockRingUpdate once, on connect. Registering the
callback after KeepConnectedToMaster started left a window where that first
update could arrive before the handler was set and be dropped, delaying the ring
view until the next membership change. Build the lock client and register the
callback in the masters block before launching the loop; the filers block reuses
that client (or creates a plain one when no masters are configured).

* lock_manager: build the hash ring in a deterministic server order

rebuildRing ranged over the server set (a map), whose iteration order is
randomized per process. On a vnode hash collision the last writer into
vnodeToServer wins, so two nodes holding the same server set could resolve the
collision to different servers and disagree on the primary for keys near that
slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement
would route the same key to different filers and defeat per-path serialization.

Iterate the servers in sorted order so the ring is identical on every node with
the same set, regardless of discovery order.
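
The fix can be illustrated with a small sketch. buildRing, the vnode naming scheme, and the CRC32 hash are assumptions made for the example; only the sorted iteration mirrors what the commit describes.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// buildRing maps each virtual-node hash to a server. Servers are visited in
// sorted order, so when two vnodes hash to the same slot the same server wins
// on every node that holds the same server set.
func buildRing(servers map[string]struct{}, vnodes int) map[uint32]string {
	names := make([]string, 0, len(servers))
	for name := range servers { // map order is random, so collect then sort
		names = append(names, name)
	}
	sort.Strings(names)

	ring := make(map[uint32]string)
	for _, name := range names {
		for i := 0; i < vnodes; i++ {
			slot := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s#%d", name, i)))
			ring[slot] = name // last writer still wins, but the order is fixed
		}
	}
	return ring
}

func main() {
	servers := map[string]struct{}{"filer1:8888": {}, "filer2:8888": {}, "filer3:8888": {}}
	fmt.Println("slots:", len(buildRing(servers, 8)))
}
```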

* lock_manager: skip redundant ring rebuilds, trim comments

SetRing now ignores a non-zero version at or below the current one once a ring
exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the
ring.
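
A sketch of the version check, assuming a hypothetical ringView type. Treating a zero version as unversioned and always applying it is an assumption read from the wording above, not confirmed behavior.

```go
package main

import "fmt"

// ringView is a hypothetical holder for the lock ring and its version.
type ringView struct {
	built    bool
	version  int64
	servers  []string
	rebuilds int
}

// setRing skips the rebuild when a ring already exists and the incoming
// non-zero version is at or below the current one. It reports whether it
// rebuilt. A zero version is applied without touching the stored version.
func (r *ringView) setRing(servers []string, version int64) bool {
	if r.built && version != 0 && version <= r.version {
		return false
	}
	r.servers = append([]string(nil), servers...)
	if version != 0 {
		r.version = version
	}
	r.built = true
	r.rebuilds++
	return true
}

func main() {
	r := &ringView{}
	fmt.Println(r.setRing([]string{"filer1"}, 3), r.setRing([]string{"filer1"}, 3))
}
```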

* s3: hold the lock-ring client on the server for route-by-key

Store the object-write lock client on S3ApiServer so handlers can resolve a
key's owner filer via PrimaryForKey.
2026-05-24 00:40:43 -07:00
Aleksey
917a87928c fix(s3api/list): cancel ListEntries stream in hasChildren (#9617)
* fix(s3api/list): cancel ListEntries stream in hasChildren

* fix(s3api): use filer_pb.List in hasChildren

filer_pb.List already wraps the ListEntries stream in a cancellable
context, so the single-entry probe needs no separate helper or manual
context plumbing to avoid the leaked gRPC stream goroutine.

* fix(s3api): propagate request context into hasChildren

Thread r.Context() through listFilerEntries and hasChildren so the
implicit-directory probe cancels when the client disconnects, instead
of running on context.Background().

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-21 15:48:47 -07:00
Chris Lu
fbdcec1cba fix(s3): list empty directories as directory markers (#9615)
* fix(s3): list empty directories as directory markers

A real but empty directory created out of band (mount, mkdir, filer API)
carries no MIME, so it was hidden from S3 listings. hadoop-aws getFileStatus
probes LIST prefix=dir/&delimiter=/ and reads an empty result as a missing
path, which breaks Spark's eventLog.dir when it points at an empty directory.

Surface such directories as directory markers, matching directories created
via PutObject with a trailing "/". Emptiness comes from the recursion result,
and the marker MIME is set only on the in-memory listing entry, so empty
directories stay eligible for empty-folder cleanup.

* fix(s3): only surface empty directory markers for explicit dir probes

Restrict the empty-directory marker to a trailing-slash prefix probe
(prefix=dir/), the pattern hadoop-aws getFileStatus uses. Plain listings
are left as before, so an empty directory left behind by deleted objects
(e.g. after lifecycle expiration) is no longer shown as a phantom key.
2026-05-21 14:05:16 -07:00
Chris Lu
d82b3a8d6a refactor(s3): drop unused source path in copy ETag check
ETagEntry derives the tag from chunks/Md5/remote-etag, never the entry path,
so the conditional-copy check no longer builds a bogus FullPath.
2026-05-21 09:51:50 -07:00
Chris Lu
83b7ea5e7b fix(s3): keep server-side copy data in the bucket collection (#9607)
* fix(s3): keep server-side copy data in the bucket collection

UploadPartCopy and SSE-C CopyObject assigned destination volumes against
r.URL.Path, the S3 request URI. The filer derives a bucket's collection
only when the assign path sits under its buckets folder, so an S3 URI
routed copied bytes to the default collection instead of the destination
bucket's. Assign against the destination's real filer path.

* refactor(s3): centralize copy-part path and thread dstPath into SSE-C copy

Extract copyPartLocation so the fast path and writeEmptyCopyPart share one
definition of the .uploads/<id>/<n>_copy.part location. Pass the destination
filer path into copyChunksWithSSEC instead of re-deriving it from the request,
and thread it through key rotation so re-encrypt copies also assign in the
destination bucket's collection.
2026-05-21 09:35:42 -07:00
Mmx233
9b9fdb5b76 fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577)
* fix(s3): sync IAM policies to advanced IAM Manager policy engine

* test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync

* fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload

Sync the policies already loaded from the credential store into the IAM
Manager's engine from SetIAMIntegration itself, instead of re-running a
full LoadS3ApiConfigurationFromCredentialManager after setup. This covers
both startup orderings without a second filer round-trip or racing the
async loader goroutine: if the load won, the policies are in memory to
push; if SetIAMIntegration won, the load's own sync runs afterward.

Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock
so the per-request auth RLock path isn't blocked by the policy recompile.

* fix(s3): serialize IAM manager policy resync to avoid stale snapshots

SyncRuntimePolicies replaces the manager's full policy set, so applying a
policy view captured before a later mutation can resurrect a deleted
policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy,
SetIAMIntegration, and the credential-manager load) through a single
resyncIAMManagerPolicies that serializes on a dedicated mutex and reads
iam.policies fresh at apply time, so the live map always wins regardless
of interleaving. The load now installs the config into iam.policies
before resyncing, closing the window where the manager held policies the
map didn't yet have.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-21 00:39:42 -07:00
Chris Lu
c00aa90990 fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581)
Authentication records the identity with r.WithContext, which returns a
request copy. Handlers that log their own audit entry (PUT, DELETE,
tagging) see it, but GET/HEAD object and IAM operations rely on track()'s
fallback entry, which is built from the original request the auth copy
never reached - so requester came out empty.

Install a mutable identity holder on the request before authentication
and have SetIdentityNameInContext record into it. The holder is shared by
pointer across every request copy, so the fallback entry recovers the
authenticated requester. The per-request context value still takes
precedence, so nothing changes for handlers that see the auth copy.
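
The holder pattern can be sketched as follows. installHolder, authenticate, and requester are hypothetical stand-ins for the real functions, and a production version would need synchronization if the holder is read and written from different goroutines.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

type holderKey struct{}
type authKey struct{}

// identityHolder is mutable and stored by pointer, so every request copy
// derived from the same context sees the same holder.
type identityHolder struct{ name string }

// installHolder runs before authentication.
func installHolder(r *http.Request) *http.Request {
	return r.WithContext(context.WithValue(r.Context(), holderKey{}, &identityHolder{}))
}

// authenticate returns a request copy carrying the identity, as WithContext
// always does, and also records the identity in the shared holder.
func authenticate(r *http.Request, name string) *http.Request {
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		h.name = name
	}
	return r.WithContext(context.WithValue(r.Context(), authKey{}, name))
}

// requester prefers the per-request context value and falls back to the holder.
func requester(r *http.Request) string {
	if name, ok := r.Context().Value(authKey{}).(string); ok {
		return name
	}
	if h, ok := r.Context().Value(holderKey{}).(*identityHolder); ok {
		return h.name
	}
	return ""
}

// demo returns the requester seen through the pre-auth request and the auth copy.
func demo() (fromOriginal, fromCopy string) {
	r := installHolder(httptest.NewRequest(http.MethodGet, "/bucket/key", nil))
	authed := authenticate(r, "alice")
	return requester(r), requester(authed)
}

func main() {
	orig, authed := demo()
	fmt.Println(orig, authed)
}
```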
2026-05-20 10:13:33 -07:00
Chris Lu
cc5ef1b741 feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572)
* feat(s3): add TagUser, UntagUser, ListUserTags IAM actions

Adds AWS IAM-compatible user tag operations on the embedded IAM
endpoint. Tags persist in the Identity proto as a repeated UserTag
field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits
are enforced. Pagination is stubbed (IsTruncated=false) since the
50-tag cap means all tags fit in a single response.

* review: validate UntagUser TagKeys entries

parseTagKeysParams now rejects empty keys and keys past
MaxUserTagKeyLength; UntagUser additionally requires at least one
TagKeys.member.N entry to match AWS validation behavior.

* review: pre-allocate user-tag merge and filter slices

mergeUserTags now allocates the combined existing+incoming capacity
up front; UntagUser builds the filtered slice via make with the full
ident.Tags capacity instead of ident.Tags[:0:0], which forced a
reallocation on every append.

* review: cover duplicate-in-request and invalid TagKeys cases

Regression tests assert TagUser rejects two members with the same key
in one request, and UntagUser rejects missing/empty/oversized TagKeys
entries.
2026-05-19 17:35:44 -07:00
Chris Lu
37b6a14b0d feat(s3): add four bucket configuration handlers (#9570)
* feat(s3): add four bucket configuration handlers

- GetBucketPolicyStatus: computes IsPublic from the existing bucket policy
- PutBucketRequestPayment: companion writer to the existing GET; accepts
  only BucketOwner
- GetBucketAccelerateConfiguration: returns <Status>Suspended</Status>
- GetBucketLogging: returns an empty BucketLoggingStatus

Lets AWS SDK probes succeed instead of returning MethodNotAllowed.

* review: route GetBucketPolicyStatus through checkBucket

Mirrors the existence/auth gating used by other bucket handlers and
drops the bespoke filer_pb lookup so NoSuchBucket precedence is
consistent across the API surface.

* review: cap PutBucketRequestPayment body with MaxBytesReader

The body is unmarshalled as RequestPaymentConfiguration, which is a
handful of bytes; reject excessively large payloads up front and
defer Close immediately after wrapping.
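
A minimal sketch of the body cap using http.MaxBytesReader. The struct, the helper names, and the limits passed in are assumptions for illustration; the commit does not state the actual limit.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requestPaymentConfiguration is a hypothetical mirror of the XML body.
type requestPaymentConfiguration struct {
	XMLName xml.Name `xml:"RequestPaymentConfiguration"`
	Payer   string   `xml:"Payer"`
}

// parseRequestPayment wraps the body before reading it, so an oversized
// payload fails in ReadAll instead of being buffered in full.
func parseRequestPayment(w http.ResponseWriter, r *http.Request, limit int64) (string, error) {
	r.Body = http.MaxBytesReader(w, r.Body, limit)
	defer r.Body.Close()
	body, err := io.ReadAll(r.Body)
	if err != nil {
		return "", err
	}
	var cfg requestPaymentConfiguration
	if err := xml.Unmarshal(body, &cfg); err != nil {
		return "", err
	}
	return cfg.Payer, nil
}

// tryBody builds a test request around body and parses it with the given cap.
func tryBody(body string, limit int64) (string, error) {
	r := httptest.NewRequest(http.MethodPut, "/bucket?requestPayment", strings.NewReader(body))
	return parseRequestPayment(httptest.NewRecorder(), r, limit)
}

func main() {
	const body = "<RequestPaymentConfiguration><Payer>BucketOwner</Payer></RequestPaymentConfiguration>"
	payer, err := tryBody(body, 1024)
	fmt.Println(payer, err)
}
```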

* review: gate static getters on checkBucket

GetBucketAccelerateConfiguration and GetBucketLogging now run the
standard bucket existence check before returning the static
Suspended / empty-status response so a missing bucket cannot appear
to have valid configuration.

* review: share cache helper across misc tests; check io.ReadAll error

Accelerate and Logging tests now run through newMiscTestServer like
the others so the checkBucket guard sees a cached bucket; the
ReadAll error is explicitly checked.
2026-05-19 17:35:08 -07:00
Chris Lu
cee2bf697c feat(s3): stub bucket configuration list endpoints (#9571)
* feat(s3): stub bucket configuration list endpoints

Adds Get and List handlers for Analytics, Inventory, IntelligentTiering,
and Metrics bucket configurations. List returns an empty result with
IsTruncated=false; single-get returns NoSuchConfiguration so SDK error
parsing remains predictable.

* review: gate stubs on bucket existence

All eight stub handlers now call checkBucket via stubBucketGuard so
NoSuchBucket takes precedence over NoSuchConfiguration / empty-list
responses, matching AWS S3 precedence. Tests provide a cached bucket
so the guard sees it as present.
2026-05-19 17:34:51 -07:00
Chris Lu
285025eb73 s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap

Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.

The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.

Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.

* s3api: support group inline policies + Condition enforcement

PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.

Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.

Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.

The NotImplemented unit test is inverted to drive the new round-trip.

* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy

DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.

* s3api: scope group-policy resource names per test; tighten deny polling

- Integration test resource names get a per-test suffix so retried or
  parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
  rather than any non-nil error; transient setup errors no longer end
  the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
  is nil, matching Put/Get/DeleteGroupPolicy.

* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs

CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
2026-05-19 16:03:45 -07:00
Chris Lu
f72983c1fd fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"

The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".

Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.

ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.

* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal

The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.

Extend the matcher so a request is recognized as S3 Tables when either:

  - its AWS V4 credential scope names SERVICE=s3tables; or
  - it carries the canonical AWS JSON RPC 1.1 content type and is
    unsigned (a request explicitly signed for SERVICE=s3 still wins).

The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).

Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.

* s3api: extend service matcher to all S3 Tables routes; simplify scope check

- Apply serviceMatcher to every S3 Tables route, not just the bare-path
  ones. ARN-bearing paths could otherwise be hit by an S3 object key
  that starts with arn:aws:s3tables:..., inside a bucket named
  "buckets", "namespaces", "tables", or "tag". One matcher everywhere
  closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
  credential-scope check. The scope shape is fixed at
  AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
  and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
  is exactly s3tables. Existing unit cases (including the
  access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
  the single Read could return short and make the equality check flaky.
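
The simplified scope check can be sketched as below. credentialFromAuthHeader and scopeIsS3Tables are hypothetical names, and the sample credentials are made up; only the strings.Contains test on /s3tables/ follows the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// credentialFromAuthHeader extracts the Credential= value from an AWS V4
// Authorization header, or "" when the header carries none.
func credentialFromAuthHeader(auth string) string {
	_, rest, ok := strings.Cut(auth, "Credential=")
	if !ok {
		return ""
	}
	cred, _, _ := strings.Cut(rest, ",")
	return strings.TrimSpace(cred)
}

// scopeIsS3Tables relies on the fixed AK/DATE/REGION/SERVICE/aws4_request
// shape: slashes only delimit components, so "/s3tables/" can match only the
// SERVICE component. An access key that merely contains "s3tables" does not
// match, because the key is not preceded by a slash.
func scopeIsS3Tables(credential string) bool {
	return strings.Contains(credential, "/s3tables/")
}

func main() {
	auth := "AWS4-HMAC-SHA256 Credential=AKIDEXAMPLE/20260519/us-east-1/s3tables/aws4_request, SignedHeaders=host, Signature=abc"
	fmt.Println(scopeIsS3Tables(credentialFromAuthHeader(auth)))
}
```

For a presigned URL the same scopeIsS3Tables check would apply to the X-Amz-Credential query parameter.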

* s3api: drop content-type fallback; sign s3 tables harness traffic instead

The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.

Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.

Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
2026-05-19 14:24:25 -07:00
Chris Lu
d57de6dc20 fix(s3): keep anonymous access working with EnableIam default (fixes #9557) (#9567)
fix(s3): keep anonymous access working with EnableIam default

`docker run seaweedfs` (and `weed mini` with no config) start with
EnableIam=true but no IAM config file and no identities. The advanced-IAM
init path was failing in 4.25 because of the missing STS signing key,
which masked a latent bug: SetIAMIntegration unconditionally flipped
isAuthEnabled to true, and isEnabled() also treated a non-nil
iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the
STS fallback started succeeding, the integration got installed end to
end, and every anonymous S3 request bounced as AccessDenied.

Separate the two concerns: SetIAMIntegration just plumbs in the OIDC /
embedded-IAM machinery, and a new EnableAuthEnforcement opts in to
enforcement. The startup path calls it only when -s3.iam.config is
actually provided, so operators with explicit IAM configs still get auth
(preserves #7726). isEnabled() now reads isAuthEnabled only.
2026-05-19 13:03:30 -07:00
Chris Lu
c61d227613 s3api: verify source permission on CopyObject and UploadPartCopy (#9555)
* s3api: verify source permission on CopyObject and UploadPartCopy

The Auth middleware only authorized the destination because routes key on
the request URL. The source from X-Amz-Copy-Source was never evaluated,
so an STS session token scoped to one prefix could copy from any other
prefix in the same bucket.

Add AuthorizeCopySource on IdentityAccessManagement to run the full
bucket-policy + IAM/identity flow against the source, using a synthetic
GetObject request so action resolution lands on s3:GetObject (or
s3:GetObjectVersion when a source versionId is supplied). Both
CopyObjectHandler and CopyObjectPartHandler now invoke it before reading
the source.

* s3api: preserve presigned-URL session token on copy-source check

Presigned CopyObject / UploadPartCopy requests carry the STS session
token in the query string (X-Amz-Security-Token), not in a header.
Rebuilding the synthetic source URL from scratch dropped that token, so
the source authorization would fall through to non-STS paths and miss
session policy enforcement. Forward X-Amz-Security-Token from the
original query (alongside versionId), still excluding unrelated params
like uploadId/partNumber that would steer ResolveS3Action away from
s3:GetObject.
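
A sketch of the query rebuild for the synthetic source request, assuming a hypothetical copySourceRawQuery helper. It keeps only the source versionId and the session token and drops everything else from the original query.

```go
package main

import (
	"fmt"
	"net/url"
)

// copySourceRawQuery builds the query string for the synthetic GetObject
// request: the source versionId (if any) plus X-Amz-Security-Token forwarded
// from the original query. Parameters such as uploadId and partNumber are
// deliberately not copied.
func copySourceRawQuery(srcVersionID, requestRawQuery string) (string, error) {
	orig, err := url.ParseQuery(requestRawQuery)
	if err != nil {
		return "", err
	}
	q := url.Values{}
	if srcVersionID != "" {
		q.Set("versionId", srcVersionID)
	}
	if tok := orig.Get("X-Amz-Security-Token"); tok != "" {
		q.Set("X-Amz-Security-Token", tok)
	}
	return q.Encode(), nil // Encode sorts by key
}

func main() {
	q, err := copySourceRawQuery("v1", "uploadId=u1&partNumber=2&X-Amz-Security-Token=tok")
	fmt.Println(q, err)
}
```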
2026-05-18 21:35:53 -07:00
Chris Lu
58c3fa802c fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540)
When s3.domainName is set, all bucket-prefix routes were gated on a
matching Host header. Requests that arrive via an IP, an unlisted
hostname, or a reverse proxy that rewrites Host hit no router and bounce
back as 405/404 (and 503 once a proxy maps the upstream error).

Register the path-style catch-all unconditionally, after the
host-specific routers, so it only fires when no Host matcher applies.
2026-05-18 19:44:19 -07:00
Chris Lu
6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().
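
The error-forwarding pattern can be sketched like this. runUntilDone and the serve callback are hypothetical; in the real code the callback would be the admin server's ListenAndServe call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errBind simulates a listener that fails to bind.
var errBind = errors.New("bind: address already in use")

// runUntilDone starts serve in a goroutine and returns as soon as either the
// serve call fails or the context is cancelled. The channel is buffered so the
// goroutine can always deliver its result, even if nobody is receiving.
func runUntilDone(ctx context.Context, serve func() error) error {
	errCh := make(chan error, 1)
	go func() { errCh <- serve() }()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoBindFailure shows that the caller is no longer stuck on ctx.Done().
func demoBindFailure() error {
	return runUntilDone(context.Background(), func() error { return errBind })
}

func main() {
	fmt.Println("serve error:", demoBindFailure())
}
```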

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
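
The widened pattern can be checked with a short sketch. The CI probe uses grep; Go's regexp package is used here only to show which lines match, and the combined pattern is an assumption about how the alternation was written.

```go
package main

import (
	"fmt"
	"regexp"
)

// readyLine combines the old pattern with the new S3\s+ready alternative.
var readyLine = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)

// isReady reports whether a log line signals that the S3 gateway is up.
func isReady(line string) bool {
	return readyLine.MatchString(line)
}

func main() {
	for _, line := range []string{
		"S3 service is ready at http://127.0.0.1:8333",
		"  S3           ready (2.1s)",
		"  S3           starting",
	} {
		fmt.Printf("%q -> %v\n", line, isReady(line))
	}
}
```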
2026-05-17 19:13:09 -07:00
Konstantin Lebedev
7d1b16fbcd fix: ListBucketsHandler for pathStyleDomains (#9510)
2026-05-15 13:12:55 -07:00
Chris Lu
e9bcb8f4ad docs(s3/lifecycle): refresh DESIGN.md as-built (#9491)
* docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages

DESIGN.md was written as a phased implementation plan ("Phase 2 will
ship X, Phase 4 will ship Y"). All phases are now merged, plus the
post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially
changed the worker model (single subscription, walker throttle,
observability gauges). Rewrite the doc in present tense describing
what's actually there.

Net changes vs the prior plan-style doc:
- Algorithm pseudo-code reflects the single-subscription fan-out plus
  walkedThisPass within-pass guard.
- Walker invocation table replaces the implicit "two distinct calls"
  prose with three call sites (recovery / steady-state / empty-replay)
  and their throttle gates.
- New section on the subscription model (one Reader, ShardPredicate,
  fan-out by ev.ShardID).
- New section on cursor.LastWalkedNs and the WalkerInterval throttle.
- Observability section: gauges, heartbeat tokens, what each means.
- "Implementation history" table maps phases to merged PRs.
- "Future work" lists the four optimizations we deferred (long-lived
  subscription, bucket-coordinated walker, per-bucket lag metric,
  filer meta-log retention).

Drop the "Phase N — ..." narrative from the bottom; the PR history
table is the durable artifact now.

Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for
the operator-facing docs. README explains the sync workflow with the
external seaweedfs.wiki.git repo. Five pages:

- Home.md — landing page, supported rule shapes, what the worker does
- Operator-Guide.md — config knobs, when to change each, walker
  interval recommendations by cluster size
- Monitoring.md — Prometheus metric reference + heartbeat token table
  + suggested PromQL alerts
- Troubleshooting.md — stuck cursor, walker stuck, failure outcomes,
  cursor schema for manual inspection
- Architecture.md — high-level overview for newcomers; sits between
  Home.md (operator) and DESIGN.md (developer)

* docs(s3/lifecycle): address PR review feedback on docs

Coderabbit + gemini findings on #9491:

- Monitoring.md: clarify the "matches all dispatched" phrasing; note
  that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value
  (shouldn't appear in healthy systems); filter PromQL alerts to
  ignore zero-valued gauges so fresh-install heartbeats don't trip.
- Operator-Guide.md, Troubleshooting.md: clarify weed shell -master
  format as host:http_port.grpc_port (SeaweedFS ServerAddress).
- Troubleshooting.md: pause the s3_lifecycle job in the admin UI
  before manually editing a cursor file, otherwise the worker's
  save races with the operator's edit.
- Architecture.md, Home.md, Operator-Guide.md, Monitoring.md,
  Troubleshooting.md, DESIGN.md: add language tags (`text`) to
  fenced code blocks for markdownlint MD040 compliance.
- DESIGN.md: standardize on the S3 spec rule names
  (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`,
  `AbortIncompleteMultipartUpload`) and add a one-line note mapping
  them to the engine's `ActionKind*` constants.
- README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the
  sync workflow so the `cp` commands' repo-root-relative paths work
  whether the operator's shell is at the repo root or at
  docs/wiki/s3-lifecycle/.
- Home.md: was lagging the wiki-repo merged version (had the older
  pre-merge content). Re-sync from the wiki repo so source matches.

* docs(s3/lifecycle): remove wiki pages from PR

The wiki pages belong in seaweedfs.wiki.git, not the main repo. The
source-of-truth concern that motivated adding them here is real but
the cost — every code-review touchpoint requires reviewers to load
operator-facing pages too — outweighs it. The wiki pages are already
pushed locally (~/dev/seaweedfs.wiki); they'll publish on the
operator-side workflow.

This PR remains scoped to DESIGN.md (the developer-facing reference
that does belong with the code).

* docs(s3/lifecycle): drop Implementation history section

git log is the durable record of what shipped when; the prose table
duplicates it and goes stale faster than commit metadata.

* docs(s3/lifecycle): soften 'exactly once per run' in Goal

The prior phrasing overstated the guarantee versus the failure model
documented later in the same file. Reword to: 'process due objects
each pass; retryable/blocked outcomes get retried from the cursor on
later runs.' Surfaces the head-of-line-blocking semantics up front so
the rest of the doc reads consistently.

Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those
pages live in the wiki repo, not the main repo.
2026-05-13 17:06:14 -07:00
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
c6582228b8 feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now);
TestRunShard_ColdStartDoesNotDoubleWalk still pins the integration
behavior end-to-end.
2026-05-13 14:09:13 -07:00
Chris Lu
79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. clean and orphan-only categories double-counted when a directory
   had no pointer and at least one orphan: incremented both. Make them
   mutually exclusive so report totals sum to versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
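
A minimal sketch of the per-argument sanitization, assuming a helper named sanitizeHealField (hypothetical; the real helper lives inside versioningHealInfof and friends). Which quote characters count as unsafe is also an assumption.

```go
package main

import (
	"strconv"
	"unicode"
)

// sanitizeHealField passes safe values through unchanged and replaces
// anything that could break the space-separated log format with its
// strconv.Quote form.
func sanitizeHealField(s string) string {
	for _, r := range s {
		if unicode.IsSpace(r) || unicode.IsControl(r) ||
			r == '"' || r == '\'' || r == '\\' {
			return strconv.Quote(s)
		}
	}
	return s
}
```

A key such as `a event=spoof` then appears as one quoted token, so a downstream parser splitting on spaces outside quotes does not see a second event field.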
2026-05-13 10:48:58 -07:00
Chris Lu
f5a4bfb514 fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete

deleteSpecificObjectVersion did two non-atomic filer ops: rm the version
blob, then update the .versions/ pointer. Step 2 failures were silently
logged and the client got 204 OK, so any transient blip (filer timeout,
process restart between RPCs, lock contention) left the .versions/
directory naming a missing file. Subsequent GETs paid the 10-retry
self-heal cost and returned NoSuchKey — surfacing as "Storage not found"
to Veeam, which is what triggered this investigation.

Three changes:

1. Pre-roll the pointer for the singleton / multi-version-deleting-latest
   cases. The pointer is repointed (multi) or cleared (singleton) before
   the blob rm. A failure between leaves a recoverable orphan blob —
   pointer is consistent, GETs succeed or correctly miss without
   entering the stale-pointer self-heal path.

2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion
   with bounded retries (~6.3s worst case). When retries are exhausted
   the function now returns a non-nil error instead of swallowing it.
   The caller logs at Error level and queues the path for the
   reconciler.

3. Background reconciler drains stranded .versions/ pointer-to-missing
   states off the hot path. Bounded in-memory queue with capped retries;
   read-path heal remains as a last-resort safety net.

* fix(s3/versioning): address review on #9460

Four fixes addressing review on PR #9460. All four are correctness;
no behavioural change for the happy path.

1. repointLatestBeforeDeletion: discriminate NotFound from transient
   errors when re-fetching the .versions/ entry. Previously any error
   returned rolled=true,nil — a transient filer hiccup at that point
   would cause the caller to skip the post-delete reconciliation AND
   proceed with the blob rm, producing exactly the dangling pointer
   state the PR aims to prevent. NotFound stays "vacuously consistent"
   (directory already gone); other errors surface so the caller aborts
   before removing the blob.

2. Move the singleton .versions/ teardown out of
   repointLatestBeforeDeletion (where it ran BEFORE the blob rm and
   always failed with "non-empty folder") into deleteSpecificObjectVersion
   AFTER the blob rm. Adds a wasSingleton return value so the caller
   knows when to run the teardown. Without this, every singleton-version
   delete in a versioned bucket leaked an empty .versions/ directory.

3. Wrap the list, getEntry, and mkFile calls inside
   repointLatestBeforeDeletion with retryFilerOp so the pre-roll has
   the same transient-failure resilience as the post-roll path. Without
   retries, a single transient blip causes the caller to fall back to
   the legacy non-atomic flow even when the filer recovers immediately.

4. healVersionsPointer in the reconciler: same NotFound-vs-transient
   discrimination on both the .versions/ getEntry and the latest-file
   presence probe. Previously a transient filer error would silently
   evict the candidate from the queue as "healed", leaving the real
   stranded state until a client read happened to surface it.

Also fixes the gemini-flagged consistency nit: the queued-for-reconciler
error log now uses normalizedObject instead of object so it matches the
queue entry's key.

* fix(s3/versioning): short-circuit terminal errors in retryFilerOp

Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound,
gRPC NotFound, context.Canceled, and context.DeadlineExceeded.
retryFilerOp now bails immediately on a terminal error and returns it
unwrapped, so callers like repointLatestBeforeDeletion.getEntry and
updateLatestVersionAfterDeletion.rm see the raw NotFound instead of
paying the ~6.3 s retry-budget delay AND parsing it out of an
"exhausted N retries" wrapper.

errors.Is and status.Code already walk the %w chain so today's call
sites still work, but the delay was real on the hot DELETE path
whenever a key was genuinely absent. Test added covering all five
terminal-error shapes — each must run the wrapped fn exactly once and
return in under 50 ms.
2026-05-13 10:14:27 -07:00
Chris Lu
3f1eaf9724 fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD

Successful GET/HEAD object requests never produced a fluent audit entry
because those handlers write the response directly (streaming for GET,
WriteHeader for HEAD) and never reach a PostLog call site. The wiki
advertises GET as an audited verb, so the asymmetry surprises operators
who rely on the log for read-access auditing.

Move the safety net into the track() middleware: tag each request with
an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it,
and emit a single fallback entry after the handler returns when nothing
fired. The recorder's status flows into the fallback so the audit row
still reflects 200/206 vs 404 etc. No double logging for handlers that
already emit (write helpers, error paths, bulk delete).

Refs #9463

* fix(s3/audit): defensive nil checks on audit-tracking helpers

Address PR review: guard against nil request and nil *atomic.Bool stored
under the audit-tracking key. The conditions are unreachable today (the
key is private and we only ever store new(atomic.Bool)), but the checks
are free and keep the helpers safe if a future caller misbehaves.

* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)

test(s3/audit): cover track() fallback wiring + cleanup

Adds two unit tests in weed/s3api/stats_test.go that exercise the
audit-tracking flag set up by track(): one verifies the fallback path
fires when a handler writes the response directly (the GET/HEAD object
regression in #9463), the other verifies the flag is set when a handler
emits PostLog itself so the fallback is skipped.

To make the wiring observable without standing up fluent, PostLog now
marks the audit flag before short-circuiting on a nil Logger; production
behavior is unchanged (no logger, no posting) but the flag stays
consistent.

Also drops two stale comments in s3api_object_handlers.go that still
referenced proxyToFiler — that helper was removed when GET/HEAD started
streaming from volume servers directly.

Stacks on #9467.
2026-05-13 09:24:59 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
2026-05-13 09:24:50 -07:00
Chris Lu
37e505b8fd refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass

Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.

drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.

cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.

Filer load: 16x -> 1x SubscribeMetadata streams per pass.
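
A reduced sketch of the fan-out, assuming a simplified event type. shardEvent and fanOutByShard are hypothetical names; the real code also cancels the Reader at the pass boundary, which this sketch models by simply returning.

```go
package main

// shardEvent is a hypothetical, reduced form of a meta-log event.
type shardEvent struct {
	ShardID int
	TsNs    int64
}

// fanOutByShard routes events from one shared stream to per-shard
// channels and stops at the first event past the pass boundary.
func fanOutByShard(in <-chan shardEvent, out map[int]chan shardEvent, runNowNs int64) {
	defer func() {
		for _, ch := range out {
			close(ch) // lets each shard drain loop finish
		}
	}()
	for ev := range in {
		if ev.TsNs > runNowNs {
			return // meta-log order: everything after is past the boundary too
		}
		if ch, ok := out[ev.ShardID]; ok {
			ch <- ev // blocks until that shard's goroutine reads
		}
	}
}
```

The blocking send is why every shard goroutine has to be live: with unbuffered channels, one idle shard would stall delivery to all the others.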

* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)

The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.

This is the same regression fixed in 6796ab6db for the per-shard
subscription; the same fix now applies at the shared level.
computeGlobalStartTsNs loads every shard's cursor and picks the
minimum, falling back to the cold-start floor only for shards with no
persisted cursor.
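
The floor selection can be sketched as follows. persistedCursor and globalStartTsNs are hypothetical names with a simplified signature; the real computeGlobalStartTsNs loads cursors from the filer.

```go
package main

import "math"

// persistedCursor is a hypothetical, reduced per-shard cursor.
type persistedCursor struct {
	TsNs  int64
	Found bool // false when the shard has no persisted cursor
}

// globalStartTsNs picks the earliest timestamp any shard still needs.
// Shards without a cursor contribute the cold-start floor.
func globalStartTsNs(cursors []persistedCursor, runNowNs, maxTTLNs int64) int64 {
	coldFloor := runNowNs - maxTTLNs
	if len(cursors) == 0 {
		return coldFloor
	}
	start := int64(math.MaxInt64)
	for _, c := range cursors {
		ts := coldFloor
		if c.Found {
			ts = c.TsNs // may sit below the cold-start floor
		}
		if ts < start {
			start = ts
		}
	}
	return start
}
```

With runNow = 1000 and maxTTL = 300 the cold-start floor is 700; a shard whose cursor sits at 500 pulls the shared subscription back to 500, so its pending events are not filtered out.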
2026-05-13 02:13:11 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    snapshots containing only walker-only rules produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside the meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard was invoked with no -refresh, so only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (a helper file for the
scheduler/configload tests, holding helpers that the deleted
bootstrap_test.go used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
2f682303fb fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464)
Follow-up to PR #9459 (merged before this fix landed). The walker
dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and
"NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay
drain (processMatches in run.go via #9462) use "RPC_ERROR" for the
same condition. Aligning so a single Prometheus query covers all
three delete paths.

Folds nil-response under RPC_ERROR rather than a separate label —
operationally it's the same class of failure (server returned no
usable response).
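
A minimal sketch of the label folding described above, assuming a
hypothetical response type and helper name (outcomeLabel); the real
dispatcher's types and outcome values may differ:

```go
package main

import (
	"errors"
	"fmt"
)

// response is a hypothetical stand-in for the LifecycleDelete RPC reply.
type response struct{ Outcome string }

// errTransport is an illustrative transport failure used by the demo.
var errTransport = errors.New("transport failure")

// outcomeLabel maps an RPC result to the metric's outcome label.
// A transport error and a (nil, nil) reply both collapse to "RPC_ERROR",
// so one query on that label covers every failure of this class.
func outcomeLabel(resp *response, err error) string {
	if err != nil || resp == nil {
		return "RPC_ERROR"
	}
	return resp.Outcome
}

func main() {
	fmt.Println(outcomeLabel(nil, errTransport))
	fmt.Println(outcomeLabel(nil, nil))
	fmt.Println(outcomeLabel(&response{Outcome: "BLOCKED"}, nil))
}
```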
2026-05-12 12:38:52 -07:00
Chris Lu
495632730c feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log

Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds metrics
that answer three baseline questions: how long does a shard take, how
many events did it scan, and what did dispatch produce.

  - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
    wall-clock per shard. p95 climbing toward MaxRuntime means the
    shard is brushing its budget.
  - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
    events drainShardEvents processed. Pairs with the duration so a
    spike in events-per-shard correlates with a slow shard.
  - S3LifecycleDispatchCounter (existing, reused): processMatches now
    increments this with the outcome label, so streaming and
    daily_replay paths share one outcome view. Transport errors are
    counted under outcome="TRANSPORT_ERROR".

dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.

Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.

* fix(s3/lifecycle): align dispatch error label + clean test labels

Two PR-9462 review fixes from gemini:

1. processMatches' transport-failure label was "TRANSPORT_ERROR";
   streaming's dispatcher uses "RPC_ERROR" for the same condition
   (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
   the same Prometheus query covers both delete paths.

2. The dispatch-counter assertion test now deletes its label row
   on exit so the in-process Prometheus registry doesn't accumulate
   per-test state across the suite.
2026-05-12 12:15:20 -07:00
Chris Lu
f954781169 feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config

Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.

Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.

* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker

Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.

Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.

Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.

* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch

Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).

Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.

Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.

* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard

Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.

Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.

Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.
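
The "first bucket error wins" loop can be sketched roughly as below.
This is an illustration only: walkFunc, walkBuckets, and runDemo are
hypothetical names, the per-bucket walk is a stub rather than
bootstrap.Walk, and what is returned when cancellation follows an
earlier error is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// walkFunc is a hypothetical per-bucket walk (a stand-in for bootstrap.Walk).
type walkFunc func(ctx context.Context, bucket string) error

// errFlake is an illustrative per-bucket failure.
var errFlake = errors.New("filer flake")

// walkBuckets visits every bucket. The first error is kept and returned;
// later errors are only logged, so one failing bucket does not stop the rest.
// Cancellation is checked between buckets and ends the loop early.
func walkBuckets(ctx context.Context, buckets []string, walk walkFunc) error {
	var firstErr error
	for _, b := range buckets {
		if err := ctx.Err(); err != nil {
			if firstErr == nil {
				firstErr = err
			}
			break
		}
		if err := walk(ctx, b); err != nil {
			if firstErr == nil {
				firstErr = fmt.Errorf("bucket %s: %w", b, err)
			} else {
				fmt.Printf("bucket %s: %v (logged, not returned)\n", b, err)
			}
		}
	}
	return firstErr
}

// runDemo walks three buckets where "a" and "b" fail and "c" succeeds.
func runDemo() ([]string, error) {
	var visited []string
	err := walkBuckets(context.Background(), []string{"a", "b", "c"},
		func(_ context.Context, bucket string) error {
			visited = append(visited, bucket)
			if bucket != "c" {
				return errFlake
			}
			return nil
		})
	return visited, err
}

func main() {
	visited, err := runDemo()
	fmt.Println(visited, err)
}
```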

* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker

Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.

listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.

Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).

* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc

Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:

  - Path = logical key (not the .versions/<file> physical path), so
    bootstrap.Walk's MatchPath uses the user's intended path.
  - VersionID per sibling (version_id or "null").
  - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
    to explicit-null-bare, falling back to newest-by-mtime.
  - NoncurrentIndex rank computed against the latest's position.
  - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
    previous-newer sibling's mtime (legacy derivation).
  - IsDeleteMarker from ExtDeleteMarkerKey.
  - NumVersions = len(siblings).

Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.

expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.

MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.

Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.

* feat(s3/lifecycle): emit MPU init records from FilerListFunc

Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.

isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.

Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.

* feat(s3/lifecycle): wire walker into executeDailyReplay

Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.

Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.

Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.

* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate

Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow PromotedHash used, so the partition
is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.

With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.

* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment

Two PR-review fixes on 9459:

1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
   add a defensive nil-response check so the walk halts cleanly
   instead. Spotted by coderabbit.

2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
   "stays empty" in fallback mode, which gemini correctly pointed out
   is only true once rules are active. During bootstrap (rules
   compiled but IsActive=false) MaxEffectiveTTL is 0 while
   PromotedHash counts every non-disabled rule, so promoted becomes
   non-empty and the next post-activation run hits the recovery
   branch. That's the intended bootstrap walk — rewrite the comment
   to explain it rather than misstate the invariant.

Test: pins nil-response → error path on WalkerDispatcher.

* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion

Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.

Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.

Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).
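
A minimal sketch of the explicit fallback, assuming siblings are
already sorted newest first. The sibling type, resolveLatest, and the
explicitNullIsLatest parameter are hypothetical simplifications, not
the shape of expandVersionsDir itself:

```go
package main

import "fmt"

// sibling is a hypothetical, reduced version record.
type sibling struct{ VersionID string }

// resolveLatest returns the index of the latest version among siblings,
// which are assumed to be sorted newest first. The caller must handle an
// empty slice.
func resolveLatest(siblings []sibling, latestPointer string, explicitNullIsLatest bool) int {
	latestPos := 0
	pointerResolved := false
	if latestPointer != "" {
		for i, s := range siblings {
			if s.VersionID == latestPointer {
				latestPos, pointerResolved = i, true
				break
			}
		}
	}
	if !pointerResolved {
		// Reached for a missing pointer AND for a stale one, so both cases
		// share the same fallback by construction.
		latestPos = 0 // newest sibling wins
		if explicitNullIsLatest {
			for i, s := range siblings {
				if s.VersionID == "null" {
					latestPos = i
					break
				}
			}
		}
	}
	return latestPos
}

func main() {
	siblings := []sibling{{VersionID: "vnew"}, {VersionID: "vold"}}
	fmt.Println(resolveLatest(siblings, "gone", false)) // stale pointer
	fmt.Println(resolveLatest(siblings, "vold", false)) // pointer wins
}
```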

* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher

Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.

Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
2026-05-12 11:39:15 -07:00
Chris Lu
644664bbee feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457)
* feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a)

Replace the local replay-content-hash / max-effective-TTL helpers in
dailyrun with the engine package's canonical versions (ReplayContentHash,
MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface.

Adds PromotedHash to the cursor's recovery triggers: a partition flip
(rule moving between replay and walk because retention shifted) now
fires the rule-change branch alongside RuleSetHash mismatch. The
retentionWindow is set to MaxEffectiveTTL today, which keeps the
promoted set empty and the trigger dormant; Phase 4b will plumb the
real meta-log retention boundary so true scan_only promotions are
detected.

Cursor schema is unchanged — PromotedHash was already persisted as
the zero hash in Phase 2.

* docs(s3/lifecycle): note the one-time cursor rewind on hash format change

gemini-code-assist flagged that swapping localReplayContentHash for
engine.ReplayContentHash changes the persisted RuleSetHash byte layout
(sort order + tagged-field encoding). Phase-2 cursors mismatch on first
post-upgrade run and drop into the rule-change branch.

Going with option 3 (document the intentional one-time rewind). The
rewind is bounded to runNow - maxTTL (not time-zero), self-healing on
the next save, and daily_replay is off by default so the affected
population is limited to early adopters of the algorithm flag. A
migration shim or a hash-compat layer would carry the legacy encoder
forever for one bounded re-scan; not worth it.

Comment in runShard makes the trade explicit so a future reader doesn't
hunt for the "why does my cursor rewind once after upgrade" mystery.

* chore(s3/lifecycle): trim verbose comments in dailyrun

Cut multi-paragraph headers and narration that just described what the
code does. Kept the small WHY notes (per-match skip vs per-rule, the
one-time post-upgrade cursor rewind, scan_only rejection rationale).
Same behavior, ~150 fewer lines of comment.

* fix(s3/lifecycle): persist PromotedHash on the successful runShard save

The comment-trim pass dropped the field alongside a "stays empty in
Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b
turns promoted into a real value — and a save that writes zero would
make the next run falsely detect drift and rewind. Spotted by
gemini-code-assist on PR 9457.

Other save paths (recovery, drain-error) already persisted it; the
success path is the only one that was missing it. Now consistent.
2026-05-11 21:18:19 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).
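
The worker-side parsing rules above could look roughly like this
sketch. The metadata key names and the limiterParams helper are
hypothetical, and truncating 2 × rate to an integer before clamping is
an assumption about rounding:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// Hypothetical metadata keys; the real constants live in the worker package.
const (
	keyRate  = "example.deletes_per_second"
	keyBurst = "example.deletes_burst"
)

// limiterParams returns (rate, burst, ok). ok is false when the rate is
// missing, unparsable, or non-positive, which means "no limiter".
func limiterParams(md map[string]string) (float64, int, bool) {
	rps, err := strconv.ParseFloat(md[keyRate], 64)
	if err != nil || math.IsNaN(rps) || math.IsInf(rps, 0) || rps <= 0 {
		return 0, 0, false
	}
	burst, err := strconv.Atoi(md[keyBurst])
	if err != nil || burst <= 0 {
		burst = int(2 * rps) // default: 2 x rate
	}
	if burst < 1 {
		burst = 1 // a zero-size bucket would never admit a request
	}
	return rps, burst, true
}

func main() {
	fmt.Println(limiterParams(map[string]string{keyRate: "10"}))
	fmt.Println(limiterParams(map[string]string{keyRate: "0.1"}))
	fmt.Println(limiterParams(nil))
}
```

The resulting pair would typically feed rate.NewLimiter(rate.Limit(rps),
burst) from golang.org/x/time/rate; that call is left out here so the
sketch depends only on the standard library.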

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95: near-zero means the cap
  isn't binding; a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.

Tests swap the SharedEvenly case for three pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4
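
The divisor rule and the three scenarios can be checked with a small
sketch. perWorkerShare is a hypothetical helper name; the decorator in
weed/admin/plugin also skips decoration entirely for rps=0 or zero
executors, which this sketch does not model:

```go
package main

import "fmt"

// perWorkerShare divides the cluster-wide rate by
// min(executors, maxJobsPerDetection), clamped to at least 1.
func perWorkerShare(clusterRPS float64, executors, maxJobsPerDetection int) float64 {
	divisor := executors
	if maxJobsPerDetection < divisor {
		divisor = maxJobsPerDetection
	}
	if divisor < 1 {
		divisor = 1
	}
	return clusterRPS / float64(divisor)
}

func main() {
	fmt.Println(perWorkerShare(100, 4, 1))  // singleton job
	fmt.Println(perWorkerShare(100, 4, 4))  // parallel-limited
	fmt.Println(perWorkerShare(100, 4, 10)) // maxJobs exceeds executors
}
```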

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
3f4cb6d2fb feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)

Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash),
and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot
is a stub keyed on a package-level atomic that the worker startup wiring
populates.

Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.

No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.

API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.

30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split,
RecoveryView activating pre-BootstrapComplete actions,
ReplayContentHash partition-independence, PromotedHash sensitivity to
promotion in either direction, MaxEffectiveTTL aggregation. Build +
race-tests green.

* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics

Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):

1. Duplicated sort comparator between ReplayContentHash and
   PromotedHash. Extract sortHashItems shared helper so the two
   hashes use the same ordering by construction — if one drifted, the
   cursor could see a spurious "rule changed" on a no-op snapshot
   rebuild.

2. Duplicated writeField/writeInt closures. Extract hashWriter struct
   holding the sha256 running hash + lenbuf, with method helpers.
   Same allocation profile (one Hash, one tiny stack buffer per
   helper); just deduplicates ~20 lines.

3. shardID parameter on RulesForShard is unused. Per the design's
   open question, every shard sees every rule today (shard filter
   runs at the entry-iteration site, not view construction). Keep
   the parameter for API stability — removing it now would force
   a breaking change when bucket-shard ownership lands — and update
   the doc comment to explain why it's reserved.
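
Items 1 and 2 can be illustrated with a sketch of a shared sort helper
and a length-prefixed hash writer. The hashItem fields, the sort key,
and the 8-byte big-endian encoding are assumptions for illustration and
are not the engine's actual byte layout:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"hash"
	"sort"
)

// hashItem is a hypothetical reduced view of a compiled action.
type hashItem struct {
	Bucket, RuleHash string
	TTLSeconds       int64
}

// sortHashItems gives every hash the same ordering by construction.
func sortHashItems(items []hashItem) {
	sort.Slice(items, func(i, j int) bool {
		if items[i].Bucket != items[j].Bucket {
			return items[i].Bucket < items[j].Bucket
		}
		return items[i].RuleHash < items[j].RuleHash
	})
}

// hashWriter holds the running hash plus one reusable length buffer.
type hashWriter struct {
	h      hash.Hash
	lenbuf [8]byte
}

func newHashWriter() *hashWriter { return &hashWriter{h: sha256.New()} }

func (w *hashWriter) writeInt(v uint64) {
	binary.BigEndian.PutUint64(w.lenbuf[:], v)
	w.h.Write(w.lenbuf[:])
}

// writeField length-prefixes s so ("ab","c") and ("a","bc") hash differently.
func (w *hashWriter) writeField(s string) {
	w.writeInt(uint64(len(s)))
	w.h.Write([]byte(s))
}

func (w *hashWriter) sum() [32]byte {
	var out [32]byte
	copy(out[:], w.h.Sum(nil))
	return out
}

// contentHash hashes a sorted copy, so input order does not matter.
func contentHash(items []hashItem) [32]byte {
	sorted := append([]hashItem(nil), items...)
	sortHashItems(sorted)
	w := newHashWriter()
	for _, it := range sorted {
		w.writeField(it.Bucket)
		w.writeField(it.RuleHash)
		w.writeInt(uint64(it.TTLSeconds))
	}
	return w.sum()
}

func main() {
	items := []hashItem{{Bucket: "b2", RuleHash: "r2", TTLSeconds: 120}, {Bucket: "b1", RuleHash: "r1", TTLSeconds: 60}}
	fmt.Printf("%x\n", contentHash(items))
}
```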

go build ./... clean; engine test suite green.
2026-05-11 18:07:54 -07:00
Chris Lu
122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)
* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
  under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
  Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
  replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s;
  server outcomes (RETRY_LATER / BLOCKED) halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
  exist, only the hash gates a future addition.
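
The transport-retry bullet above might be sketched as follows. The
helper names are hypothetical, the base delay is illustrative, and
reading "3x" as three attempts in total is an assumption:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTransport is an illustrative transport failure.
var errTransport = errors.New("transport failure")

// retryTransport calls fn up to attempts times, doubling the delay after
// each failure and capping it at maxDelay. It stops early if ctx is done.
func retryTransport(ctx context.Context, attempts int, base, maxDelay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // budget spent: no sleep after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return err
}

// demo runs a call that fails `failures` times before succeeding and
// reports how many times it was invoked. Delays are shortened for the demo;
// the commit message describes a 5s cap.
func demo(failures int) (int, error) {
	calls := 0
	err := retryTransport(context.Background(), 3, time.Millisecond, 5*time.Millisecond, func() error {
		calls++
		if calls <= failures {
			return errTransport
		}
		return nil
	})
	return calls, err
}

func main() {
	fmt.Println(demo(2))
	fmt.Println(demo(5))
}
```

Server outcomes such as RETRY_LATER or BLOCKED would bypass this helper
and halt the run directly, as the bullet describes.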

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
  default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
  current scheduler.Scheduler, daily_replay routes through
  dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
  same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
  4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
  them with one-line redirects; the local versions hash the same
  fields so the cursor stays valid across the swap.

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
  due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
   persisted cursor to lastOK = TsNs of the last event processed,
   including events whose matches were skipped as not-yet-due. Those
   skipped matches would never be re-scanned, so objects under
   long-TTL rules would never expire.

   Track a "stuck" flag in drainShardEvents: the first event with a
   skipped (future-DueTime) match stops cursorAdvanceTo from rising,
   but the loop keeps processing later events to dispatch any that ARE
   due. The persisted cursor sits at the last fully-processed event so
   tomorrow's run re-scans from the skipped event onward and the
   future-due matches get re-evaluated when they age in.

   processMatches now returns (skippedAny, halted, err) so the drain
   loop can tell apart "event fully drained" from "event had pending
   future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
   ActionKind. A replay-eligible kind with Mode != ModeEventDriven
   (e.g. ModeScanOnly via retention promotion) passed the check but
   then got silently ignored by router.Route, which gates dispatch
   on Mode == ModeEventDriven. Reject loudly with the typed error
   so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
   goroutines regardless of cfg.Workers. Add a Workers field to
   dailyrun.Config and gate the goroutine fan-out on a semaphore of
   that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
   snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
   shard. A mid-run Compile would let shards process different rule
   sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
   accepted a `now func() time.Time` and called it multiple times.
   DueTime comparisons would slip as the run wore on. Capture runNow
   once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
   broke out of the loop and returned nil, marking interrupted runs as
   successful. Return ctx.Err() instead so the caller propagates the
   interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
   accepted empty files, mismatched shard_ids, and hash slices shorter
   than 32 bytes (copy() would zero-pad). Each now returns a typed
   error so the run halts and an operator investigates rather than
   silently re-scanning from time zero or persisting a zero-padded
   hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
   guard in runShard was unreachable because drainShardEvents
   initialized lastOK := startTsNs and only ever raised it. Removed;
   the new cursor-advancement semantics handle the
   "no events processed" case implicitly.

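The cursor rule from fix 1 can be sketched as follows. This is a minimal illustration, not the code in drainShardEvents: the `event` struct and the `drain` function are hypothetical, and each event is reduced to a count of due and future-due matches.

```go
package main

import "fmt"

// event is a hypothetical stand-in for one meta-log event.
type event struct {
	tsNs    int64 // event timestamp
	due     int   // matches whose DueTime has passed
	skipped int   // matches whose DueTime is still in the future
}

// drain walks events in timestamp order. It returns how far the persisted
// cursor may advance and how many due matches were dispatched.
func drain(startTsNs int64, events []event) (cursorAdvanceTo int64, dispatched int) {
	cursorAdvanceTo = startTsNs
	stuck := false
	for _, ev := range events {
		dispatched += ev.due // due work is dispatched even after the cursor is stuck
		if ev.skipped > 0 {
			stuck = true // this event must be re-scanned on a later run
		}
		if !stuck {
			cursorAdvanceTo = ev.tsNs
		}
	}
	return cursorAdvanceTo, dispatched
}

func main() {
	events := []event{
		{tsNs: 10, due: 1},
		{tsNs: 20, due: 1, skipped: 1},
		{tsNs: 30, due: 2},
	}
	cursor, dispatched := drain(5, events)
	fmt.Println(cursor, dispatched) // prints "10 4"
}
```

The cursor stops at 10 because the event at 20 still has a future-due match, while the due matches at 20 and 30 are dispatched anyway.
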
Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
  actions with Days <= 0 don't compile to a CompiledAction (see
  weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
  empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
  gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
  all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
  the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
  _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
  the new validation rules.
- Existing process-matches tests reshaped for the
  (skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.
2026-05-11 18:07:17 -07:00
Chris Lu
46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.
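
A minimal sketch of the read-with-fallback described above. The key name comes from this message; the decimal-string encoding, the `noncurrentSinceNs` helper, and the treatment of malformed values as missing are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// Key name taken from the commit message; the decimal-string encoding
// of the value is an assumption.
const noncurrentSinceKey = "Seaweed-X-Amz-Noncurrent-Since-Ns"

// noncurrentSinceNs returns the stamped demotion time when it is present
// and positive, otherwise the legacy sibling-mtime value.
func noncurrentSinceNs(extended map[string][]byte, siblingMtimeNs int64) int64 {
	raw, ok := extended[noncurrentSinceKey]
	if !ok {
		return siblingMtimeNs // pre-stamp entry
	}
	ns, err := strconv.ParseInt(string(raw), 10, 64)
	if err != nil || ns <= 0 {
		// Zero falls back per the message; treating malformed values the
		// same way is an assumption of this sketch.
		return siblingMtimeNs
	}
	return ns
}

func main() {
	stamped := map[string][]byte{noncurrentSinceKey: []byte("1700000000000000000")}
	fmt.Println(noncurrentSinceNs(stamped, 42)) // stamp wins
	fmt.Println(noncurrentSinceNs(nil, 42))     // fallback to sibling mtime
}
```
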

Prerequisite for the daily-replay lifecycle worker (Phase 2+).

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so the
  second of two back-to-back calls can return a smaller value if the
  wall clock steps backward. The prior test was exercising OS clock
  behavior rather than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clearest failure case: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00
Chris Lu
9a70bbfcc6 feat(s3api): full-chunk gzip pass-through skips volume-side decompress (#9427)
Building on the io.Pipe streaming chunk copy: when a copy operation
covers an entire source chunk (the common case for Harbor's
part-size = chunk-size assemble pattern), ask the source volume for
compressed bytes via Accept-Encoding: gzip and forward them to the
destination as-is.

This trades a Range fetch (where the volume decompresses the chunk
internally to satisfy the byte range) for a full-chunk fetch that
returns whatever wire bytes the chunk is stored as. For gzipped
chunks the source volume avoids the decompression entirely; we never
allocate a chunk-sized decompress buffer.

Implementation: build the source GET directly instead of going
through ReadUrlAsStream, because that helper auto-decompresses gzip
responses (which would defeat the point). Trust the response's
Content-Encoding header over caller hints — for partial ranges the
volume always returns raw bytes regardless of how the chunk is
stored, so labeling those as gzip would corrupt subsequent reads.
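
The two halves of that decision can be sketched with the standard library alone. `newFullChunkRequest` and `bodyIsGzip` are hypothetical names, not the helpers in this change. The sketch relies on documented net/http behavior: when the caller sets Accept-Encoding itself, the transport does not transparently decompress the response.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newFullChunkRequest builds a GET for a whole chunk. Setting
// Accept-Encoding explicitly keeps net/http from decoding a gzip
// response, so the stored bytes arrive untouched.
func newFullChunkRequest(url string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept-Encoding", "gzip")
	return req, nil
}

// bodyIsGzip decides how to label the forwarded bytes from the response's
// Content-Encoding value only, never from a caller-side hint.
func bodyIsGzip(contentEncoding string) bool {
	return strings.EqualFold(strings.TrimSpace(contentEncoding), "gzip")
}

func main() {
	req, err := newFullChunkRequest("http://volume.example/chunk")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Header.Get("Accept-Encoding"))
	fmt.Println(bodyIsGzip("gzip"), bodyIsGzip(""))
}
```
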

End-to-end repro impact (512 MiB src, 6 parallel UploadPartCopy):
  + #9420/#9421/#9422       : 2236 MiB
  + io.Pipe streaming       : 1521 MiB
  + this commit             : 1149 MiB  (round 2 RSS, perfectly flat)

Round 3 now completes (was hitting volume-full before, since
chunks took up uncompressed space on disk; we now store the gzipped
chunks the volume gives us, which fit in the test's 8 GiB volume
budget).

Heap inuse_space (after force GC):
  before all: ~1.5 GiB
  this PR:    266 MiB

Volume-side bytes.Buffer.ReadFrom inuse:
  before:     611 MiB
  streaming:  571 MiB
  this PR:    297 MiB (now in destination-volume parseUpload's
                       size-hint decompression — separate
                       optimization opportunity for a hint header)
2026-05-10 14:55:59 -07:00
Chris Lu
4a04594826 feat(s3api): stream chunk copy via io.Pipe to cut peak working set (#9424)
* fix: cap pool retention so chunk-copy buffers don't hoard memory

Two pool-retention sites kept the runaway-RSS pattern in #6541 visible
even after #9420 and #9421:

* weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into
  sync.Pool regardless of how big it had grown. After a 64 MiB chunk
  upload through volume.PostHandler -> needle.ParseUpload, the pool
  hoarded a 64 MiB byte array per cached entry for the rest of the
  process's lifetime. Cap retention at 4 MiB; oversized buffers are
  dropped so GC can reclaim the backing array.

* weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer
  unset, so operation.upload_content fell back to the package-global
  valyala/bytebufferpool. That pool also retains high-water buffers
  forever, and concurrent UploadPartCopy filled it with one chunk-sized
  buffer per concurrent upload. Provide a fresh per-call bytes.Buffer
  pre-sized to chunk + multipart framing; it's GC'd as soon as the
  upload returns.

Tests:
- weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized
  buffers don't round-trip), the inverse (right-sized buffers do), and
  nil-safety.
- weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption
  and pin that BytesBuffer is always non-nil and pre-sized, and that
  each call gets a distinct buffer.

* feat(s3api): stream chunk copy via io.Pipe to cut peak working set

Final piece for #6541. The buffered chunk-copy path holds two
chunk-sized buffers per copy in flight (download buffer + multipart-
encoded upload buffer). Under concurrent UploadPartCopy that put a
floor on RSS at concurrency × 2 × chunk_size — about 768 MiB for the
6-way / 64 MiB Harbor-style assemble repro, even after the previous
pool/retention fixes.

Replace the buffered path with an io.Pipe between the source GET and
the destination POST: ReadUrlAsStream pumps data into the pipe via a
multipart.Writer, the http.Client reads from the pipe end and POSTs
the body. In-flight per copy is now ~32 KiB (pipe hand-off + http
buffers), regardless of chunk size.

The streaming path is gated by canStreamCopyChunk: only used when no
in-transit transformation is needed (no per-chunk CipherKey, no SSE).
SSE-C / SSE-KMS / SSE-S3 paths still go through the buffered path,
which already handles re-encryption correctly.
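
The pipe hand-off can be sketched without any network I/O. In this illustration `streamAsMultipart`, `readBack`, and `roundTrip` are hypothetical names; an in-memory reader stands in for the source GET and a multipart parser stands in for the destination POST.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
	"mime/multipart"
)

// streamAsMultipart frames src as a multipart form file and exposes it
// through an io.Pipe. The producer blocks until the consumer reads, so no
// chunk-sized staging buffer is built for the encoded body.
func streamAsMultipart(src io.Reader, filename string) (io.ReadCloser, string) {
	pr, pw := io.Pipe()
	mw := multipart.NewWriter(pw)
	contentType := mw.FormDataContentType()
	go func() {
		part, err := mw.CreateFormFile("file", filename)
		if err == nil {
			_, err = io.Copy(part, src)
		}
		if err == nil {
			err = mw.Close() // writes the closing boundary
		}
		pw.CloseWithError(err) // a nil error becomes a clean EOF for the reader
	}()
	return pr, contentType
}

// readBack plays the receiving side: it parses the single part and checks
// that the closing boundary follows it.
func readBack(body io.Reader, contentType string) ([]byte, error) {
	_, params, err := mime.ParseMediaType(contentType)
	if err != nil {
		return nil, err
	}
	mr := multipart.NewReader(body, params["boundary"])
	part, err := mr.NextPart()
	if err != nil {
		return nil, err
	}
	data, err := io.ReadAll(part)
	if err != nil {
		return nil, err
	}
	if _, err := mr.NextPart(); err != io.EOF {
		return nil, fmt.Errorf("expected closing boundary, got %v", err)
	}
	return data, nil
}

// roundTrip wires the producer to the consumer.
func roundTrip(payload []byte) ([]byte, error) {
	body, contentType := streamAsMultipart(bytes.NewReader(payload), "chunk")
	defer body.Close() // unblocks the producer if the consumer bails out early
	return readBack(body, contentType)
}

func main() {
	out, err := roundTrip([]byte("chunk bytes"))
	fmt.Println(string(out), err)
}
```

In a real copy the read end would be handed to the HTTP client as the request body, with the returned content type as the Content-Type header.
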

Benchmarks (Apple M4, httptest source/dest, B/op = bytes per copy):

  Buffered  1 MiB:   6.0 MB B/op,  443 MB/s
  Streamed  1 MiB:   374 KB B/op,  727 MB/s
  Buffered  8 MiB:    56 MB B/op,  559 MB/s
  Streamed  8 MiB:   379 KB B/op, 1138 MB/s
  Buffered 64 MiB:   455 MB B/op,  718 MB/s
  Streamed 64 MiB:   304 KB B/op, 1387 MB/s

End-to-end repro (512 MiB src, 6 parallel UploadPartCopy):
  pre-#9420 RSS round 2: 3134 MiB
  + #9420/#9421/#9422  : 2236 MiB
  + this PR            : 1521 MiB
  heap inuse_space     :  350 MiB (was 1422 / 1187 MiB)
  HeapSys (MemStats)   : 1.74 GiB (was 2.49 GiB)

* review: surface shouldRetry, add int32 guard, drop redundant drains

Address review on PR 9424:

* coderabbit (HIGH, line 122): ReadUrlAsStream can set shouldRetry=true
  with readErr=nil. Before this fix, that fell through to mw.Close()
  and the destination POST succeeded against a possibly-truncated
  multipart body. Mirror downloadChunkData's explicit check and
  surface shouldRetry as a producer error so the dst POST aborts.
* gemini (line 98): chunk size is int64 but ReadUrlAsStream takes int.
  Reject sizes above MaxInt32 up front so the int(size) cast can't
  wrap to a negative value on 32-bit platforms. downloadChunkData
  uses the same guard.
* gemini (line 151): util_http.CloseResponse already drains the body
  (io.Copy(io.Discard, ...) inside the helper) before closing, so the
  manual io.Copy drains we added are redundant. Drop them.

* review: cancel source GET when destination POST fails

Address coderabbit review (line 165 / second pass on PR 9424): when
the POST leg fails or returns an error status, closing pipeReader
only fails the producer's *writes*. ReadUrlAsStream's own read loop
runs under the parent ctx, so it keeps draining the source body in
the background until EOF — wasting source-volume bandwidth and CPU
on a copy that's already failed.

Wrap streamCopyChunkRange in a child context cancelled on return.
ReadUrlAsStream checks ctx.Done() per 256 KiB tick, so the in-flight
read aborts on the next iteration once the function returns. The POST
also moves to streamCtx so the in-flight request can be cancelled the
same way if the producer fails first.

Defer-cancel runs after both legs return, so the success path still
sends EOF cleanly through pipeWriter.Close before cancellation.
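
A small model of the cancellation pattern, under stated assumptions. `readSource` stands in for the source read loop: it checks the context once per tick and, mirroring the behavior the review describes, does not stop on a failed write. `streamCopy`, `producerResult`, and `demo` are hypothetical names, and the simulated POST leg reads one tick and then fails.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

type producerResult struct {
	ticksRead int
	err       error
}

// readSource checks ctx once per tick. Write failures are ignored on
// purpose to model a read loop that only the context can stop.
func readSource(ctx context.Context, totalTicks int, emit func([]byte) error) (int, error) {
	tick := []byte{0}
	for i := 0; i < totalTicks; i++ {
		select {
		case <-ctx.Done():
			return i, ctx.Err()
		default:
		}
		_ = emit(tick)
	}
	return totalTicks, nil
}

// streamCopy runs the producer under a child context that is cancelled
// when the function returns, on success and on failure alike.
func streamCopy(parent context.Context, totalTicks int) (<-chan producerResult, error) {
	streamCtx, cancel := context.WithCancel(parent)
	defer cancel()

	pr, pw := io.Pipe()
	done := make(chan producerResult, 1)
	go func() {
		n, err := readSource(streamCtx, totalTicks, func(b []byte) error {
			_, werr := pw.Write(b)
			return werr
		})
		pw.CloseWithError(err)
		done <- producerResult{ticksRead: n, err: err}
	}()

	// Simulated POST leg: consume one tick, then fail.
	if _, err := io.ReadFull(pr, make([]byte, 1)); err != nil {
		return done, err
	}
	postErr := errors.New("destination POST failed")
	pr.CloseWithError(postErr) // only makes the producer's writes fail
	return done, postErr
}

// demo reports how many ticks the producer read and whether it stopped
// because of cancellation.
func demo(totalTicks int) (ticksRead int, cancelled bool) {
	done, _ := streamCopy(context.Background(), totalTicks)
	res := <-done
	return res.ticksRead, errors.Is(res.err, context.Canceled)
}

func main() {
	const total = 1 << 30
	n, cancelled := demo(total)
	fmt.Println(n < total, cancelled)
}
```

Without the deferred cancel, the producer in this model would spin through every remaining tick even though the consumer has already gone away.
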
2026-05-10 14:29:39 -07:00
Chris Lu
d8bbc1d855 fix: cap pool retention so chunk-copy buffers don't hoard memory (#9422)
Two pool-retention sites kept the runaway-RSS pattern in #6541 visible
even after #9420 and #9421:

* weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into
  sync.Pool regardless of how big it had grown. After a 64 MiB chunk
  upload through volume.PostHandler -> needle.ParseUpload, the pool
  hoarded a 64 MiB byte array per cached entry for the rest of the
  process's lifetime. Cap retention at 4 MiB; oversized buffers are
  dropped so GC can reclaim the backing array.

* weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer
  unset, so operation.upload_content fell back to the package-global
  valyala/bytebufferpool. That pool also retains high-water buffers
  forever, and concurrent UploadPartCopy filled it with one chunk-sized
  buffer per concurrent upload. Provide a fresh per-call bytes.Buffer
  pre-sized to chunk + multipart framing; it's GC'd as soon as the
  upload returns.
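
The retention cap from the first bullet can be sketched like this. It is an illustration rather than the code in weed/util/buffer_pool: `getBuffer`, `putBuffer`, and the returned bool are hypothetical, and only the 4 MiB threshold comes from this message.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxRetainedCap is the 4 MiB retention cap named above.
const maxRetainedCap = 4 << 20

var pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

func getBuffer() *bytes.Buffer { return pool.Get().(*bytes.Buffer) }

// putBuffer returns the buffer to the pool unless it is nil or has grown
// past the cap. It reports whether the buffer was retained; dropped
// buffers are left for the garbage collector.
func putBuffer(b *bytes.Buffer) bool {
	if b == nil || b.Cap() > maxRetainedCap {
		return false
	}
	b.Reset()
	pool.Put(b)
	return true
}

func main() {
	small := getBuffer()
	small.WriteString("header")
	fmt.Println(putBuffer(small)) // true: right-sized buffers are reused

	big := getBuffer()
	big.Grow(64 << 20) // simulate a 64 MiB chunk upload
	fmt.Println(putBuffer(big)) // false: oversized buffers are dropped
}
```
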

Tests:
- weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized
  buffers don't round-trip), the inverse (right-sized buffers do), and
  nil-safety.
- weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption
  and pin that BytesBuffer is always non-nil and pre-sized, and that
  each call gets a distinct buffer.
2026-05-10 13:34:25 -07:00
Chris Lu
926a8e9351 fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup (#9420)
* fix(s3api): cap copy-chunk receive buffer to avoid append-grow blowup

downloadChunkData accumulated the streamed chunk into a nil []byte via
`chunkData = append(chunkData, data...)`. ReadUrlAsStream pumps in 256 KiB
ticks, so a 64 MiB chunk grew the slice geometrically (256K → 512K →
1M → ... → 64M), allocating ~2x the chunk size for every transferred
byte. Combined with the 4-way per-request concurrency and any number of
in-flight UploadPartCopy calls (Harbor multipart assemble), this is what
produces the runaway-RSS pattern reported in #6541.

Pre-size the receive buffer to the known sizeInt so the callback fills
in place. Add a regression test that downloads a 16 MiB chunk through
httptest and asserts TotalAlloc stays under 1.5x the chunk size — the
pre-fix code allocates ~5x and trips the bound.
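
The difference between the two receive strategies can be sketched as below. The `stream` helper is a hypothetical stand-in for the 256 KiB callback ticks, and `receiveAppend` and `receivePresized` are illustrative names; the sketch counts backing-array reallocations rather than measuring TotalAlloc.

```go
package main

import "fmt"

const tickSize = 256 << 10 // 256 KiB per callback, as described above

// stream calls fn once per tick until total bytes have been delivered.
func stream(total int, fn func(data []byte)) {
	tick := make([]byte, tickSize)
	for sent := 0; sent < total; sent += len(tick) {
		if rest := total - sent; rest < len(tick) {
			tick = tick[:rest]
		}
		fn(tick)
	}
}

// receiveAppend grows a nil slice tick by tick and counts how often the
// backing array had to be reallocated.
func receiveAppend(total int) (data []byte, regrowths int) {
	stream(total, func(b []byte) {
		before := cap(data)
		data = append(data, b...)
		if cap(data) != before {
			regrowths++
		}
	})
	return data, regrowths
}

// receivePresized allocates the known size once and copies in place.
func receivePresized(total int) (data []byte, regrowths int) {
	data = make([]byte, total)
	off := 0
	stream(total, func(b []byte) {
		off += copy(data[off:], b)
	})
	return data[:off], 0
}

func main() {
	a, grew := receiveAppend(4 << 20)
	p, none := receivePresized(4 << 20)
	fmt.Println(len(a), grew > 0)
	fmt.Println(len(p), none)
}
```
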

Local repro (weed 4.23, 6 parallel UploadPartCopy on a 512 MiB source):

  before:  baseline 96 MiB → peak 3124 MiB, never reclaimed
  pprof:   650 MiB inuse in bytes.growSlice + 461 MiB in
           downloadChunkData.func1

* test(s3api): assert downloaded chunk content matches payload

Address PR review feedback: the allocation-bound check alone would still
pass if a future regression silently truncated or corrupted the chunk.
Compare the returned bytes against the source payload (after the
TotalAlloc measurement window so bytes.Equal doesn't pollute it).
2026-05-10 12:08:06 -07:00
Chris Lu
82648cca53 test(s3/lifecycle/engine): pin delay-group dedup across buckets (#9418)
Compile a 100-bucket × 5-rule snapshot where the five Days values
include duplicates (1, 1, 7, 7, 30) and assert:

- snap.actions has 500 entries — every (bucket, rule) compiles to its
  own ActionKey, no collapse.
- snap.originalDelayGroups has exactly 3 entries — the routing index
  is keyed by Delay, so same-day rules across all buckets share a
  group. This is the property that lets the dispatcher index by
  delay group rather than per-rule.
- Per-group key count = (rules with that day) × buckets, so every
  action is reachable from its group entry.
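
The counting argument can be reproduced in a few lines. The types here are hypothetical simplifications: `actionKey` stands in for ActionKey and a map keyed by day count stands in for the delay-group index.

```go
package main

import "fmt"

type actionKey struct {
	bucket string
	rule   int
}

// compile builds one action per (bucket, rule) and an index that groups
// action keys by their delay in days.
func compile(buckets int, ruleDays []int) (actions map[actionKey]int, delayGroups map[int][]actionKey) {
	actions = make(map[actionKey]int)
	delayGroups = make(map[int][]actionKey)
	for b := 0; b < buckets; b++ {
		for r, days := range ruleDays {
			key := actionKey{bucket: fmt.Sprintf("bucket-%03d", b), rule: r}
			actions[key] = days
			delayGroups[days] = append(delayGroups[days], key)
		}
	}
	return actions, delayGroups
}

func main() {
	actions, groups := compile(100, []int{1, 1, 7, 7, 30})
	fmt.Println(len(actions), len(groups), len(groups[1]), len(groups[7]), len(groups[30]))
	// prints "500 3 200 200 100"
}
```
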
2026-05-10 10:36:54 -07:00
Chris Lu
1b1d4aa814 refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate (#9417)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.
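
The rank-plus-floor rule this test pins can be sketched as a pure function. `version` and `expired` are hypothetical; the input is assumed to contain only noncurrent versions, ordered newest first.

```go
package main

import "fmt"

type version struct {
	id             string
	noncurrentDays int // days since the version became noncurrent
}

// expired returns the ids to expire. The keep newest noncurrent versions
// survive by rank; the rest expire once they meet the NoncurrentDays floor.
func expired(noncurrent []version, keep, minDays int) []string {
	var out []string
	for rank, v := range noncurrent {
		if rank < keep {
			continue // protected by NewerNoncurrentVersions
		}
		if v.noncurrentDays < minDays {
			continue // not yet past the NoncurrentDays floor
		}
		out = append(out, v.id)
	}
	return out
}

func main() {
	// v4 is current and therefore absent; v3 is the newest noncurrent version.
	noncurrent := []version{{"v3", 30}, {"v2", 30}, {"v1", 30}}
	fmt.Println(expired(noncurrent, 1, 1)) // prints "[v2 v1]"
}
```
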

Helper added: putNewerNoncurrentLifecycle.

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) is still addressable directly. Suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.
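
The boundary being pinned, written out as a sketch. `sizeFilterAllows` is a hypothetical name that mirrors only the reject condition quoted above.

```go
package main

import "fmt"

// sizeFilterAllows rejects objects whose size is less than or equal to
// the threshold, so only strictly larger objects match the rule.
func sizeFilterAllows(size, greaterThan int64) bool {
	if size <= greaterThan {
		return false
	}
	return true
}

func main() {
	fmt.Println(sizeFilterAllows(100, 100)) // false: boundary object survives
	fmt.Println(sizeFilterAllows(150, 100)) // true: larger object expires
}
```
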

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Three pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions, citing the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind,
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.
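
A sketch of the resulting active-flag rule, under the assumption that
the flag depends only on the mode and a bootstrap-complete bit. Mode,
ModeDisabled and isActive are hypothetical stand-ins, not the actual
Compile code:

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the action's dispatch mode.
type Mode int

const (
	ModeEventDriven Mode = iota
	ModeScanAtDate
	ModeDisabled // hypothetical mode that is never active
)

// isActive reports whether an action should be routed. ScanAtDate has no
// bootstrap-completion rendezvous, so it is active unconditionally;
// EventDriven still requires the bootstrap to have completed.
func isActive(mode Mode, bootstrapComplete bool) bool {
	switch mode {
	case ModeScanAtDate:
		return true
	case ModeEventDriven:
		return bootstrapComplete
	default:
		return false
	}
}

func main() {
	fmt.Println(isActive(ModeScanAtDate, false))  // active without bootstrap
	fmt.Println(isActive(ModeEventDriven, false)) // waits for bootstrap
	fmt.Println(isActive(ModeEventDriven, true))
}
```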

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so the router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

A generic err != nil check also passes on transport, auth, and timeout
errors, letting the test go green without proving the lifecycle action
actually fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust regardless of what ListObjectVersions returned.

* refactor(s3/lifecycle): extract entryUsesMetadataOnlyDelete predicate

The metadata-only delete decision (entry.Attributes.TtlSec > 0) was
inlined in lifecycleDispatch with no direct test. Lift it into a
named predicate with the rationale comment moved onto the function
and pin the four edge cases: nil entry, nil attributes, TtlSec=0,
TtlSec>0, plus a defensive check that TtlSec<0 doesn't flip the
path on.
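
A minimal sketch of such a predicate and the edge cases it pins. The
entry and attributes types are simplified, hypothetical stand-ins for
the filer entry, and the int32 field type is an assumption:

```go
package main

import "fmt"

// attributes and entry are hypothetical, trimmed-down stand-ins.
type attributes struct {
	TtlSec int32
}

type entry struct {
	Attributes *attributes
}

// entryUsesMetadataOnlyDelete reports whether the entry carries a positive
// TTL. Nil entries, nil attributes, zero and negative TTLs all yield false.
func entryUsesMetadataOnlyDelete(e *entry) bool {
	return e != nil && e.Attributes != nil && e.Attributes.TtlSec > 0
}

func main() {
	cases := []struct {
		name string
		e    *entry
	}{
		{"nil entry", nil},
		{"nil attributes", &entry{}},
		{"TtlSec=0", &entry{Attributes: &attributes{TtlSec: 0}}},
		{"TtlSec>0", &entry{Attributes: &attributes{TtlSec: 60}}},
		{"TtlSec<0", &entry{Attributes: &attributes{TtlSec: -1}}},
	}
	for _, c := range cases {
		fmt.Printf("%s: %v\n", c.name, entryUsesMetadataOnlyDelete(c.e))
	}
}
```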
2026-05-10 09:39:05 -07:00