1190 Commits

Author SHA1 Message Date
Chris Lu
16717b0bf4 fix(s3): authenticate JWT unsigned-streaming uploads (#9729)
A bearer-token client whose SDK appends a CRC32 trailer sends an
unsigned-streaming PUT (STREAMING-UNSIGNED-PAYLOAD-TRAILER) with no SigV4
signature, so getRequestAuthType classifies it as authTypeStreamingUnsigned.
The auth dispatch ignored the bearer token and fell back to anonymous, and
newChunkedReader tried to verify the bearer token as a SigV4 seed signature
and failed, so the body could not be decoded either.

Dispatch the streaming-unsigned auth on whatever credential is present
(SigV4 / JWT / anonymous), and skip the SigV4 seed-signature recompute for
JWT requests in the chunked reader.
2026-05-28 18:10:24 -07:00
Chris Lu
685571d93f fix(s3): allow anonymous unsigned-streaming PutObject (#9727)
Modern botocore attaches a CRC32 trailer to plain PutObject, turning the
payload into STREAMING-UNSIGNED-PAYLOAD-TRAILER. An anonymous upload then
carries that header but no Authorization, so it was classified as
authTypeStreamingUnsigned and sent straight to SigV4 verification, which
rejected it as AccessDenied while explicit credentials kept working.

Fall back to the anonymous identity when an unsigned-streaming request
carries no signature, mirroring the plain anonymous path. The request
stays classified as unsigned-streaming so the chunked body is still
decoded.
2026-05-28 17:00:41 -07:00
qzhello
5b1098e2ad fix(s3): honor MetadataDirective=REPLACE for system metadata on CopyObject (#9721)
* fix(s3): honor MetadataDirective=REPLACE for system metadata on CopyObject

* fix(s3): match copy metadata keys case-insensitively for legacy data

Legacy / non-S3 write paths (FUSE mount, direct filer HTTP API, older
versions) may persist Cache-Control etc. in lowercase form. Make
isManagedCopyMetadataKey case-insensitive so mergeCopyMetadata still
clears stale source values under REPLACE, and let the COPY branch of
processMetadataBytes fall back to a lowercase key on the source so
legacy values survive into the destination (re-emitted as canonical).

Mirrors the existing x-amz-meta-* backward-compat path.

* fix(s3): keep legacy non-canonical tag and system metadata across COPY

The previous case-insensitive isManagedCopyMetadataKey caused
mergeCopyMetadata to delete legacy lowercase x-amz-tagging-* and
mixed-case system headers, but the COPY branch in processMetadataBytes
only matched canonical or strict-lowercase keys when re-populating
them, so any non-canonical key was permanently dropped on COPY.

- COPY now scans existing in a single pass and uses strings.EqualFold
  against the system header whitelist, re-emitting under the canonical
  header name. Handles any case folding (CACHE-CONTROL, Cache-control,
  etc.), not just strings.ToLower.
- COPY tagging branch now uses hasPrefixFold(k, AmzObjectTagging) and
  re-emits the canonical X-Amz-Tagging-<suffix>, mirroring the existing
  X-Amz-Meta-* migration path.
- Tests cover lowercase/uppercase/mixed-case system headers and tags
  surviving COPY.

* fix(s3): make COPY of system metadata and tags deterministic across case variants

Single-pass EqualFold matching let Go's randomized map iteration pick
either the canonical or a legacy-cased value when both lived on the
source, so the COPY result varied between calls.

Both COPY branches now use two passes: a canonical-exact lookup first,
then a case-insensitive fallback that only writes when the canonical
slot is still empty. Mirrors the collision-check pattern used by the
X-Amz-Meta-* migration path.

Tests run the canonical-vs-legacy collision 32 times each to exercise
varied map orders.
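
A minimal sketch of the two-pass lookup, assuming metadata lives in a `map[string][]byte`. The helper name `lookupCanonical` is hypothetical, and the sorted tie-break among several legacy-cased keys is an assumption added here so the sketch itself is deterministic. The commit message only specifies that the canonical key wins.

```go
package main

import (
	"sort"
	"strings"
)

// lookupCanonical returns the value stored under the canonical header name if
// it exists; otherwise it falls back to a case-insensitive match.
func lookupCanonical(existing map[string][]byte, canonical string) ([]byte, bool) {
	// Pass 1: exact canonical key. No map iteration, so the result cannot
	// depend on Go's randomized map order.
	if v, ok := existing[canonical]; ok {
		return v, true
	}
	// Pass 2: case-insensitive fallback, used only when the canonical slot is
	// empty. Candidates are sorted so the pick is stable (sketch assumption).
	var candidates []string
	for k := range existing {
		if strings.EqualFold(k, canonical) {
			candidates = append(candidates, k)
		}
	}
	if len(candidates) == 0 {
		return nil, false
	}
	sort.Strings(candidates)
	return existing[candidates[0]], true
}
```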

* fix(s3): apply REPLACE Content-Type on in-place copy

The metadata-only self-copy path never set Attributes.Mime, so a same-key
CopyObject with REPLACE and a new Content-Type silently kept the old type.
Route in place only when the Mime is unchanged; otherwise take the locked
clone path (still metadata-only, reuses source chunks) and set the new Mime
there. Also covers the versioned self-copy path.

* perf(s3): drop per-key ToLower in isManagedCopyMetadataKey

Use the allocation-free hasPrefixFold helper instead of lowercasing the key
and both constant prefixes on every metadata-key check.
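
One way such a helper could look is sketched below. The body is an assumption, not the repository's implementation; it relies on the prefixes being ASCII (as S3 header prefixes are), so slicing by byte length is safe.

```go
package main

import "strings"

// hasPrefixFold reports whether s starts with prefix, ignoring case.
// It slices s instead of lowercasing it, so no new string is allocated.
func hasPrefixFold(s, prefix string) bool {
	return len(s) >= len(prefix) && strings.EqualFold(s[:len(prefix)], prefix)
}
```

For example, `hasPrefixFold("x-amz-meta-owner", "X-Amz-Meta-")` is true, while a key shorter than the prefix is rejected before any comparison.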

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-28 12:55:08 -07:00
7y-9
bbbc3925ec fix: validate s3 ownership controls rule (#9684)
2026-05-27 14:41:10 -07:00
qzhello
69c84801e4 fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names (#9703)
* fix(s3tables/iceberg): make metadata spec-compliant and accept real-world manifest names

Two related issues prevent SeaweedFS S3 Tables from interoperating with
strict Iceberg clients (Java/Spark/Flink/Trino):

1. iceberg-go v0.5.0 serializes empty TableMetadata state by dropping
   keys via `omitempty` on optional pointer/slice fields. The Iceberg
   table spec, however, requires `current-snapshot-id`, `snapshots`,
   `snapshot-log`, `metadata-log`, and `refs` to be present even when
   empty (`current-snapshot-id` must be -1 for a table with no
   snapshots). Java's TableMetadataParser uses JsonUtil.getLong on
   `current-snapshot-id` and throws "Cannot parse missing long
   current-snapshot-id" against responses produced by this server.

2. The Iceberg layout validator only accepts manifest filenames that
   match Iceberg's internal naming (`{uuid}-m{n}.avro`,
   `snap-{n}-{n}-{uuid}.avro`). Real writers — notably Flink's sink —
   emit manifests like
   `{flink-job-id}-{checkpoint}-{operator-id}-{n}.avro`, which the
   validator rejects with 403, breaking INSERT commits.

Fixes:

* Add ensureMetadataSpecCompliance helper that backfills the five
  spec-required empty-state fields when iceberg-go omits them or emits
  explicit JSON null. Apply it on every code path that writes
  v*.metadata.json to S3 or returns metadata to clients
  (handlers_table create-table, handlers_commit, commit_helpers
  create-on-commit, plus MarshalJSON on LoadTableResult and
  CommitTableResponse). Real values from non-empty tables are never
  overwritten.

* Add catch-all regex entries to metadataFilePatterns accepting any
  *.avro / *.metadata.json filename composed of [A-Za-z0-9._-]. The
  Iceberg spec does not mandate filename format; the strict patterns
  remain for documentation. Metadata-directory subdirectory rejection
  and the data-file path validation are unchanged.

No upstream dependencies are forked: iceberg-go stays at v0.5.0 and
go.mod is untouched. The compliance layer can be removed once upstream
emits spec-compliant output.

Tests (all pass under `go test -race`):
- metadata_compliance_test.go: 5 cases covering missing fields,
  preserved real values, explicit null, invalid JSON, empty input.
- iceberg_layout_test.go: 3 groups (16 subtests) covering real-world
  manifest names from Flink/Spark/Iceberg, security boundary
  (subdirectories, bad extensions), and data-file regression.

* fix(s3tables/iceberg): preserve metadata key order and keep config field stable

Two small follow-ups on the spec-compliance fix:

* ensureMetadataSpecCompliance now splices missing keys in at the byte
  level just before the closing brace, so iceberg-go's struct-declared
  key order survives the backfill. The previous unmarshal/remarshal
  through map[string]json.RawMessage silently alphabetized every key in
  the document, which is spec-legal but breaks byte-equality fixtures
  and any downstream hashing of the persisted metadata. The slower
  remarshal path is kept for the rare explicit-null replacement case.

* LoadTableResult.MarshalJSON now serializes Config without omitempty,
  matching the struct field tag. The custom marshaler had silently
  flipped the tag to ,omitempty, which made the "config" key disappear
  from the response whenever s3Endpoint was unset (since
  buildFileIOConfig returned an empty but non-nil Properties map).
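
The byte-level splice described in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the real ensureMetadataSpecCompliance: the function name `backfillMissing` is hypothetical, the empty values for the list and map fields (`[]` and `{}`) are assumed, and the explicit-null replacement path is deliberately left out.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// sentinels lists the spec-required keys with assumed empty-state values.
var sentinels = []struct{ key, value string }{
	{"current-snapshot-id", "-1"},
	{"snapshots", "[]"},
	{"snapshot-log", "[]"},
	{"metadata-log", "[]"},
	{"refs", "{}"},
}

// backfillMissing appends any missing sentinel keys just before the closing
// brace, leaving every existing byte (and therefore key order) untouched.
func backfillMissing(doc []byte) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(doc, &fields); err != nil {
		return nil, err
	}
	if fields == nil { // top-level JSON null: nothing to splice into
		return doc, nil
	}
	var extra bytes.Buffer
	needComma := len(fields) > 0 // an empty object takes no leading comma
	for _, s := range sentinels {
		if _, ok := fields[s.key]; ok {
			continue
		}
		if needComma {
			extra.WriteByte(',')
		}
		fmt.Fprintf(&extra, "%q:%s", s.key, s.value)
		needComma = true
	}
	if extra.Len() == 0 {
		return doc, nil // no-op path returns the input bytes unchanged
	}
	end := bytes.LastIndexByte(doc, '}')
	out := make([]byte, 0, len(doc)+extra.Len())
	out = append(out, doc[:end]...)
	out = append(out, extra.Bytes()...)
	return append(out, doc[end:]...), nil
}
```

The map is used only to learn which keys exist; the output is built from the original bytes, which is what keeps the key order intact.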

Tests:
- PreservesOriginalKeyOrder pins the byte-level output against
  iceberg-go's emitted shape; would have caught the alphabetization
  regression.
- EmptyObjectBackfilled covers the {} -> sentinels-only case (no
  leading comma).
- AllPresentReturnsSameBytes confirms the no-op path returns input
  bytes unchanged, with whitespace intact.
- iceberg_layout_test pins the catch-all $ anchor: metadata/file.avro.txt
  must still be rejected.

* fix(s3tables/iceberg): guard ensureMetadataSpecCompliance against top-level null

json.Unmarshal of a JSON `null` literal succeeds but leaves the map nil.
The current byte-append path no-ops gracefully on this input, but the
slow remarshal path would panic with "assignment to entry in nil map"
if the input ever combined `null` with the explicit-null detection. Add
an explicit nil-map short-circuit so the safety property is obvious
from the source, and a test that pins the contract.

* test(s3tables/iceberg): assert full byte equality in AllPresentReturnsSameBytes

The prefix check only caught a missing "{\n  " opener, so the test
would have passed even if the function silently reordered keys or
collapsed whitespace later in the document. Switch to a full string
comparison so any future regression in the no-op path is loud.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-27 13:05:41 -07:00
Chris Lu
dd1b428789 s3,iceberg: reject .. in URL path vars (#9687)
* s3,iceberg: reject `..`/NUL in URL path vars

Both gateway routers use mux.NewRouter().SkipClean(true), so a request like
`GET /bucket-A/../evil-bucket/key` survives routing as bucket=bucket-A,
object=../evil-bucket/key. The captured key is then joined into a filer path;
util.JoinPath / path.Join collapse the `..` server-side and the read lands in
evil-bucket. With auth on, IAM still authorizes against bucket-A (the mux var),
so policy is evaluated against the wrong target.

Add a middleware on the S3 bucket subrouter and the Iceberg REST router that
rejects any `.`, `..`, NUL, or — for single-segment slots — embedded slash in
the captured path vars before any handler runs. NormalizeObjectKey already
folds `\` to `/` and decoding happens in mux, so `%2e%2e` and `..\` are caught.

* s3,iceberg: reject empty captured vars and empty namespace parts

Comma-ok the var lookup so we only check captured slots, then treat an empty
captured value as a rejection on its own — downstream path.Join would
otherwise collapse it and let the next segment pick the bucket.

For iceberg, also reject empty parts after splitting the namespace on \x1F so
leading/trailing/consecutive unit separators (which parseNamespace silently
folds out) don't let distinct route values collapse to the same parsed
namespace.

Register loggingMiddleware before validateRequestPath on the iceberg router
so rejected requests still produce an audit-log line.
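
The per-variable check can be sketched as a pure function. The name `validatePathVar` and its `singleSegment` flag are hypothetical; the real middleware reads the values from the mux vars, which this sketch leaves out.

```go
package main

import (
	"errors"
	"strings"
)

var errBadPathVar = errors.New("invalid path variable")

// validatePathVar rejects values that could change meaning once joined into a
// filer path: empty values, NUL bytes, "." or ".." segments, and, for slots
// that must be a single segment (such as a bucket name), any embedded slash.
func validatePathVar(value string, singleSegment bool) error {
	if value == "" || strings.ContainsRune(value, 0) {
		return errBadPathVar
	}
	// Fold backslashes first so `..\x` is treated like `../x`.
	normalized := strings.ReplaceAll(value, `\`, "/")
	if singleSegment && strings.Contains(normalized, "/") {
		return errBadPathVar
	}
	for _, seg := range strings.Split(normalized, "/") {
		if seg == "." || seg == ".." {
			return errBadPathVar
		}
	}
	return nil
}
```

Only whole segments are rejected, so a key such as `file..name` still passes.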
2026-05-26 01:04:59 -07:00
Chris Lu
2a4923e7e8 ObjectTransaction: filer-side forwarding via route_key (#9659)
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.
2026-05-24 14:21:06 -07:00
Chris Lu
1f0c366583 s3: route metadata-only self-copy off the distributed lock (#9638)
A non-versioned metadata-only self-copy (CopyObject with source == destination
and the REPLACE directive) is a read-modify-write of one entry, which is why it
held the distributed lock. It now routes to the owner as a serialized
PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements,
delete the dropped keys) onto a fresh read of the entry under its per-path lock,
so a concurrent change to non-managed keys (legal hold, retention, version id) is
preserved instead of clobbered, and bumps mtime.

PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended
self-copies create a new version (already routed via the copy finalize) and the
no-owner bootstrap keep the lock.
2026-05-24 12:32:57 -07:00
Chris Lu
fa7056dc6f s3: route object-lock version-specific deletes off the distributed lock (#9657)
A version-specific DELETE (real version or the null version, including
object-lock WORM-checked ones and governance-bypass) now runs as one routed
transaction on the object's owner instead of holding the distributed lock.

For a real version: recompute the .versions pointer excluding the version
(repoint-before-delete, so a crash leaves a recoverable orphan rather than a
dangling pointer), then delete the version file, under the object's per-path lock.
The null version is the regular object entry, deleted directly (no pointer).

Object-lock buckets gate the delete on the version's WORM guards evaluated on the
owner: legal hold (always) + retention (while not elapsed). Governance bypass
scopes the retention guard to COMPLIANCE mode, so the filer allows a
governance-mode delete while still denying compliance and legal hold — the
gateway never reads the version.

Three primitives make this expressible:
- ObjectTransaction.condition_key: evaluate the condition against a named entry
  (the version) while the lock stays on lock_key (the object).
- Recompute.exclude_name: omit a child from the scan, to repoint before delete.
- WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a
  mode, expressing governance bypass without a gateway-side read.
2026-05-24 11:41:08 -07:00
Chris Lu
eeda7181aa s3: route multipart-upload completion off the distributed lock (#9632)
completeMultipartUpload routes its writes to the object's owner filer when an
owner is known, off the distributed lock. Idempotent replay is handled
gateway-side in prepareMultipartCompletionState (it returns the existing result
when the object already carries this UploadId), so the lock is not needed to
dedupe retries; with no owner yet, the lock remains as the bootstrap path.

Versioned completion flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST). Non-versioned and suspended completion write the object via
routedMkFile (a routed PUT) so the write serializes with concurrent writes to
the same key on the owner's per-path lock. The version file itself is a unique
path and stays a plain mkFile.
2026-05-24 11:07:39 -07:00
Chris Lu
4b9d46b5ad s3: route versioned COPY and delete-marker off the DLM (#9633)
s3: route versioned/suspended delete markers and versioned COPY off the lock

createDeleteMarker flips the .versions pointer via routedVersionedFinalize
(RECOMPUTE_LATEST on the owner filer) when an owner is known, so an Enabled or
Suspended DeleteObject takes its pointer flip off the distributed lock; the
delete marker file is written first and the owner re-derives the pointer.

DeleteObjectHandler routes a versioned/suspended delete with no specific version
straight to the owner, off the lock. A specific-version delete and object-lock
buckets keep the lock (the former needs a recompute-after-delete handled
separately; the latter needs gateway-side enforcement).

CopyObject into a versioned bucket finalizes the new version through the same
routed pointer flip.
2026-05-24 07:22:27 -07:00
Chris Lu
5bac8b9281 s3: route object-lock object writes off the distributed lock (#9635)
routableWriteOwner no longer excludes object-lock buckets, so a versioned PUT
(which creates a new version, never overwriting a locked one) and a
non-versioned overwrite (WORM-checked gateway-side before dispatch) route to the
owner filer like any other write.

routedObjectOwner still excludes object-lock: an unversioned object-lock delete
enforces WORM under the lock, so it stays there rather than routing past the
check. Version-specific deletes likewise stay on the lock — routing them needs
the WORM check (on the version entry) and the latest-pointer recompute (on the
object) under one transaction, which the current single condition target cannot
express.
2026-05-24 07:20:44 -07:00
Chris Lu
db954b5503 s3: route versioned PutObject finalize off the DLM (#9631)
s3: route versioned PutObject finalize off the distributed lock

A versioned write's finalize (flip the .versions pointer to the newest version,
demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction
on the object's owner filer, under its per-path lock, instead of the unserialized
updateLatestVersionInDirectory. The version file is written first; the owner
re-derives the pointer by scanning the directory.

RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's
size and mtime on the pointer, and demote_key / demote_value to stamp the
displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves.

Falls back to updateLatestVersionInDirectory when no owner is known yet.
2026-05-24 03:10:30 -07:00
Chris Lu
32aa70ab59 s3: serialize bucket config writes with field-level filer patches (#9655)
When PutBucketVersioning and PutBucketEncryption ran concurrently, each did a
whole-entry read-modify-write of the bucket entry, so one could overwrite the
other's field with a stale copy. Each config write is now a field-level
PATCH_EXTENDED (extended attributes) or set_content (the metadata blob)
ObjectTransaction, routed to the bucket's owner filer and merged onto a fresh
read under its per-path lock. Disjoint fields no longer clobber each other.
2026-05-24 02:30:26 -07:00
Chris Lu
f9bc6adf98 s3: route single-entry object writes to the owner filer, off the DLM (#9629)
s3: route non-versioned object PUT and DELETE off the distributed lock

A non-versioned, non-object-lock object write now goes straight to the key's
owner filer as a single-mutation ObjectTransaction, which serializes it with the
owner's per-path lock and evaluates the precondition, instead of taking a
cluster-wide lock. PUT and DELETE use the object's full path as the lock key, so
a concurrent create and delete of the same key serialize against each other.

The fast path is taken only when the precondition reduces to clauses the filer
can evaluate (existence and a single strong-ETag match); time-based conditions,
ETag lists, weak ETags, post-create hooks, and an unknown owner fall back to the
lock. A routed mutation error other than a failed precondition also falls back,
so the lock path stays the authority for the cases it alone covers.

PrimaryForKey returns "" until the ring view arrives, keeping writes on the lock
until routing is known.
2026-05-24 02:10:32 -07:00
Chris Lu
f037fc4dce s3: dial the object lock's primary filer directly (#9626)
* s3: dial the object lock's primary filer directly

The S3 object write lock builds a fresh short-lived lock per write, each
starting at the seed filer. When the seed isn't the key's hash-ring primary
the filer forwards the request to the primary, and in multi-cluster setups
that forward crosses clusters on every write.

Give the lock client a view of the filer lock ring, fed by the master's
LockRingUpdate broadcasts the gateway already receives, so it dials the
primary directly. The view tracks filer membership by version; a stale view
stays correct because the filer still forwards as a fallback.

Also send the initial ring snapshot to S3 clients, not just filers.

* s3: subscribe to lock-ring updates before starting the master loop

The master delivers the initial LockRingUpdate once, on connect. Registering the
callback after KeepConnectedToMaster started left a window where that first
update could arrive before the handler was set and be dropped, delaying the ring
view until the next membership change. Build the lock client and register the
callback in the masters block before launching the loop; the filers block reuses
that client (or creates a plain one when no masters are configured).

* lock_manager: build the hash ring in a deterministic server order

rebuildRing ranged over the server set (a map), whose iteration order is
randomized per process. On a vnode hash collision the last writer into
vnodeToServer wins, so two nodes holding the same server set could resolve the
collision to different servers and disagree on the primary for keys near that
slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement
would route the same key to different filers and defeat per-path serialization.

Iterate the servers in sorted order so the ring is identical on every node with
the same set, regardless of discovery order.
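
A minimal sketch of the idea, not the lock_manager code: `buildRing` and `vnodeHash` are hypothetical names, and the hash function is passed in so a collision can be forced for demonstration.

```go
package main

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// vnodeHash is an example hash for a (server, replica) virtual node.
func vnodeHash(server string, replica int) uint32 {
	h := fnv.New32a()
	h.Write([]byte(server + "#" + strconv.Itoa(replica)))
	return h.Sum32()
}

// buildRing maps vnode hashes to servers. Servers are visited in sorted
// order, so when two vnodes collide the same server wins on every node.
func buildRing(servers map[string]struct{}, vnodes int, hash func(string, int) uint32) map[uint32]string {
	names := make([]string, 0, len(servers))
	for s := range servers {
		names = append(names, s)
	}
	sort.Strings(names) // the deterministic step: never range the map directly
	ring := make(map[uint32]string, len(names)*vnodes)
	for _, s := range names {
		for i := 0; i < vnodes; i++ {
			ring[hash(s, i)] = s // last writer wins, and the order is now fixed
		}
	}
	return ring
}
```

With a hash that always collides, the winner is always the lexicographically last server, regardless of how the set was populated.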

* lock_manager: skip redundant ring rebuilds, trim comments

SetRing now ignores a non-zero version at or below the current one once a ring
exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the
ring.

* s3: hold the lock-ring client on the server for route-by-key

Store the object-write lock client on S3ApiServer so handlers can resolve a
key's owner filer via PrimaryForKey.
2026-05-24 00:40:43 -07:00
Aleksey
917a87928c fix(s3api/list): cancel ListEntries stream in hasChildren (#9617)
* fix(s3api/list): cancel ListEntries stream in hasChildren

* fix(s3api): use filer_pb.List in hasChildren

filer_pb.List already wraps the ListEntries stream in a cancellable
context, so the single-entry probe needs no separate helper or manual
context plumbing to avoid the leaked gRPC stream goroutine.

* fix(s3api): propagate request context into hasChildren

Thread r.Context() through listFilerEntries and hasChildren so the
implicit-directory probe cancels when the client disconnects, instead
of running on context.Background().

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-21 15:48:47 -07:00
Chris Lu
fbdcec1cba fix(s3): list empty directories as directory markers (#9615)
* fix(s3): list empty directories as directory markers

A real but empty directory created out of band (mount, mkdir, filer API)
carries no MIME, so it was hidden from S3 listings. hadoop-aws getFileStatus
probes LIST prefix=dir/&delimiter=/ and reads an empty result as a missing
path, which breaks Spark's eventLog.dir when it points at an empty directory.

Surface such directories as directory markers, matching directories created
via PutObject with a trailing "/". Emptiness comes from the recursion result,
and the marker MIME is set only on the in-memory listing entry, so empty
directories stay eligible for empty-folder cleanup.

* fix(s3): only surface empty directory markers for explicit dir probes

Restrict the empty-directory marker to a trailing-slash prefix probe
(prefix=dir/), the pattern hadoop-aws getFileStatus uses. Plain listings
are left as before, so an empty directory left behind by deleted objects
(e.g. after lifecycle expiration) is no longer shown as a phantom key.
2026-05-21 14:05:16 -07:00
Chris Lu
d82b3a8d6a refactor(s3): drop unused source path in copy ETag check
ETagEntry derives the tag from chunks/Md5/remote-etag, never the entry path,
so the conditional-copy check no longer builds a bogus FullPath.
2026-05-21 09:51:50 -07:00
Chris Lu
83b7ea5e7b fix(s3): keep server-side copy data in the bucket collection (#9607)
* fix(s3): keep server-side copy data in the bucket collection

UploadPartCopy and SSE-C CopyObject assigned destination volumes against
r.URL.Path, the S3 request URI. The filer derives a bucket's collection
only when the assign path sits under its buckets folder, so an S3 URI
routed copied bytes to the default collection instead of the destination
bucket's. Assign against the destination's real filer path.

* refactor(s3): centralize copy-part path and thread dstPath into SSE-C copy

Extract copyPartLocation so the fast path and writeEmptyCopyPart share one
definition of the .uploads/<id>/<n>_copy.part location. Pass the destination
filer path into copyChunksWithSSEC instead of re-deriving it from the request,
and thread it through key rotation so re-encrypt copies also assign in the
destination bucket's collection.
2026-05-21 09:35:42 -07:00
Mmx233
9b9fdb5b76 fix(s3): sync IAM policies to advanced IAM Manager policy engine (#9577)
* fix(s3): sync IAM policies to advanced IAM Manager policy engine

* test(s3): add unit tests for PutPolicy/DeletePolicy IAM Manager sync

* fix(s3): flush loaded policies in SetIAMIntegration, drop extra reload

Sync the policies already loaded from the credential store into the IAM
Manager's engine from SetIAMIntegration itself, instead of re-running a
full LoadS3ApiConfigurationFromCredentialManager after setup. This covers
both startup orderings without a second filer round-trip or racing the
async loader goroutine: if the load won, the policies are in memory to
push; if SetIAMIntegration won, the load's own sync runs afterward.

Move the runtime PutPolicy/DeletePolicy sync out of the iam.m write lock
so the per-request auth RLock path isn't blocked by the policy recompile.

* fix(s3): serialize IAM manager policy resync to avoid stale snapshots

SyncRuntimePolicies replaces the manager's full policy set, so applying a
policy view captured before a later mutation can resurrect a deleted
policy or drop a new one. Funnel every path (PutPolicy, DeletePolicy,
SetIAMIntegration, and the credential-manager load) through a single
resyncIAMManagerPolicies that serializes on a dedicated mutex and reads
iam.policies fresh at apply time, so the live map always wins regardless
of interleaving. The load now installs the config into iam.policies
before resyncing, closing the window where the manager held policies the
map didn't yet have.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-21 00:39:42 -07:00
Chris Lu
c00aa90990 fix(s3/audit): populate requester for GET/HEAD/IAM operations (#9581)
Authentication records the identity with r.WithContext, which returns a
request copy. Handlers that log their own audit entry (PUT, DELETE,
tagging) see it, but GET/HEAD object and IAM operations rely on track()'s
fallback entry, which is built from the original request the auth copy
never reached - so requester came out empty.

Install a mutable identity holder on the request before authentication
and have SetIdentityNameInContext record into it. The holder is shared by
pointer across every request copy, so the fallback entry recovers the
authenticated requester. The per-request context value still takes
precedence, so nothing changes for handlers that see the auth copy.
2026-05-20 10:13:33 -07:00
Chris Lu
cc5ef1b741 feat(s3): add TagUser, UntagUser, ListUserTags IAM actions (#9572)
* feat(s3): add TagUser, UntagUser, ListUserTags IAM actions

Adds AWS IAM-compatible user tag operations on the embedded IAM
endpoint. Tags persist in the Identity proto as a repeated UserTag
field; the existing 50-tag / 128-byte-key / 256-byte-value AWS limits
are enforced. Pagination is stubbed (IsTruncated=false) since the
50-tag cap means all tags fit in a single response.

* review: validate UntagUser TagKeys entries

parseTagKeysParams now rejects empty keys and keys past
MaxUserTagKeyLength; UntagUser additionally requires at least one
TagKeys.member.N entry to match AWS validation behavior.

* review: pre-allocate user-tag merge and filter slices

mergeUserTags now allocates the combined existing+incoming capacity
up front; UntagUser builds the filtered slice via make with the full
ident.Tags capacity instead of ident.Tags[:0:0], which forced a
reallocation on every append.

* review: cover duplicate-in-request and invalid TagKeys cases

Regression tests assert TagUser rejects two members with the same key
in one request, and UntagUser rejects missing/empty/oversized TagKeys
entries.
2026-05-19 17:35:44 -07:00
Chris Lu
37b6a14b0d feat(s3): add four bucket configuration handlers (#9570)
* feat(s3): add four bucket configuration handlers

- GetBucketPolicyStatus: computes IsPublic from the existing bucket policy
- PutBucketRequestPayment: companion writer to the existing GET; accepts
  only BucketOwner
- GetBucketAccelerateConfiguration: returns <Status>Suspended</Status>
- GetBucketLogging: returns an empty BucketLoggingStatus

Lets AWS SDK probes succeed instead of returning MethodNotAllowed.

* review: route GetBucketPolicyStatus through checkBucket

Mirrors the existence/auth gating used by other bucket handlers and
drops the bespoke filer_pb lookup so NoSuchBucket precedence is
consistent across the API surface.

* review: cap PutBucketRequestPayment body with MaxBytesReader

The body is unmarshalled as RequestPaymentConfiguration, which is a
handful of bytes; reject excessively large payloads up front and
defer Close immediately after wrapping.

* review: gate static getters on checkBucket

GetBucketAccelerateConfiguration and GetBucketLogging now run the
standard bucket existence check before returning the static
Suspended / empty-status response so a missing bucket cannot appear
to have valid configuration.

* review: share cache helper across misc tests; check io.ReadAll error

Accelerate and Logging tests now run through newMiscTestServer like
the others so the checkBucket guard sees a cached bucket; the
ReadAll error is explicitly checked.
2026-05-19 17:35:08 -07:00
Chris Lu
cee2bf697c feat(s3): stub bucket configuration list endpoints (#9571)
* feat(s3): stub bucket configuration list endpoints

Adds Get and List handlers for Analytics, Inventory, IntelligentTiering,
and Metrics bucket configurations. List returns an empty result with
IsTruncated=false; single-get returns NoSuchConfiguration so SDK error
parsing remains predictable.

* review: gate stubs on bucket existence

All eight stub handlers now call checkBucket via stubBucketGuard so
NoSuchBucket takes precedence over NoSuchConfiguration / empty-list
responses, matching AWS S3 precedence. Tests provide a cached bucket
so the guard sees it as present.
2026-05-19 17:34:51 -07:00
Chris Lu
285025eb73 s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap

Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.

The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.

Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.

* s3api: support group inline policies + Condition enforcement

PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.

Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.

Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.

The NotImplemented unit test is inverted to drive the new round-trip.

* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy

DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.

* s3api: scope group-policy resource names per test; tighten deny polling

- Integration test resource names get a per-test suffix so retried or
  parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
  rather than any non-nil error; transient setup errors no longer end
  the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
  is nil, matching Put/Get/DeleteGroupPolicy.

* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs

CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
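
The address-family mismatch described above can be reproduced with the standard library alone. This is a minimal sketch, assuming the policy engine's IpAddress condition behaves like a plain CIDR containment check; `allowedBy` is a hypothetical helper, not a SeaweedFS function.

```go
package main

import (
	"fmt"
	"net/netip"
)

// allowedBy reports whether addr falls inside any of the given CIDR prefixes.
// Hypothetical helper for illustration only.
func allowedBy(cidrs []string, addr string) (bool, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return false, err
	}
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return false, err
		}
		// An IPv6 address never matches an IPv4 prefix, and vice versa.
		if p.Contains(ip) {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	v4Only := []string{"127.0.0.0/8"}
	both := []string{"127.0.0.0/8", "::1/128"}
	for _, addr := range []string{"127.0.0.1", "::1"} {
		a, _ := allowedBy(v4Only, addr)
		b, _ := allowedBy(both, addr)
		fmt.Printf("%-9s v4-only=%v both=%v\n", addr, a, b)
	}
}
```

With only 127.0.0.0/8 in the list, a client that connects over ::1 is never matched, which is the hang the commit works around.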
2026-05-19 16:03:45 -07:00
Chris Lu
f72983c1fd fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"

The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".

Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.

ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.

* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal

The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.

Extend the matcher so a request is recognized as S3 Tables when either:

  - its AWS V4 credential scope names SERVICE=s3tables; or
  - it carries the canonical AWS JSON RPC 1.1 content type and is
    unsigned (a request explicitly signed for SERVICE=s3 still wins).

The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).

Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.

* s3api: extend service matcher to all S3 Tables routes; simplify scope check

- Apply serviceMatcher to every S3 Tables route, not just the bare-path
  ones. ARN-bearing paths could otherwise be hit by an S3 object key
  that starts with arn:aws:s3tables:..., inside a bucket named
  "buckets", "namespaces", "tables", or "tag". One matcher everywhere
  closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
  credential-scope check. The scope shape is fixed at
  AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
  and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
  is exactly s3tables. Existing unit cases (including the
  access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
  the single Read could return short and make the equality check flaky.

* s3api: drop content-type fallback; sign s3 tables harness traffic instead

The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.

Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.

Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
2026-05-19 14:24:25 -07:00
Chris Lu
d57de6dc20 fix(s3): keep anonymous access working with EnableIam default (fixes #9557) (#9567)
fix(s3): keep anonymous access working with EnableIam default

`docker run seaweedfs` (and `weed mini` with no config) start with
EnableIam=true but no IAM config file and no identities. The advanced-IAM
init path was failing in 4.25 because of the missing STS signing key,
which masked a latent bug: SetIAMIntegration unconditionally flipped
isAuthEnabled to true, and isEnabled() also treated a non-nil
iamIntegration as auth-on. Once the mini SSE-S3 KEK landed in 4.26 the
STS fallback started succeeding, the integration got installed end to
end, and every anonymous S3 request bounced as AccessDenied.

Separate the two concerns: SetIAMIntegration just plumbs in the OIDC /
embedded-IAM machinery, and a new EnableAuthEnforcement opts in to
enforcement. The startup path calls it only when -s3.iam.config is
actually provided, so operators with explicit IAM configs still get auth
(preserves #7726). isEnabled() now reads isAuthEnabled only.
2026-05-19 13:03:30 -07:00
Chris Lu
c61d227613 s3api: verify source permission on CopyObject and UploadPartCopy (#9555)
* s3api: verify source permission on CopyObject and UploadPartCopy

The Auth middleware only authorized the destination because routes key on
the request URL. The source from X-Amz-Copy-Source was never evaluated,
so an STS session token scoped to one prefix could copy from any other
prefix in the same bucket.

Add AuthorizeCopySource on IdentityAccessManagement to run the full
bucket-policy + IAM/identity flow against the source, using a synthetic
GetObject request so action resolution lands on s3:GetObject (or
s3:GetObjectVersion when a source versionId is supplied). Both
CopyObjectHandler and CopyObjectPartHandler now invoke it before reading
the source.

* s3api: preserve presigned-URL session token on copy-source check

Presigned CopyObject / UploadPartCopy requests carry the STS session
token in the query string (X-Amz-Security-Token), not in a header.
Rebuilding the synthetic source URL from scratch dropped that token, so
the source authorization would fall through to non-STS paths and miss
session policy enforcement. Forward X-Amz-Security-Token from the
original query (alongside versionId), still excluding unrelated params
like uploadId/partNumber that would steer ResolveS3Action away from
s3:GetObject.
2026-05-18 21:35:53 -07:00
Chris Lu
58c3fa802c fix(s3): keep host-less bucket catch-all so reverse proxies work (#9540)
When s3.domainName is set, all bucket-prefix routes were gated on a
matching Host header. Requests that arrive via an IP, an unlisted
hostname, or a reverse proxy that rewrites Host hit no router and bounce
back as 405/404 (and 503 once a proxy maps the upstream error).

Register the path-style catch-all unconditionally, after the
host-specific routers, so it only fires when no Host matcher applies.
2026-05-18 19:44:19 -07:00
Chris Lu
6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
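
The widened pattern can be checked against both log shapes. This sketch uses Go's regexp package purely as an illustration; the workflow itself may use a different matcher, and the sample log lines are modeled on the ones quoted above.

```go
package main

import (
	"fmt"
	"regexp"
)

// s3Ready combines the old alternation with the new progress-board form.
var s3Ready = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)

func main() {
	lines := []string{
		"S3 service is ready at http://127.0.0.1:8333", // old-style line
		"  S3           ready (2s)",                    // progress-board line
		"  S3           starting",                      // not ready yet
	}
	for _, l := range lines {
		fmt.Printf("%-45q ready=%v\n", l, s3Ready.MatchString(l))
	}
}
```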
2026-05-17 19:13:09 -07:00
Konstantin Lebedev
7d1b16fbcd fix: ListBucketsHandler for pathStyleDomains (#9510)
2026-05-15 13:12:55 -07:00
Chris Lu
e9bcb8f4ad docs(s3/lifecycle): refresh DESIGN.md as-built (#9491)
* docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages

DESIGN.md was written as a phased implementation plan ("Phase 2 will
ship X, Phase 4 will ship Y"). All phases are now merged, plus the
post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially
changed the worker model (single subscription, walker throttle,
observability gauges). Rewrite the doc in present tense describing
what's actually there.

Net changes vs the prior plan-style doc:
- Algorithm pseudo-code reflects the single-subscription fan-out plus
  walkedThisPass within-pass guard.
- Walker invocation table replaces the implicit "two distinct calls"
  prose with three call sites (recovery / steady-state / empty-replay)
  and their throttle gates.
- New section on the subscription model (one Reader, ShardPredicate,
  fan-out by ev.ShardID).
- New section on cursor.LastWalkedNs and the WalkerInterval throttle.
- Observability section: gauges, heartbeat tokens, what each means.
- "Implementation history" table maps phases to merged PRs.
- "Future work" lists the four optimizations we deferred (long-lived
  subscription, bucket-coordinated walker, per-bucket lag metric,
  filer meta-log retention).

Drop the "Phase N — ..." narrative from the bottom; the PR history
table is the durable artifact now.

Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for
the operator-facing docs. README explains the sync workflow with the
external seaweedfs.wiki.git repo. Five pages:

- Home.md — landing page, supported rule shapes, what the worker does
- Operator-Guide.md — config knobs, when to change each, walker
  interval recommendations by cluster size
- Monitoring.md — Prometheus metric reference + heartbeat token table
  + suggested PromQL alerts
- Troubleshooting.md — stuck cursor, walker stuck, failure outcomes,
  cursor schema for manual inspection
- Architecture.md — high-level overview for newcomers; sits between
  Home.md (operator) and DESIGN.md (developer)

* docs(s3/lifecycle): address PR review feedback on docs

Coderabbit + gemini findings on #9491:

- Monitoring.md: clarify the "matches all dispatched" phrasing; note
  that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value
  (shouldn't appear in healthy systems); filter PromQL alerts to
  ignore zero-valued gauges so fresh-install heartbeats don't trip.
- Operator-Guide.md, Troubleshooting.md: clarify weed shell -master
  format as host:http_port.grpc_port (SeaweedFS ServerAddress).
- Troubleshooting.md: pause the s3_lifecycle job in the admin UI
  before manually editing a cursor file, otherwise the worker's
  save races with the operator's edit.
- Architecture.md, Home.md, Operator-Guide.md, Monitoring.md,
  Troubleshooting.md, DESIGN.md: add language tags (`text`) to
  fenced code blocks for markdownlint MD040 compliance.
- DESIGN.md: standardize on the S3 spec rule names
  (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`,
  `AbortIncompleteMultipartUpload`) and add a one-line note mapping
  them to the engine's `ActionKind*` constants.
- README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the
  sync workflow so the `cp` commands' repo-root-relative paths work
  whether the operator's shell is at the repo root or at
  docs/wiki/s3-lifecycle/.
- Home.md: was lagging the wiki-repo merged version (had the older
  pre-merge content). Re-sync from the wiki repo so source matches.

* docs(s3/lifecycle): remove wiki pages from PR

The wiki pages belong in seaweedfs.wiki.git, not the main repo. The
source-of-truth concern that motivated adding them here is real but
the cost — every code-review touchpoint requires reviewers to load
operator-facing pages too — outweighs it. The wiki pages are already
pushed locally (~/dev/seaweedfs.wiki); they'll publish on the
operator-side workflow.

This PR remains scoped to DESIGN.md (the developer-facing reference
that does belong with the code).

* docs(s3/lifecycle): drop Implementation history section

git log is the durable record of what shipped when; the prose table
duplicates it and goes stale faster than commit metadata.

* docs(s3/lifecycle): soften 'exactly once per run' in Goal

The prior phrasing overstated the guarantee versus the failure model
documented later in the same file. Reword to: 'process due objects
each pass; retryable/blocked outcomes get retried from the cursor on
later runs.' Surfaces the head-of-line-blocking semantics up front so
the rest of the doc reads consistently.

Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those
pages live in the wiki repo, not the main repo.
2026-05-13 17:06:14 -07:00
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
c6582228b8 feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC(), no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now);
TestRunShard_ColdStartDoesNotDoubleWalk still pins the integration
behavior end-to-end.
2026-05-13 14:09:13 -07:00
Chris Lu
79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. The clean and orphan-only categories were double-counted when a
   directory had no pointer and at least one orphan: both counters were
   incremented. Make them mutually exclusive so report totals sum to
   versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
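
A minimal sketch of the field sanitization in item 2, assuming a per-argument check along the lines described. `sanitizeField` is a hypothetical name, and the exact character classes in the real helper may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"unicode"
)

// sanitizeField passes safe values through unchanged and replaces anything
// containing whitespace, quotes, control characters, or backslashes with
// its strconv.Quote form, so one field always stays one token.
// Hypothetical helper for illustration only.
func sanitizeField(s string) string {
	for _, r := range s {
		if unicode.IsSpace(r) || unicode.IsControl(r) || r == '"' || r == '\'' || r == '\\' {
			return strconv.Quote(s)
		}
	}
	return s
}

func main() {
	fmt.Printf("[versioning-heal] event=produced bucket=%s key=%s\n",
		sanitizeField("b1"), sanitizeField("a event=spoof"))
}
```

A key with an embedded space and `event=spoof` ends up inside one quoted token, so a log parser that splits on spaces outside quotes sees a single `key=` field.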
2026-05-13 10:48:58 -07:00
Chris Lu
f5a4bfb514 fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete

deleteSpecificObjectVersion did two non-atomic filer ops: rm the version
blob, then update the .versions/ pointer. Step 2 failures were silently
logged and the client got 204 No Content, so any transient blip (filer timeout,
process restart between RPCs, lock contention) left the .versions/
directory naming a missing file. Subsequent GETs paid the 10-retry
self-heal cost and returned NoSuchKey — surfacing as "Storage not found"
to Veeam, which is what triggered this investigation.

Three changes:

1. Pre-roll the pointer for the singleton / multi-version-deleting-latest
   cases. The pointer is repointed (multi) or cleared (singleton) before
   the blob rm. A failure between the two steps leaves a recoverable
   orphan blob: the pointer is consistent, so GETs succeed or correctly
   miss without entering the stale-pointer self-heal path.

2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion
   with bounded retries (~6.3s worst case). When retries are exhausted
   the function now returns a non-nil error instead of swallowing it.
   The caller logs at Error level and queues the path for the
   reconciler.

3. Background reconciler drains stranded .versions/ pointer-to-missing
   states off the hot path. Bounded in-memory queue with capped retries;
   read-path heal remains as a last-resort safety net.

* fix(s3/versioning): address review on #9460

Four fixes addressing review on PR #9460. All four are correctness
fixes; no behavioural change for the happy path.

1. repointLatestBeforeDeletion: discriminate NotFound from transient
   errors when re-fetching the .versions/ entry. Previously any error
   returned rolled=true,nil — a transient filer hiccup at that point
   would cause the caller to skip the post-delete reconciliation AND
   proceed with the blob rm, producing exactly the dangling pointer
   state the PR aims to prevent. NotFound stays "vacuously consistent"
   (directory already gone); other errors surface so the caller aborts
   before removing the blob.

2. Move the singleton .versions/ teardown out of
   repointLatestBeforeDeletion (where it ran BEFORE the blob rm and
   always failed with "non-empty folder") into deleteSpecificObjectVersion
   AFTER the blob rm. Adds a wasSingleton return value so the caller
   knows when to run the teardown. Without this, every singleton-version
   delete in a versioned bucket leaked an empty .versions/ directory.

3. Wrap the list, getEntry, and mkFile calls inside
   repointLatestBeforeDeletion with retryFilerOp so the pre-roll has
   the same transient-failure resilience as the post-roll path. Without
   retries, a single transient blip causes the caller to fall back to
   the legacy non-atomic flow even when the filer recovers immediately.

4. healVersionsPointer in the reconciler: same NotFound-vs-transient
   discrimination on both the .versions/ getEntry and the latest-file
   presence probe. Previously a transient filer error would silently
   evict the candidate from the queue as "healed", leaving the real
   stranded state until a client read happened to surface it.

Also fixes the gemini-flagged consistency nit: the queued-for-reconciler
error log now uses normalizedObject instead of object so it matches the
queue entry's key.

* fix(s3/versioning): short-circuit terminal errors in retryFilerOp

Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound,
gRPC NotFound, context.Canceled, and context.DeadlineExceeded.
retryFilerOp now bails immediately on a terminal error and returns it
unwrapped, so callers like repointLatestBeforeDeletion.getEntry and
updateLatestVersionAfterDeletion.rm see the raw NotFound instead of
paying the ~6.3 s retry-budget delay AND parsing it out of an
"exhausted N retries" wrapper.

errors.Is and status.Code already walk the %w chain so today's call
sites still work, but the delay was real on the hot DELETE path
whenever a key was genuinely absent. Test added covering all five
terminal-error shapes — each must run the wrapped fn exactly once and
return in under 50 ms.
2026-05-13 10:14:27 -07:00
Chris Lu
3f1eaf9724 fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD

Successful GET/HEAD object requests never produced a fluent audit entry
because those handlers write the response directly (streaming for GET,
WriteHeader for HEAD) and never reach a PostLog call site. The wiki
advertises GET as an audited verb, so the asymmetry surprises operators
who rely on the log for read-access auditing.

Move the safety net into the track() middleware: tag each request with
an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it,
and emit a single fallback entry after the handler returns when nothing
fired. The recorder's status flows into the fallback so the audit row
still reflects 200/206 vs 404 etc. No double logging for handlers that
already emit (write helpers, error paths, bulk delete).

Refs #9463

* fix(s3/audit): defensive nil checks on audit-tracking helpers

Address PR review: guard against nil request and nil *atomic.Bool stored
under the audit-tracking key. The conditions are unreachable today (the
key is private and we only ever store new(atomic.Bool)), but the checks
are free and keep the helpers safe if a future caller misbehaves.

* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)

test(s3/audit): cover track() fallback wiring + cleanup

Adds two unit tests in weed/s3api/stats_test.go that exercise the
audit-tracking flag set up by track(): one verifies the fallback path
fires when a handler writes the response directly (the GET/HEAD object
regression in #9463), the other verifies the flag is set when a handler
emits PostLog itself so the fallback is skipped.

To make the wiring observable without standing up fluent, PostLog now
marks the audit flag before short-circuiting on a nil Logger; production
behavior is unchanged (no logger, no posting) but the flag stays
consistent.

Also drops two stale comments in s3api_object_handlers.go that still
referenced proxyToFiler — that helper was removed when GET/HEAD started
streaming from volume servers directly.

Stacks on #9467.
2026-05-13 09:24:59 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
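
A rough sketch of the wait-then-dispatch order, assuming a small waiter interface so the example stays free of external modules. *rate.Limiter from golang.org/x/time/rate has a Wait(ctx) method of this shape; the real dispatcher takes the concrete limiter, and dispatch, drained and the two demo helpers are hypothetical names. The wait-time histogram observation is left out.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// waiter is the subset of a rate limiter this sketch needs.
type waiter interface {
	Wait(ctx context.Context) error
}

// dispatch waits on the limiter, if any, and only then sends the RPC.
// A nil limiter disables throttling.
func dispatch(ctx context.Context, lim waiter, send func() error) error {
	if lim != nil {
		if err := lim.Wait(ctx); err != nil {
			return fmt.Errorf("limiter wait: %w", err)
		}
	}
	return send()
}

// drained is a stand-in limiter with no tokens: it blocks until ctx is done.
type drained struct{}

func (drained) Wait(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// canceledDispatch reports whether the RPC was sent, and whether the returned
// error wraps context.Canceled, when the context is already canceled.
func canceledDispatch() (sent bool, canceled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	err := dispatch(ctx, drained{}, func() error { sent = true; return nil })
	return sent, errors.Is(err, context.Canceled)
}

// unthrottledDispatch reports whether the RPC was sent with a nil limiter.
func unthrottledDispatch() bool {
	sent := false
	_ = dispatch(context.Background(), nil, func() error { sent = true; return nil })
	return sent
}

func main() {
	sent, canceled := canceledDispatch()
	fmt.Println(sent, canceled)        // prints false true
	fmt.Println(unthrottledDispatch()) // prints true
}
```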
2026-05-13 09:24:50 -07:00
Chris Lu
37e505b8fd refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass

Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.

drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.

cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.

Filer load: 16x -> 1x SubscribeMetadata streams per pass.

* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)

The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.

Same regression I fixed in 6796ab6db for the per-shard subscription;
now applied at the shared level. computeGlobalStartTsNs loads every
shard's cursor and picks the minimum, falling back to the cold-start
floor only for shards with no persisted cursor.
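
The floor selection can be sketched as below. This is an illustration under assumptions, not the actual computeGlobalStartTsNs: the cursor store is modeled as a plain map from shard ID to persisted TsNs, and globalStartTsNs is a hypothetical name.

```go
package main

import "fmt"

// globalStartTsNs picks the start timestamp for the shared subscription.
// cursors maps shard ID to its persisted cursor TsNs; a shard missing from
// the map has no persisted cursor and contributes the cold-start floor
// (runNow - maxTTL). The result is the minimum over all shards.
func globalStartTsNs(shards []int, cursors map[int]int64, runNow, maxTTL int64) int64 {
	floor := runNow - maxTTL
	start := floor
	first := true
	for _, s := range shards {
		ts, ok := cursors[s]
		if !ok {
			ts = floor // cold start for this shard only
		}
		if first || ts < start {
			start = ts
			first = false
		}
	}
	return start
}

func main() {
	shards := []int{0, 1, 2}
	// Shard 1 sits below the cold-start floor of 900; shard 2 has no cursor.
	cursors := map[int]int64{0: 950, 1: 880}
	fmt.Println(globalStartTsNs(shards, cursors, 1000, 100)) // prints 880
}
```

Shards whose cursor is fresher than the chosen start still filter older events locally, so starting the shared stream early costs extra reads but does not cause duplicate dispatches in this model.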
2026-05-13 02:13:11 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    snapshots with only walker-only rules produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned events back to runNow - maxTTL. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside the meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard was invoked with no -refresh, so only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.
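
A small sketch of the save pattern, assuming a hypothetical save callback in place of the real cursor persister. finishPass, endOfPass and demo are illustrative names, and the sketch simply returns a genuine drain error without modeling how the worker handles that path.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// endOfPass reports whether a drain error only means the pass budget ran out.
func endOfPass(err error) bool {
	return errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled)
}

// finishPass persists the cursor after the drain returns. The save gets a
// fresh 5s context derived from context.Background, because passCtx is
// usually already done by the time the drain exits.
func finishPass(drainErr error, save func(ctx context.Context) error) error {
	if drainErr != nil && !endOfPass(drainErr) {
		return drainErr // a real shard-level error
	}
	saveCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return save(saveCtx)
}

// demo runs a pass whose context expires while the subscription is idle and
// reports whether the cursor save went through.
func demo() (saved bool, err error) {
	passCtx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	<-passCtx.Done() // idle subscription: the drain only returns on ctx end

	err = finishPass(passCtx.Err(), func(ctx context.Context) error {
		if ctx.Err() != nil {
			return ctx.Err() // would happen if the canceled passCtx were reused
		}
		saved = true
		return nil
	})
	return saved, err
}

func main() {
	saved, err := demo()
	fmt.Println(saved, err) // prints true <nil>
}
```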

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (a helper file for the
scheduler/configload tests, restoring helpers that the deleted
bootstrap_test.go used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
2f682303fb fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464)
Follow-up to PR #9459 (merged before this fix landed). The walker
dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and
"NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay
drain (processMatches in run.go via #9462) use "RPC_ERROR" for the
same condition. Aligning so a single Prometheus query covers all
three delete paths.

Folds nil-response under RPC_ERROR rather than a separate label —
operationally it's the same class of failure (server returned no
usable response).
2026-05-12 12:38:52 -07:00
Chris Lu
495632730c feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log

Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds metrics
that answer three baseline questions: how long does a shard take, how
many events did it scan, and what did dispatch produce.

  - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
    wall-clock per shard. p95 climbing toward MaxRuntime means the
    shard is brushing its budget.
  - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
    events drainShardEvents processed. Pairs with the duration so a
    spike in events-per-shard correlates with a slow shard.
  - S3LifecycleDispatchCounter (existing, reused): processMatches now
    increments this with the outcome label, so streaming and
    daily_replay paths share one outcome view. Transport errors are
    counted under outcome="TRANSPORT_ERROR".

dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.

Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.

* fix(s3/lifecycle): align dispatch error label + clean test labels

Two PR-9462 review fixes from gemini:

1. processMatches' transport-failure label was "TRANSPORT_ERROR";
   streaming's dispatcher uses "RPC_ERROR" for the same condition
   (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
   the same Prometheus query covers both delete paths.

2. The dispatch-counter assertion test now deletes its label row
   on exit so the in-process Prometheus registry doesn't accumulate
   per-test state across the suite.
2026-05-12 12:15:20 -07:00
Chris Lu
f954781169 feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config

Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.

Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.

* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker

Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.

Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.

Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.

* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch

Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).

Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.

Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.

* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard

Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.

Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.

Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.

* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker

Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.

listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.

Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).

* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc

Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:

  - Path = logical key (not the .versions/<file> physical path), so
    bootstrap.Walk's MatchPath uses the user's intended path.
  - VersionID per sibling (version_id or "null").
  - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
    to explicit-null-bare, falling back to newest-by-mtime.
  - NoncurrentIndex rank computed against the latest's position.
  - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
    previous-newer sibling's mtime (legacy derivation).
  - IsDeleteMarker from ExtDeleteMarkerKey.
  - NumVersions = len(siblings).

Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.

expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.

MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.

Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.

* feat(s3/lifecycle): emit MPU init records from FilerListFunc

Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.

isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.

Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.

* feat(s3/lifecycle): wire walker into executeDailyReplay

Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.

Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.

Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.

* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate

Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow PromotedHash used, so the partition
is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.

With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.

* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment

Two PR-review fixes on 9459:

1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
   add a defensive nil-response check so the walk halts cleanly
   instead. Spotted by coderabbit.

2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
   "stays empty" in fallback mode, which gemini correctly pointed out
   is only true once rules are active. During bootstrap (rules
   compiled but IsActive=false) MaxEffectiveTTL is 0 while
   PromotedHash counts every non-disabled rule, so promoted becomes
   non-empty and the next post-activation run hits the recovery
   branch. That's the intended bootstrap walk — rewrite the comment
   to explain it rather than misstate the invariant.

Test: pins nil-response → error path on WalkerDispatcher.

* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion

Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.

Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.

Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).
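
The explicit flag can be illustrated roughly as follows. This is a simplified sketch, not expandVersionsDir: the sibling type and latestPos are hypothetical, siblings are assumed to be sorted newest first, and the explicit-null-bare check is omitted.

```go
package main

import "fmt"

// sibling is a hypothetical stand-in for one scanned version entry.
type sibling struct {
	versionID string
}

// latestPos returns the index of the latest version. siblings are assumed to
// be sorted newest first; pointer is the parent's recorded latest version id,
// or "" when the parent carries none.
func latestPos(siblings []sibling, pointer string) int {
	pos := 0
	pointerResolved := false
	if pointer != "" {
		for i, s := range siblings {
			if s.versionID == pointer {
				pos, pointerResolved = i, true
				break
			}
		}
	}
	if !pointerResolved {
		// No pointer, or a stale one that matches no sibling:
		// fall back to the newest sibling explicitly.
		pos = 0
	}
	return pos
}

func main() {
	siblings := []sibling{{versionID: "vnew"}, {versionID: "vold"}}
	fmt.Println(siblings[latestPos(siblings, "vgone")].versionID) // prints vnew
	fmt.Println(siblings[latestPos(siblings, "vold")].versionID)  // prints vold
}
```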

* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher

Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.

Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
2026-05-12 11:39:15 -07:00
Chris Lu
644664bbee feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457)
* feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a)

Replace the local replay-content-hash / max-effective-TTL helpers in
dailyrun with the engine package's canonical versions (ReplayContentHash,
MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface.

Adds PromotedHash to the cursor's recovery triggers: a partition flip
(rule moving between replay and walk because retention shifted) now
fires the rule-change branch alongside RuleSetHash mismatch. The
retentionWindow is set to MaxEffectiveTTL today, which keeps the
promoted set empty and the trigger dormant; Phase 4b will plumb the
real meta-log retention boundary so true scan_only promotions are
detected.

Cursor schema is unchanged — PromotedHash was already persisted as
the zero hash in Phase 2.

* docs(s3/lifecycle): note the one-time cursor rewind on hash format change

gemini-code-assist flagged that swapping localReplayContentHash for
engine.ReplayContentHash changes the persisted RuleSetHash byte layout
(sort order + tagged-field encoding). Phase-2 cursors mismatch on first
post-upgrade run and drop into the rule-change branch.

Going with option 3 (document the intentional one-time rewind). The
rewind is bounded to runNow - maxTTL (not time-zero), self-healing on
the next save, and daily_replay is off by default so the affected
population is limited to early adopters of the algorithm flag. A
migration shim or a hash-compat layer would carry the legacy encoder
forever for one bounded re-scan; not worth it.

Comment in runShard makes the trade explicit so a future reader doesn't
hunt for the "why does my cursor rewind once after upgrade" mystery.

* chore(s3/lifecycle): trim verbose comments in dailyrun

Cut multi-paragraph headers and narration that just described what the
code does. Kept the small WHY notes (per-match skip vs per-rule, the
one-time post-upgrade cursor rewind, scan_only rejection rationale).
Same behavior, ~150 fewer lines of comment.

* fix(s3/lifecycle): persist PromotedHash on the successful runShard save

The comment-trim pass dropped the field alongside a "stays empty in
Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b
turns promoted into a real value — and a save that writes zero would
make the next run falsely detect drift and rewind. Spotted by
gemini-code-assist on PR 9457.

Other save paths (recovery, drain-error) already persisted it; the
success path is the only one that was missing it. Now consistent.
2026-05-11 21:18:19 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).
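
The burst default described above works out as in this sketch. workerBurst is a hypothetical helper, and truncating 2 × rate to an integer is an assumption about the rounding, which the text does not specify.

```go
package main

import "fmt"

// workerBurst returns the limiter burst for a worker: the configured value
// when positive, otherwise 2 × rate, and never less than 1 so the bucket can
// always hold at least one token.
func workerBurst(rate float64, configured int) int {
	burst := configured
	if burst <= 0 {
		burst = int(2 * rate) // assumption: truncate toward zero
	}
	if burst < 1 {
		burst = 1
	}
	return burst
}

func main() {
	fmt.Println(workerBurst(50, 0))  // prints 100
	fmt.Println(workerBurst(0.2, 0)) // prints 1
	fmt.Println(workerBurst(50, 7))  // prints 7
}
```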

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95 — near-zero means the cap
  isn't binding, a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.

Tests swap the SharedEvenly case for three pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4
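
The divisor arithmetic for the three scenarios, as a sketch. perWorkerRate is a hypothetical name, and the real decorator's skip conditions (rps=0, no executors) are not modeled.

```go
package main

import "fmt"

// perWorkerRate splits the cluster budget across the workers that can run at
// once: min(executors, maxJobs), clamped to at least 1.
func perWorkerRate(clusterRPS float64, executors, maxJobs int) float64 {
	divisor := executors
	if maxJobs < divisor {
		divisor = maxJobs
	}
	if divisor < 1 {
		divisor = 1
	}
	return clusterRPS / float64(divisor)
}

func main() {
	fmt.Println(perWorkerRate(100, 4, 1))  // singleton job: prints 100
	fmt.Println(perWorkerRate(100, 4, 4))  // shared evenly: prints 25
	fmt.Println(perWorkerRate(100, 4, 10)) // divisor capped at 4: prints 25
}
```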

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
3f4cb6d2fb feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)

Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash),
and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot
is a stub keyed on a package-level atomic that the worker startup wiring
populates.

Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.
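
The clone-and-rewrite behavior for replay views can be sketched roughly as follows. This is an illustration under assumptions: the CompiledAction fields shown here (Rule, Disabled) and the replayClones helper are hypothetical, and the real partitioning by retention window is omitted. Only the Mode names and the three rules (clone, force ModeEventDriven, drop disabled actions) come from the description above.

```go
package main

import "fmt"

type Mode int

const (
	ModeEventDriven Mode = iota
	ModeScanOnly
)

// CompiledAction is a hypothetical, trimmed-down stand-in for the engine type.
type CompiledAction struct {
	Rule     string
	Mode     Mode
	Disabled bool
}

// replayClones returns fresh copies so that mutating a view never
// leaks back into the source snapshot or into another view.
func replayClones(src []*CompiledAction) []*CompiledAction {
	out := make([]*CompiledAction, 0, len(src))
	for _, a := range src {
		if a.Disabled {
			continue // disabled actions are excluded from the view
		}
		c := *a                  // copy the struct, not the pointer
		c.Mode = ModeEventDriven // replay clones force event-driven mode
		out = append(out, &c)
	}
	return out
}

func main() {
	src := []*CompiledAction{
		{Rule: "expire-logs", Mode: ModeScanOnly},
		{Rule: "old-rule", Mode: ModeEventDriven, Disabled: true},
	}
	view := replayClones(src)
	fmt.Println(len(view), view[0].Mode == ModeEventDriven, src[0].Mode == ModeScanOnly)
}
```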

No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.

API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.

30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split,
RecoveryView activating pre-BootstrapComplete actions,
ReplayContentHash partition-independence, PromotedHash sensitivity to
promotion in either direction, MaxEffectiveTTL aggregation. Build +
race-tests green.

* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics

Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):

1. Duplicated sort comparator between ReplayContentHash and
   PromotedHash. Extract sortHashItems shared helper so the two
   hashes use the same ordering by construction — if one drifted, the
   cursor could see a spurious "rule changed" on a no-op snapshot
   rebuild.

2. Duplicated writeField/writeInt closures. Extract hashWriter struct
   holding the sha256 running hash + lenbuf, with method helpers.
   Same allocation profile (one Hash, one tiny stack buffer per
   helper); just deduplicates ~20 lines.

3. shardID parameter on RulesForShard is unused. Per the design's
   open question, every shard sees every rule today (shard filter
   runs at the entry-iteration site, not view construction). Keep
   the parameter for API stability — removing it now would force
   a breaking change when bucket-shard ownership lands — and update
   the doc comment to explain why it's reserved.
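
Items 1 and 2 can be pictured with a sketch like the one below. It is not the engine's code: hashItem and its fields, contentHash, and the exact byte layout are assumptions made for illustration. What it tries to show is why a shared sort helper plus length-prefixed writes make the hash independent of input order.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"hash"
	"sort"
)

// hashItem is a hypothetical stand-in for one entry fed to a cursor hash.
type hashItem struct {
	Bucket   string
	RuleHash string
	TTLDays  int64
}

// sortHashItems gives every hash the same ordering by construction.
func sortHashItems(items []hashItem) {
	sort.Slice(items, func(i, j int) bool {
		if items[i].Bucket != items[j].Bucket {
			return items[i].Bucket < items[j].Bucket
		}
		return items[i].RuleHash < items[j].RuleHash
	})
}

// hashWriter bundles the running sha256 with a small scratch buffer.
type hashWriter struct {
	h      hash.Hash
	lenbuf [8]byte
}

func newHashWriter() *hashWriter { return &hashWriter{h: sha256.New()} }

// writeInt writes a fixed-width big-endian integer.
func (w *hashWriter) writeInt(v int64) {
	binary.BigEndian.PutUint64(w.lenbuf[:], uint64(v))
	w.h.Write(w.lenbuf[:])
}

// writeField length-prefixes the string so that the field boundaries
// are part of the hashed bytes ("ab"+"c" differs from "a"+"bc").
func (w *hashWriter) writeField(s string) {
	w.writeInt(int64(len(s)))
	w.h.Write([]byte(s))
}

func (w *hashWriter) sum() [32]byte {
	var out [32]byte
	copy(out[:], w.h.Sum(nil))
	return out
}

// contentHash sorts a copy of the input, then hashes each field.
func contentHash(items []hashItem) [32]byte {
	sorted := append([]hashItem(nil), items...)
	sortHashItems(sorted)
	w := newHashWriter()
	for _, it := range sorted {
		w.writeField(it.Bucket)
		w.writeField(it.RuleHash)
		w.writeInt(it.TTLDays)
	}
	return w.sum()
}

func main() {
	a := []hashItem{{"b1", "r1", 30}, {"b2", "r2", 7}}
	b := []hashItem{{"b2", "r2", 7}, {"b1", "r1", 30}}
	fmt.Println(contentHash(a) == contentHash(b)) // true: order does not matter
}
```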

go build ./... clean; engine test suite green.
2026-05-11 18:07:54 -07:00
Chris Lu
122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)
* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
  under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
  Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
  replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s;
  server outcomes (RETRY_LATER / BLOCKED) halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
  exist; only the hash gates a future addition.

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
  default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
  current scheduler.Scheduler, daily_replay routes through
  dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
  same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
  4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
  them with one-line redirects; the local versions hash the same
  fields so the cursor stays valid across the swap.

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.
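
The default flip and unknown-value fallback might look something like this sketch. The two enum strings come from the commit message; parseAlgorithm, the constant names, and the whitespace trimming are assumptions, not the handler's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	algorithmStreaming   = "streaming"
	algorithmDailyReplay = "daily_replay"
)

// parseAlgorithm is a hypothetical parser: streaming must be requested
// explicitly, and anything else (empty, unknown) gets the default.
func parseAlgorithm(raw string) string {
	if strings.TrimSpace(raw) == algorithmStreaming {
		return algorithmStreaming
	}
	return algorithmDailyReplay
}

func main() {
	fmt.Println(parseAlgorithm(""))          // daily_replay
	fmt.Println(parseAlgorithm("streaming")) // streaming
	fmt.Println(parseAlgorithm("bogus"))     // daily_replay
}
```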

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
  due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
   persisted cursor to lastOK = TsNs of the last event processed,
   including events whose matches were skipped as not-yet-due. Those
   skipped matches would never be re-scanned, so objects under
   long-TTL rules would never expire.

   Track a "stuck" flag in drainShardEvents: the first event with a
   skipped (future-DueTime) match stops cursorAdvanceTo from rising,
   but the loop keeps processing later events to dispatch any that ARE
   due. The persisted cursor sits at the last fully-processed event so
   tomorrow's run re-scans from the skipped event onward and the
   future-due matches get re-evaluated when they age in.

   processMatches now returns (skippedAny, halted, err) so the drain
   loop can tell apart "event fully drained" from "event had pending
   future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
   ActionKind. A replay-eligible kind with Mode != ModeEventDriven
   (e.g. ModeScanOnly via retention promotion) passed the check but
   then got silently ignored by router.Route, which gates dispatch
   on Mode == ModeEventDriven. Reject loudly with the typed error
   so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
   goroutines regardless of cfg.Workers. Add a Workers field to
   dailyrun.Config and gate the goroutine fan-out on a semaphore of
   that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
   snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
   shard. Mid-run Compile would let shards process different rule
   sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
   accepted a `now func() time.Time` and called it multiple times.
   DueTime comparisons would slip as the run wore on. Capture runNow
   once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
   broke out of the loop and returned nil, marking interrupted runs as
   successful. Return ctx.Err() instead so the caller propagates the
   interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
   accepted empty files, mismatched shard_ids, and hash slices shorter
   than 32 bytes (copy() would zero-pad). Each now returns a typed
   error so the run halts and an operator investigates rather than
   silently re-scanning from time zero or persisting a zero-padded
   hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
   guard in runShard was unreachable because drainShardEvents
   initialized lastOK := startTsNs and only ever raised it. Removed;
   the new cursor-advancement semantics handle the "no events
   processed" case implicitly.

Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.
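
The "stuck" flag from fix 1 (cursor advancement) can be sketched as below. This is an illustration under assumptions: the event type, its hasPending field, and this drain function are hypothetical simplifications of drainShardEvents, with match processing reduced to a callback.

```go
package main

import "fmt"

// event is a hypothetical stand-in for a meta-log event.
type event struct {
	TsNs       int64
	hasPending bool // at least one match on this event is not yet due
}

// drain returns the timestamp the persisted cursor may advance to.
func drain(startTsNs int64, events []event, handle func(event)) int64 {
	cursorAdvanceTo := startTsNs
	stuck := false
	for _, ev := range events {
		handle(ev) // later events are still processed so due matches go out
		if ev.hasPending {
			stuck = true // from here on the cursor must not rise
		}
		if !stuck {
			cursorAdvanceTo = ev.TsNs
		}
	}
	return cursorAdvanceTo
}

func main() {
	events := []event{{TsNs: 10}, {TsNs: 20, hasPending: true}, {TsNs: 30}}
	handled := 0
	got := drain(0, events, func(event) { handled++ })
	fmt.Println(got, handled) // 10 3
}
```

The cursor stays at 10, so the next run re-scans from the event at 20 onward, while the event at 30 is still handled in this run.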

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
  actions with Days <= 0 don't compile to a CompiledAction (see
  weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
  empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
  gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
  all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
  the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
  _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
  the new validation rules.
- Existing process-matches tests reshaped for the
  (skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.
2026-05-11 18:07:17 -07:00
Chris Lu
46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.

Prerequisite for the daily-replay lifecycle worker (Phase 2+).

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so
  two back-to-back calls can decrease if the wall clock steps
  backward — the prior test was exercising OS clock behavior rather
  than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clean break: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00