Commit Graph

13753 Commits

Author SHA1 Message Date
Chris Lu
8cf42a5abb test(s3/lifecycle): assert per-goroutine errors in fake-server concurrent test (#9393)
test(s3/lifecycle): assert per-goroutine errors in concurrent fake test

The previous TestFake_ConcurrentCallsSerializeWithoutDeadlock dropped
the err return from each LifecycleDelete call, so a regression in the
concurrent path could pass the length-only assertion. Capture each
err on a buffered channel and require.NoError after wg.Wait().
2026-05-09 18:54:15 -07:00
Chris Lu
ddfb219ec3 test(s3/lifecycle): fake LifecycleDelete server (Phase 12 slice) (#9391)
* test(s3/lifecycle): fake LifecycleDelete server for component tests

A reusable double for SeaweedS3LifecycleInternalServer with per-key
FIFO outcome queues, a fallback Default, and recorded request capture.
Tests of the worker pipeline that need to hit the proto boundary can
queue up DONE/NOOP/RETRY/FATAL/SKIPPED_OBJECT_LOCK responses per
(bucket, objectPath, versionId) and assert dispatch order against
Recorded(). SetError flips the server into transport-failure mode
without polluting the request log.

* test(s3/lifecycle): use struct map key for FakeLifecycleServer queues

Bucket / object path / version-id are user-supplied strings that can
contain "/" or "@", which would collide if the queue map were keyed by
"<bucket>/<object>@<version>". Switch to a struct key so the
components stay separate.

* test(s3/lifecycle): deep-copy recorded LifecycleDelete requests

Tests that mutate a Recorded() entry — or a request pointer they
already passed in — were able to corrupt the fake's bookkeeping
because the slice carried shared pointers. Clone with proto.Clone at
both record and read time so the fake holds an independent snapshot
of every arriving request and hands callers an independent snapshot
back. Tightened TestFake_VersionIDPartOfKey error checks while there.
2026-05-09 18:38:52 -07:00
Chris Lu
bb0c7c779f feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0 (Phase 2b) (#9390)
* refactor(s3): thread metadataOnly into delete helpers

Add a metadataOnly bool to deleteUnversionedObjectWithClient and
deleteSpecificObjectVersion. When true the helper sends IsDeleteData=
false to the filer's DeleteEntry RPC so per-chunk DeleteFile RPCs are
skipped — the volume server reclaims chunks on its own at TTL drop.
Non-lifecycle callers (DELETE handlers, batch delete) pass false to
preserve today's eager-chunk-delete behavior; only the lifecycle
handler in the next commit will pass true.

* feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0

Per-write TTL stamping (PR 9377) sets Attributes.TtlSec on every
lifecycle-fitting entry. When the live entry the LifecycleDelete
handler fetched carries TtlSec > 0 the volume server is guaranteed
to reclaim chunks at TTL drop, so the filer can skip per-chunk
DeleteFile RPCs and just remove the entry record. lifecycleDispatch
now computes metadataOnly from the live entry and threads it through
the unversioned, suspended-null, and noncurrent/expired-marker delete
paths. createDeleteMarker is unaffected — it creates a marker, never
deletes chunks.
2026-05-09 18:38:38 -07:00
Chris Lu
255e9cd0f7 test(s3/lifecycle): cover reader cursor + Run validation contracts (#9389)
* test(s3/lifecycle): cover reader cursor + Run validation contracts

Layer 2 tests pinning four reader-package contracts the dispatcher
pipeline depends on: MinTsNs anchors at frozen positions, Snapshot
returns a deep copy in both directions, Restore replaces (not merges),
and Run validates ShardID/Events/BucketsPath before subscribing.

* test(s3/lifecycle): tighten cursor composition assertions

Snapshot deep-copy: also assert cursor doesn't see keys added to the
returned map. Restore replace: freeze before second Restore and assert
IsFrozen returns false after, pinning the contract that Restore wipes
frozen state alongside the value map. Run validation: bound the call
with a 5s context timeout so a regression that lets Run reach the nil
client surfaces as a failure instead of a hang.
2026-05-09 14:32:11 -07:00
Chris Lu
aa280443e7 test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher (#9387)
* test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher

The existing dispatcher unit tests cover individual outcomes
(DONE / RETRY_LATER / BLOCKED / etc.) on a single shard, and
pipeline_test.go has only one end-to-end happy-path assertion.
Multi-shard composition — the contract Pipeline.Run wires up at
runtime — was untested at the component level.

Add four Layer 2 tests in dispatcher/multi_shard_test.go:

  Two events for two shards land in different schedules, dispatch
  independently, and each cursor advances only for its own event
  (no cross-contamination on the action-key map).

  A poison event on shard 0 returns BLOCKED and freezes shard 0's
  cursor; shard 1's normal event continues to dispatch and its
  cursor advances. Per-shard isolation contract.

  Save/Load round-trips a per-shard cursor snapshot through the
  Persister: a fresh dispatcher restores the same TsNs map. Pins
  the contract Pipeline.Run drives on the checkpoint ticker.

  RETRY_LATER respects RetryBackoff against the wall clock — a
  Tick within the backoff window doesn't re-dispatch; a Tick past
  the new DueTime does. Guards against premature retries from
  refresh ticks landing inside the backoff.

Pipeline.Run itself can't run here (it builds a real reader.Reader),
so the tests share the same fakeClient pattern dispatcher_test.go
uses and drive Tick directly.

* test(s3/lifecycle): drop unused snapshot helper and addAndTick parameter
2026-05-09 14:12:21 -07:00
Chris Lu
1854101125 feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386)
* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8)

scan_only actions only fire from the bootstrap walk: the engine
classifies a rule as scan_only when its retention horizon exceeds the
meta-log retention, so event-driven routing can't be trusted. Today
each bucket walks once per process, so a long-running worker never
revisits — scan_only retention only catches up when the worker
restarts.

Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk
(name -> completion time). KickOffNew now re-walks a bucket whose
last walk completed more than BootstrapInterval ago. Zero interval
preserves the legacy walk-once-per-process behavior so existing
deployments don't change cadence by default. walkBucket re-stamps
on success and clears the stamp on failure (via MarkDirty), so the
next KickOffNew picks failed walks back up.

Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and
a Now func() for testable time travel.

weed shell run-shard grows --bootstrap-interval (cadence knob) and
--force-bootstrap (drop in-memory state at startup so every bucket
walks again immediately, useful when a config change should take
effect without a restart).

Tests: cadence respected (skip inside interval, re-walk past it);
zero interval keeps once-per-process; MarkDirty forces re-walk
under a 24h interval; MarkAllDirty resets every record. The
fakeClock helper guards the test clock with a mutex so race-detector
runs are clean.

* fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag

Three issues with the Phase 8 cadence work as it landed:

1. lastWalk did double duty as both completed-walk timestamp and
   in-flight debounce. A walk that took longer than BootstrapInterval
   would have a fresh KickOffNew start a duplicate goroutine on the
   next refresh tick because the stamp from KickOffNew looked stale
   against the interval. Split into lastCompleted (set on success)
   and inFlight (set on dispatch, cleared after the walk goroutine
   returns success or failure). KickOffNew skips inFlight buckets
   regardless of cadence.

2. The cadence knob existed on `weed shell` but not on the production
   path: scheduler.Scheduler constructed BucketBootstrapper without
   BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had
   no field for it. Add Scheduler.BootstrapInterval, parse
   `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk-
   once-per-process; negative clamps to zero), and forward it from
   the handler. Tests cover default, override, clamp, and explicit-zero.

3. --force-bootstrap was a no-op: BucketBootstrapper is freshly
   allocated at command start, so MarkAllDirty on empty state does
   nothing, and the flag couldn't influence an already-running
   process anyway. Remove it; a real runtime trigger (SIGHUP, control
   RPC) is a separate change.

In-flight regression: a blockingInjector pins the first walk in
progress while the test advances the clock past the interval. The
second KickOffNew is a no-op (inFlight check). After release, the
post-completion KickOffNew within the interval is also a no-op.

* test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock

The cadence test polled listedN to know "the walk happened" — but
that fires once both list passes are issued, while the success-stamp
lands later, after walkBucketDir returns. A clock.Advance(30m)
between those two events would record the stamp at clock+30m
instead of T0; the next assertion would then see now.Sub(last) < 1h
and skip the expected re-walk. Tight in practice but exposed under
-race / load.

Add a waitForCompleted helper that polls b.lastCompleted directly,
and use it before each clock advance in both the cadence and zero-
interval tests.

* fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks

Two follow-ups on Phase 8.

The worker config descriptor had no bootstrap_interval_minutes field,
so the production operator UI couldn't enable the cadence — only the
internal ParseConfig + Scheduler wiring knew about it. Add the field
to the cadence section (MinValue=0 since 0 is the legacy default) and
include the default in DefaultValues so existing deployments see the
knob with the right preset.

MarkDirty / MarkAllDirty silently lost their effect when a walk was
in flight: the methods cleared lastCompleted, but the walk's success
path then wrote a fresh timestamp, hiding the operator's invalidation.
Track a pendingDirty set; the walk goroutine consumes the flag on
exit and skips the success stamp, so the next KickOffNew picks the
bucket up immediately.

Regression: pin a walk in progress with a blockingInjector, MarkDirty
the bucket, release the walk, and assert lastCompleted stayed empty
plus the next KickOffNew triggers a new walk inside the
BootstrapInterval window.

* refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty

These methods were the operator-hook half of Phase 8, but the only
caller (--force-bootstrap on the shell command) was removed when it
turned out to be a no-op against a freshly-allocated bootstrapper.
Nothing in production calls them anymore.

Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty
set, the dirty-suppression branch in walkBucket, and the three tests
that only exercised those methods. BootstrapInterval-driven
re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP,
control RPC) is a separate change with a real call site.
2026-05-09 13:42:31 -07:00
Chris Lu
edfa1ce210 feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4) (#9385)
* feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4)

Bootstrap covers existing versions, but a live PUT that creates a new
.versions/<v-new> file and updates the parent's ExtLatestVersionIdKey
didn't fire NoncurrentDays / NewerNoncurrent on the displaced prior
version until the next bootstrap. Close that runtime gap.

The meta-log already emits an Update event for the .versions/
directory itself when the latest pointer changes; the router was
dropping it because buildObjectInfo returns nil for directories. New
branch in Route detects that shape (versioned bucket, NewEntry +
OldEntry both directories with the .versions/ suffix, ExtLatestVersionIdKey
changed, ID different from the new ID) and emits a Match against the
LOGICAL key with VersionID=oldID. Match.Identity comes from a single
LookupVersion RPC for the displaced version file; SuccessorModTime is
the directory update's mtime, which is the moment the displaced
version became noncurrent.

SiblingLister grows LookupVersion(bucket, key, versionID) for that
single-RPC fetch. filerSiblingLister implements it; routing path
treats NotFound as "displaced version was hard-deleted in the
meantime, suppress" rather than an error.

The router gates the lookup on at least one active event-driven
NoncurrentDays / NewerNoncurrent rule for the bucket, so most buckets
pay nothing per directory update.

Tests: pointer-flip fires NoncurrentDays with displaced version_id;
unchanged pointer skips; empty old pointer skips (first-PUT scenario);
displaced-version NotFound suppresses; no-rule skips lookup;
NewerNoncurrentVersions retains rank-0; unversioned bucket skips.

* fix(s3/lifecycle): SuccessorModTime cache + NewerNoncurrent expansion

Two correctness gaps in pointer-transition routing.

The .versions/ directory's own Attributes.Mtime is preserved across
pointer updates by updateLatestVersionInDirectory: it's a stale clock
relative to the freshly-written latest version. Using it as the
displaced version's SuccessorModTime made NoncurrentDays compute
due = staleMtime + days, which fires immediately on a fresh PUT
into an old .versions/ container. Read ExtLatestVersionMtimeKey
written by setCachedListMetadata; suppress (return no matches) when
the cache is missing rather than fall back to dir mtime.

Single-oldID lookup is only enough for pure
NoncurrentVersionExpirationDays. Any rule with NewerNoncurrentVersions
> 0 cares about the noncurrent ranks, and a pointer flip shifts every
prior noncurrent's index by one — the version that just crossed the
keep-count threshold needs to be evaluated too. When any matching
rule needs ranks, list the full .versions/ container, sort newest-
first with mtime + version-id tiebreak, and route every noncurrent
with its real index. Identity-CAS dedups against earlier schedules.

SiblingLister grows ListVersions(bucket, key); filerSiblingLister's
implementation paginates the container fully.

Two regression tests: stale dir mtime + correct cached mtime
schedules ~30 days out (not immediate); NewerNoncurrentVersions=2
with 4 versions fires on the rank-2 entry that just crossed the
threshold while rank-0/1 are retained.

* fix(s3/lifecycle): bound pointer-transition expansion to threshold crossings

routePointerTransitionExpand emitted a Match for every eligible
noncurrent on every PUT. Schedule.Add doesn't dedup, so identity-CAS
at dispatch only saved the wasted RPC, not the heap slot. A hot key
with many already-eligible versions and a count rule would push
O(versions) entries per flip, repeatedly, until dispatch caught up.

Bound the emission to versions that newly entered eligibility on
this specific flip: rank 0 (the displaced version, for the
NoncurrentDays clock) plus rank == rule.NewerNoncurrentVersions
for each active count-gated rule (the version that just crossed
from kept to expired). Bootstrap still owns full backfill for
versions that were already over-threshold.

Adds a regression with 6 versions and NewerNoncurrentVersions=2:
asserts only the rank-2 entry that just crossed fires, not the
already-over-threshold rank-3/rank-4 entries.

* fix(s3/lifecycle): suppress pointer-transition expansion when newID missing

routePointerTransitionExpand defaulted latestPos to 0 if newID wasn't
found in the listing. That made the actual newest sibling latest
against the pointer's intent, then misranked every other version. A
race between the pointer write and the version write could land us
there.

Default latestPos to -1, set it only on a real match, and suppress
the expansion when the search misses. Bootstrap repairs state on
the next walk.

The NewerNoncurrentVersions retention test was setting only
lookupEntry, so Route never reached the expansion path it claimed to
exercise. Repoint to listVersions and assert ListVersions was
consulted while LookupVersion was not. Adds a regression covering
the missing-newID suppression directly.

* fix(s3/lifecycle): include bare null version in pointer-transition routing

Bootstrap models the bare-key object as a "null" sibling alongside
.versions/ children, but the live pointer-transition path didn't.
Two cases lost:

1. oldID == "" was treated as "nothing displaced". A pre-versioning
   bare object becomes noncurrent when the first versioned PUT lands
   and the pointer flips to a real id, but live routing skipped it
   and waited for the next bootstrap.

2. The expansion path's ListVersions returned only .versions/
   children. With a bare null in the picture, the noncurrent ranks
   were wrong, so NewerNoncurrentVersions could keep the wrong
   versions and delete the right ones (or vice versa).

SiblingLister grows LookupNullVersion(bucket, key) returning the
bare entry plus an explicit-null flag (matches the bootstrap shape).
filerSiblingLister implements it via util.NewFullPath +
filer_pb.LookupEntry.

routePointerTransitionDisplaced: oldID == "" now consults
LookupNullVersion. When the bare entry exists, route it as
VersionID="null" against the LOGICAL key.

routePointerTransitionExpand: collect .versions/ children plus the
null entry into one sibling slice before sorting and ranking. The
threshold-crossing logic now sees the same N-version set that
bootstrap would compute.

Three new tests: oldID == "" with no null is a no-op (one null
lookup, no version lookup); oldID == "" with a bare null schedules
NoncurrentDays as VersionID="null"; expansion with a bare null
between .versions/ siblings places null at its mtime-correct rank
and only that rank-N entry fires.

* fix(s3/lifecycle): atomic listPageSize so test cleanup doesn't race

KickOffNew dispatches walks via `go b.walkBucket(...)`. A test that
finishes before its goroutines drain leaves them running into the
next test's t.Cleanup, which mutates listPageSize. -race spots the
read/write collision intermittently. Convert listPageSize to
atomic.Uint32; tests use Load/Store. No production semantics change.

* fix(s3/lifecycle): null becomes latest when suspended PUT clears pointer

The router treated newID == "" as if the cached
ExtLatestVersionMtimeKey were still authoritative — but that cache
holds the displaced version's mtime, written by setCachedListMetadata
when the prior version became latest. Using it as SuccessorModTime
made NoncurrentDays=30 immediately fire on a 100-day-old displaced
version even though it became noncurrent today.

When newID == "" the bare null is the new latest. Look it up,
substitute its mtime as the successor clock, and substitute "null"
as the latestPos target for the expansion path's id match. Both
displaced and expand paths now derive the right clock.

updateIsLatestFlagsForSuspendedVersioning was the upstream cause of
the staleness — it cleared ExtLatestVersionIdKey and FileNameKey but
left the cached size/mtime/etag/owner/delete-marker behind. Call
clearCachedVersionMetadata so the .versions/ container is consistent
with "null is latest". The router-side guard is still needed for
older deployments that ran the buggy code, but new writes won't
exercise the workaround.

Two regressions: 100-day-old displaced under NoncurrentDays=30 with
a today-null PUT schedules ~30d out (not immediate); same shape with
NewerNoncurrentVersions=2 ranks the null at latest and only the
rank-2 entry fires.
2026-05-09 12:21:35 -07:00
Chris Lu
2f7ac1d664 feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3) (#9383)
* feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3)

Bootstrap now expands every <key>.versions/ directory into one event
per version with sibling state pre-computed. The router fires
NoncurrentDays / NewerNoncurrent off these events using
SuccessorModTime as the noncurrent clock; previously these rules
never ran on a versioned bucket because buildObjectInfo couldn't
classify version-folder events without the latest pointer.

Mechanics

walkBucketDir treats a directory ending in .versions and carrying
ExtLatestVersionIdKey as a SeaweedFS .versions container — emit it
once and skip the recursion. Coincidentally-named directories without
the latest pointer recurse normally.

BucketBootstrapper.expandVersionsDir lists the children, sorts
newest-first by mtime, resolves the latest position from the pointer,
and injects a synthesized reader.Event per version with
BootstrapVersion populated. NoncurrentIndex is 0-based among
noncurrents in newest-first order; SuccessorModTime is the immediate
newer sibling's mtime (zero for the latest). Pointer naming a missing
or absent version falls back to the newest-by-mtime sibling so a
race window can't flag every entry as noncurrent.

routeBootstrapVersion uses BootstrapVersion to build ObjectInfo
directly (bypassing the version-folder skip in buildObjectInfo) and
runs the standard match loop. ABORT_MPU is excluded by kind-shape
gate. The schedule clock uses SuccessorModTime for noncurrents and
ModTime for the latest, so the dispatcher fires when the rule's days
threshold is met. Match.ObjectKey is the LOGICAL key,
Match.VersionID is the marker's stored version_id — the dispatcher
reaches deleteSpecificObjectVersion or createDeleteMarker correctly.

Layer 2 tests cover both sides. Router: latest fires ExpirationDays;
noncurrent fires NoncurrentDays; NewerNoncurrentVersions retains the
N newest noncurrents; ABORT_MPU never matches. Bootstrap: .versions
dir emitted once and not recursed; missing latest pointer falls back
to newest; backdated PUT (latest pointer is older by mtime) keeps
the right noncurrent index; delete-marker flag propagates.

* fix(s3/lifecycle): no VersionID for latest expirations, child-based dir disambig

Two correctness gaps in Phase 5b/3.

Bootstrap was pinning the version_id on every Match. For
EXPIRATION_DAYS / EXPIRATION_DATE on the latest version this is
unsafe: between schedule and dispatch a fresh PUT can land, the
dispatcher would still identity-match against the original version's
bytes (it still exists at that path) and the resulting delete marker
would hide the new latest. Drop VersionID for those kinds; an empty
VersionID makes the dispatcher fetch the current latest, where
identity-CAS resolves to STALE_IDENTITY and bootstrap re-schedules
with the new latest's identity. NoncurrentDays / NewerNoncurrent /
EXPIRED_DELETE_MARKER still pin the version_id since those are
version-targeted.

isVersionsDir gating on ExtLatestVersionIdKey lost a race window:
createDeleteMarker writes the version file before updating the
parent's Extended pointer, so a walk between those two steps would
see a .versions/ dir without the pointer, recurse into it, and emit
raw version files that the router drops. Match the suffix only and
let expandVersionsDir disambiguate by child inspection: if any child
carries ExtVersionIdKey it's a real .versions container and we expand;
otherwise it's a coincidentally-named user folder and we recurse via
the bucket-walk's own callback so nested entries still flow through.

Tests: latest-expiration assertion flipped to expect empty VersionID;
new tests cover the coincidentally-named-folder recursion and the
race-window expansion (children present, pointer absent).

* fix(s3/lifecycle): filter directory + missing-version-id children at listing

expandVersionsDir's listing callback collected every child with
attributes; subdirectories or entries without ExtVersionIdKey would
make it past the empty-id skip in the inner loop but still inflate
NumVersions and skew NoncurrentIndex (the rank derives from the
filtered slice's position, which was wrong when the unfiltered slice
was sorted). Drop directories at listing time and partition the
file children into a versions slice that's the actual rank source.

Test cleanups: out-of-order-mtime test now sets v1 older than v2 so
latestPos > 0 actually exercises the rank-skip branch in
expandVersionsDir; bootstrapVersionEntry preserves nanosecond
precision via MtimeNs to match markerLoneEntry's pattern; drop a
leftover unused idx variable.

* fix(s3/lifecycle): null version + canonical version-id tiebreak

Two correctness gaps in Phase 5b/3 bootstrap.

Null versions live at the bare logical path, not under .versions/.
Bootstrap previously expanded only .versions/<key>/ children, so:
  - pre-versioning objects with newer .versions/ history never had
    their null version expired by NoncurrentDays
  - suspended-bucket writes (which clear the .versions/ latest pointer
    so null becomes current) had every .versions/ child wrongly
    classified as latest by the buildObjectInfo fallback

expandVersionsDir now looks up the bare key via NewFullPath +
LookupEntry, accepts a regular file or an explicit S3 directory-key
marker (Mime set), and folds it into the sibling set with
VersionID="null". Latest resolution: pointer present + names a real
id wins; pointer absent + null exists makes null latest; otherwise
falls back to newest sibling. The walker's regular emission for the
bare entry would otherwise duplicate, so walkBucketDir now does a
two-pass walk per directory level — .versions/ first, then everything
else with a per-walk skipBare set keyed by bucket-relative path that
expandVersionsDir populates when it claims a bare null sibling.

Sort tiebreak: PUTs only set second-level Mtime, so two versions
written in the same second tied. The unstable secondary order let
old-format version filenames sort oldest-first and corrupt
NoncurrentIndex under NewerNoncurrentVersions retention. Add
CompareVersionIds to s3lifecycle/version_time.go (mirrors the
canonical comparator in s3api/s3api_version_id.go to avoid the
import cycle) and use it as a secondary key after mtime equality.

Tests: pre-versioning null-as-noncurrent, suspended null-as-current,
directory-key marker as null version, end-to-end claim through
walkBucketDir's two-pass ordering, and same-second tiebreak via
canonical version-id ordering. fakeFilerClient grows a
LookupDirectoryEntry implementation backed by the same in-memory tree.

* fix(s3/lifecycle): only treat explicit-null bare entries as current

The pointer-missing branch in expandVersionsDir made null latest as
soon as a bare object was found. That's correct for suspended-bucket
writes (s3api_object_handlers_put.go writes the bare entry with
ExtVersionIdKey="null") but wrong for the pre-versioning race window:
a brand-new version under .versions/<file> exists before the parent's
ExtLatestVersionIdKey update lands, and a pre-versioning bare object
has no version-id marker. Marking that older bare object latest hides
the real new version and skips noncurrent expiration of the null
until the next process restart/bootstrap.

Distinguish the two: lookupNullVersion now returns whether the bare
entry's Extended map carries ExtVersionIdKey="null" (the suspended
write marker). expandVersionsDir's pointer-missing branch only
promotes null to latest when explicit; otherwise it falls back to
newest-sibling, which is safe for the race window since the new
version's mtime is fresher than the bare object's.

The existing suspended-null test now uses a new helper that adds the
explicit marker. New regression test covers the race window: bare
entry without the marker + a fresh .versions/<v1> file + missing
parent pointer must keep v1 as latest and the null as noncurrent.

* fix(s3/lifecycle): only the newest item can be the explicit-null latest

The pointer-missing branch in expandVersionsDir scanned every item for
an explicit null and promoted it to latest. After a suspended->enabled
transition that's the wrong call: createVersion writes the version
file before updating ExtLatestVersionIdKey, so a bootstrap that lands
in the race window sees an older bare null with ExtVersionIdKey="null"
plus a newer .versions/<v-new> child and no parent pointer. Promoting
the null misclassifies v-new as noncurrent and skips both the new
version's current-version expiration and the null's noncurrent
scheduling until the next bootstrap.

Constrain the explicit-null branch to items[0]: if the suspended-null
write is genuinely current it'll be the newest by mtime AND tagged.
Anything else falls through to the newest-sibling default.

Adds a regression test for the suspended->re-enabled race.

* fix(s3/lifecycle): paginate bootstrap directory listings

SeaweedList(..., limit=0) is a single-page request: the filer caps
limit=0 at DirListingLimit (1000 by default) and returns whatever fits
in one round trip. expandVersionsDir and walkBucketDir both relied on
that, so any directory bigger than the cap silently truncated. For
noncurrent retention this is correctness, not just scale — a hot key
with more versions than the cap had its rank/sort math computed off
the first page only, NumVersions, NoncurrentIndex, SuccessorModTime,
and the latest-fallback all wrong, with the older versions never
scheduled until a future bootstrap.

Add a listAll helper that drives pagination via StartFromFileName +
inclusive=false, looping until a page returns fewer entries than the
configured page size. Use it in both call sites. Page size is a var
(listPageSize, default 1024) so tests can shrink it without
generating thousands of entries.

The fake filer client now mirrors the real semantics: sort children
by name, honor StartFromFileName/InclusiveStartFrom, cap at Limit.
New regression tests force a small page size and assert the full
result set is processed and the call count matches what pagination
should drive.

* perf(s3/lifecycle): stream bucket walk in two passes instead of buffering

walkBucketDir was paginating into a children slice and then iterating
twice (pass 1: .versions/, pass 2: everything else). For flat buckets
with millions of entries the buffer is a real memory spike. Drop the
materialization: each pass now drives its own listAll over the same
directory and acts on entries as they stream in. The skipBare ordering
contract is preserved — pass 2 still runs after pass 1 finishes — and
the per-pass paging keeps memory bounded by listPageSize.

Tradeoff: each directory level is listed twice. For workloads where
that matters more than the memory headroom, we can revisit; the
correctness/scale dial here is what the noncurrent rules need.

Updated three tests for the new call count: each walk now records 2
listings per directory (pass 1 + pass 2). The KickOffNew dedup tests
expect 2 calls per bucket; the pagination test expects 6 instead of 3.
2026-05-09 10:48:32 -07:00
Chris Lu
1c917ffacb fix(volume): sticky EIO quarantine; track streamed reads (#9384)
Two follow-ups on PR #9382:

1. Quarantine wasn't sticky. Once CollectHeartbeat crossed the streak
   threshold and hid the replica, a subsequent successful read called
   checkReadWriteError(nil), wiping the streak; the next heartbeat
   then re-announced the suspect replica as read-only and master could
   send reads back to a disk that already failed IoErrorTolerance.
   Added an ioErrorQuarantined sticky flag set on the first heartbeat
   that observes the threshold and cleared only by MarkVolumeWritable
   (resetIoErrorState). clearIoError continues to reset just the
   streak so successful ops don't accumulate phantom errors.

2. Streamed reads bypassed the EIO counter. readNeedleDataInto and
   ReadNeedleBlob — the hot paths for large/range GETs — returned
   ReadNeedleData / needle.ReadNeedleBlob errors without threading
   them through checkReadWriteError, so a disk failing only on those
   paths would never trip IoErrorTolerance. Both now route the
   backend error through the tracker, and a fully clean
   readNeedleDataInto call clears the streak.

Tests cover the sticky flag (TestQuarantineIsSticky) and the streamed
read path (TestReadNeedleBlobTracksEIO via a fake EIO backend).
2026-05-09 09:55:02 -07:00
Chris Lu
7c60407897 fix(volume): don't nuke local data on transient IO error (#9378) (#9382)
* fix(volume): don't nuke local data on transient IO error (#9378)

A single syscall.EIO from any read/write/delete set v.lastIoError, and
the next CollectHeartbeat then called Volume.Destroy on the replica —
removing the .dat/.idx/.vif/.sdx/.ldb/.rdb files. A brief NFS / fabric
/ controller blip hitting several replicas at once could cascade into
removal of the last healthy copy, with no recovery for non-tiered
volumes.

Now require IoErrorTolerance (3) consecutive EIOs before acting, and on
that threshold mark the volume read-only and stop announcing it to the
master so re-replication kicks in from healthy peers — never delete
the data files. The on-disk copy stays for operator inspection /
recovery.

* review: fix race, accounting, recovery, non-EIO streak break

Addressing PR #9382 review:

- Data race on lastIoError: guard lastIoError + lastIoErrorCount with a
  RWMutex and expose them through note/clear/get helpers so the
  heartbeat reader sees a consistent snapshot. Verified with -race.
- Collection-size accounting: when a volume is quarantined for sustained
  EIO, skip the entire per-volume bookkeeping (`continue`) instead of
  flipping shouldDeleteVolume — the old branch subtracted a size that
  was never added, dragging the collection gauge to zero / negative.
- Recoverability: MarkVolumeWritable now also calls clearIoError so an
  operator can rejoin a quarantined replica. The next failed op
  re-arms the streak if the disk is still bad.
- Non-EIO streak break: a non-EIO error (e.g. ENOSPC) now resets the
  consecutive-EIO counter, so a sequence EIO,EIO,ENOSPC,EIO is treated
  as a streak of one — the counter only tracks consecutive EIOs.

Reads already call checkReadWriteError (volume_read.go), so successful
reads also clear the streak — no change needed there.
2026-05-09 09:20:31 -07:00
Chris Lu
c6ad6dcf74 feat(s3/lifecycle): sole-survivor delete-marker routing (Phase 5b/2) (#9381)
* feat(s3/lifecycle): sole-survivor delete-marker routing (Phase 5b/2)

Production stores delete markers under <object>.versions/<version-file>;
buildObjectInfo skips every version-folder event because it can't tell
current from noncurrent without sibling state. EXP_DM is the one rule
that can be routed from the file path alone: if the marker is the only
entry under .versions/<key>/ then it's necessarily the latest.

Add SiblingLister; gate the listing on (versioned, version-folder path,
delete-marker entry, EXP_DM rule active for the bucket, lister supplied)
so non-marker version writes pay nothing. The Match carries the LOGICAL
key in ObjectKey and the marker's version_id in VersionID so the
dispatcher can reach deleteSpecificObjectVersion(bucket, logical, vid);
without a version_id the dispatcher would BLOCK and freeze the cursor.

Server-side dispatch re-checks sole-survivor before deleting: another
PUT can land between schedule and dispatch, and identity-CAS only covers
the marker entry, not the directory shape. Lists .versions/<key>/ with
limit 2 and returns NOOP_RESOLVED if count != 1 or the surviving entry
is not the same version.

Fix isDeleteMarkerEntry to compare string(v) == "true" — production
writes []byte("true"), not {1}; every other reader of ExtDeleteMarkerKey
in this repo uses the string predicate.

filerSiblingLister.Count caps at limit 2 — callers only distinguish
"sole survivor" (1) from "more than one" (>=2).

Follows #9373.

* fix(s3/lifecycle): gate sibling listing on active event-driven EXP_DM

hasActionKind tested key presence only; an EXP_DM rule whose action was
inactive (BootstrapComplete=false) or scan-only would still trigger the
sibling-listing RPC even though no match could fire. Replace with a
helper that mirrors the per-key filter Route already applies before
emitting matches: kind matches, snap.Action returns non-nil, IsActive,
and Mode == ModeEventDriven.

Add a regression test that builds a versioned snapshot with the rule
but no PriorStates (action stays inactive) and asserts the lister is
never consulted.

* chore(s3/lifecycle): trim verbose comments

* fix(s3/lifecycle): null version, hard-delete trigger, latest pointer

Three correctness gaps in EXP_DM routing.

1. Null version. SiblingLister.Count saw only .versions/<key>/, but a
   pre-versioning bare-key object survives outside that folder when
   versioning is enabled later. Marker + null would look like count==1
   and EXP_DM would re-expose the old object. Replace Count with
   Survivors which also reports HasNullVersion; route- and dispatch-time
   suppress when set.

2. Hard-delete trigger. The router skipped events with NewEntry==nil so
   a noncurrent hard-delete that left the marker as sole survivor never
   fired EXP_DM. Allow version-folder hard-deletes through; the lister's
   LoneEntry carries the surviving marker's version_id and identity.

3. Latest pointer. checkSoleSurvivorMarker only checked count and
   filename. The worker can race createDeleteMarker between the file
   write and the .versions/ directory metadata update. Require
   ExtLatestVersionIdKey == versionId; missing pointer returns
   retry-later instead of deleting.

Adds a null-version exists check on the dispatch path too.

* fix(s3/lifecycle): normalize null-version lookup errors and detect dir markers

Two correctness bugs in the null-version check.

Route-side: filerSiblingLister called client.LookupDirectoryEntry
directly. SeaweedList wraps not-found via filer_pb.LookupEntry, which
normalizes the gRPC string-mapped not-found into ErrNotFound. The raw
client returns it as a generic error instead, so every absent
null-version (the common case) bubbled up as an error and the router
suppressed every otherwise-valid match. Use filer_pb.LookupEntry.

Both sides: explicit S3 directory-key markers (object names ending in /)
are stored as directory entries with Attributes.Mime set;
processExplicitDirectory in the listing path treats them as null
versions. The previous check was !IsDirectory only, which let the
marker delete proceed and re-expose the directory key. Add
IsDirectoryKeyObject() to both predicates. Also use
util.NewFullPath(...).DirAndName() for the parent/name split so a
trailing-slash key resolves to the same underlying entry path as the
listing code.

* fix(s3/lifecycle): EXP_DM ctx propagation, nil-entry guard, fast-path skip

Three small follow-ups on the EXP_DM dispatch path.

checkSoleSurvivorMarker now takes ctx instead of context.Background()
so worker shutdown / deadlines cancel the SeaweedList RPC instead of
stalling.

If SeaweedList fires the lone callback with entry==nil, firstName
stays empty and the marker-replaced check would short-circuit; that's
the one shape that bypasses the dispatch guard, so retry-later instead.

routeSoleSurvivorMarker now skips the Survivors RPC on regular
non-marker version creates — those always have Count >= 2, so the
listing was wasted load on every versioned write under an EXP_DM rule.
Hard-delete events (NewEntry==nil) and marker creates still flow
through. Added a regression test asserting the regular-create case
doesn't consult the lister.

Documented that logicalKeyFromVersionPath rejects bucket-root markers
intentionally.
2026-05-08 23:24:08 -07:00
Chris Lu
196c41d21a test(s3/lifecycle): cover scheduler/bootstrap walker + MPU detection (#9380)
Locks down isMPUInitDir, walkBucketDir (regular/nested/MPU/error/cancel
paths), and BucketBootstrapper.KickOffNew (per-bucket fanout and
in-process dedup) against a fake SeaweedFilerClient.
2026-05-08 22:14:45 -07:00
Chris Lu
ee1d8f9e8c refactor(s3api): drop filer.conf TTL routing from PUT lifecycle (#9379)
PutBucketLifecycleConfiguration used to install /buckets/<bucket>/<prefix>
day-TTL entries in filer.conf so the volume server's RocksDB compaction
filter would expire matching writes. With 9377 the s3api server now stamps
volume TTL per-write via LifecycleTTLResolver off the stored XML, which
covers the same prefix-only Expiration.Days subset and additionally
handles size filters and AWS overlapping-rule precedence. Maintaining
both paths means a rule change has to mutate two stores in lockstep, and
the filer.conf path can't represent everything the resolver does.

Drop the add path. Keep a one-way cleanup loop so an upgrade still wipes
day-TTL entries written by older builds — otherwise a stale entry would
silently double-stamp writes (volume server expires under the old rule)
or contradict the new XML after a rule change.

Also removes resolveLifecycleDefaultsFromFilerConf (no longer needed) and
the versioning-fast-path guard (the resolver itself returns nil for
versioned/object-lock buckets, covered by
TestNewLifecycleTTLResolver_NilOnVersionedBucket). Tests covering the
deleted helpers are deleted with them; the GET fallback that synthesizes
lifecycle rules from existing filer.conf TTLs is unchanged so users who
historically configured TTL via filer.conf directly still see a rule.
2026-05-08 21:54:39 -07:00
Chris Lu
2458f6c81c feat(s3api): apply lifecycle TTL at write time (#9377)
* feat(s3api): apply lifecycle TTL at write time

The S3 server already has the bucket's lifecycle XML at PUT time (via
the cached BucketConfig), so volume-TTL routing is just a per-write
decision instead of something that needs a separate filer.conf
projection kept in sync via operator commands.

- BucketConfig caches the canonical Rules parsed from the lifecycle
  XML once on load (BucketConfigCache invalidates on Put/Delete
  Lifecycle, so the rules stay current automatically).
- resolveLifecycleTTLForWrite walks the cached rules: longest-prefix
  match, applies tag and size filters against the request, returns
  Days * 86400. Versioned buckets, non-Expiration.Days rules, and
  unevaluable size filters (no Content-Length) yield 0 — the
  lifecycle worker handles those at scan time.
- putToFiler resolves TTL once and passes it through both the
  AssignVolumeRequest (so chunks land on a TTL volume) and the new
  entry's Attributes.TtlSec (so the filer's RocksDB compaction also
  expires the metadata).

Lifecycle XML PUT/DELETE now influences write routing immediately —
no operator command, no filer.conf bookkeeping. The lifecycle worker
remains authoritative for the cases the fast path can't cover (existing
objects via bootstrap, versioned buckets, noncurrent retention,
abort-MPU, tag/size filters that didn't hold at PUT time).

CompleteMultipartUpload and CopyObject still need wiring; left for
follow-ups so this PR stays scoped.

* perf(s3api): pre-filter and sort lifecycle rules for the per-PUT TTL walk

resolveLifecycleTTLForWrite walked every lifecycle rule on every
PutObject, including disabled / non-Expiration.Days rules that could
never fire on the fast path, and computed "longest prefix wins" via a
running max instead of an early exit.

Cache a pre-filtered + pre-sorted slice in BucketConfig:
- buildTTLFastPathRules drops everything except Status=Enabled +
  ExpirationDays>0;
- sorts by descending prefix length (stable, so equal-length rules
  keep their XML order).

The resolver returns on first prefix+filter match. A bucket whose
lifecycle XML has no Expiration.Days rules is now O(1); a typical
bucket with one Expiration.Days rule walks one HasPrefix per PUT.

The cache is built once per bucket-config load. PutBucketLifecycle /
DeleteBucketLifecycle already invalidate the cache, so the fast-path
slice stays current automatically.

* refactor(s3api): LifecycleTTLResolver object + four review fixes

Pulls the per-PUT TTL resolution into a dedicated type so the bucket
config holds one object instead of a slice + magic-walk function:

- LifecycleTTLResolver wraps the pre-filtered, pre-sorted rules.
  nil-safe Resolve so the call site doesn't have to special-case
  buckets with no eligible rules.

Four review findings:

1. (high) drop tag-filtered rules from the fast path. Tags are mutable
   post-PUT via PutObjectTagging but volume TTL is irreversible — an
   object that matched at write time would still expire after the tag
   was removed. Worker re-evaluates current tags at scan time. Fast
   path now keeps only stable predicates: prefix and size.

2. (high) move TTL resolution out of putToFiler. MPU parts, copy-part
   destinations, and other transient writes called putToFiler with
   object="" — bucket-wide rules (empty Prefix) matched and bound a
   TTL clock starting at part-upload time, before
   CompleteMultipartUpload existed. putToFiler now takes an explicit
   ttlSec parameter; only the user-visible PutObject paths
   (PutObjectHandler, postpolicy) feed it from the resolver. MPU and
   copy-part pass 0.

3. (medium) AWS overlapping-rule precedence is "shorter expiration
   wins", not "longest prefix wins". Sort by ExpirationDays ascending
   so the first prefix match is also the shortest applicable rule.

4. (medium) overflow no longer caps at math.MaxInt32 seconds (~68y).
   A longer policy would have expired early. Return 0 instead so the
   worker enforces the actual policy on its own schedule.

Versioning gate moves into the resolver constructor — versioned
buckets get a nil resolver. The five putToFiler callers all updated:
PutObjectHandler + postpolicy resolve via lifecycleTTLForObjectWrite,
suspended/versioned wrappers pass 0 by construction, MPU part and
copy-part SSE pass 0 with a one-line comment about why.

* refactor(s3api): drop unused BucketConfig.LifecycleRules field

The full canonical rule set was set on every bucket-config load but
never read — resolveLifecycleTTLForWrite worked off the resolver's
filtered slice, and the lifecycle worker reads bucket entries straight
off the meta-log instead of this cache. Remove the field and its
s3lifecycle import.

* perf(s3api): pre-compute LifecycleTTLResolver hot-path fields

Resolve was doing per-call work that's actually constant per bucket-
config load: int64 multiplication, max-int32 overflow check, field
indirections through *s3lifecycle.Rule. Move it to the constructor
and pack the rule into a compact ttlRule (prefix + ttlSec int32 +
sizeGT/sizeLT) so the inner loop is HasPrefix → optional size check
→ return.

Drop overflowing rules at construction rather than handling per-
resolve: capping would expire long policies early, and returning 0
in the inner loop would prevent any shorter overlapping rule from
firing. Drop-at-construction composes correctly with the ascending
sort.

Benchmarks (Apple M4):
  NilReceiver           0.99 ns/op   0 B/op
  OneRuleMatching       2.75 ns/op   0 B/op
  FiveRulesNoMatch     13.5  ns/op   0 B/op

* fix(s3api): refresh LifecycleTTL resolver on bucket-config update

storeBucketLifecycleConfiguration writes to Entry.Extended via
updateBucketConfig, which clones the cached BucketConfig and calls
the user fn, then caches the result. The clone inherits the prior
LifecycleTTL pointer and nothing rebuilt it from the new XML, so
add/replace/delete of a lifecycle policy left the wrong resolver in
cache until eviction. Same gap on the meta-log side: peer-driven
updates flowed through updateBucketConfigCacheFromEntry without
re-deriving the resolver.

Centralize the Entry -> derived-field mapping in one helper that
resets every Extended-backed field then repopulates from the entry,
and call it from getBucketConfig (initial load), updateBucketConfig
(after updateEntry succeeds, before caching), and
updateBucketConfigCacheFromEntry (meta-log path). Reset is the
load-bearing part: deleting the lifecycle XML must yield a nil
resolver, since stamping a stale TTL onto subsequent writes is
irreversible.

* fix(s3api): PostPolicy passes object size, not multipart wire size

lifecycleTTLForObjectWrite was reading r.ContentLength, which on the
PostPolicy path is the multipart envelope (form fields + boundaries),
not the uploaded object body. A size-filtered rule would evaluate
against that inflated total and stamp (or skip) a TTL the policy
didn't intend.

Take the object size as an explicit parameter. PutObject still passes
r.ContentLength (correct there); PostPolicy passes the fileSize
already extracted from the form part. Negative size means unknown
and continues to skip any size-filtered rule.

* fix(s3api): treat Object Lock as versioned for lifecycle TTL fast path

Object Lock requires versioning at the API level, but it can be
enabled at create time without S3 ever writing the explicit
Versioning header. The lifecycle resolver construction site only
checked Versioning, so an Object-Lock bucket with no Versioning byte
would still get a fast-path resolver and stamp volume TTL onto
writes — destroying noncurrent versions when the volume expires.

Mirror the OR already used in BucketIsVersioned: ObjectLockConfig
non-nil counts as versioned for resolver construction. Existing
explicit-Versioning paths are unchanged.
2026-05-08 21:35:27 -07:00
Chris Lu
e55db58ca9 feat(s3/lifecycle): expose Prometheus metrics (Phase 7) (#9375)
* feat(s3/lifecycle): expose Prometheus metrics (Phase 7)

Five new gauges/counters under the s3_lifecycle subsystem so operators
can see what the worker is doing without grepping logs:

- dispatch_total{bucket,kind,outcome} — every LifecycleDelete RPC
  bumps this. Outcome is the proto enum name (DONE, NOOP_RESOLVED,
  RETRY_LATER, BLOCKED, …) plus a synthetic "RPC_ERROR" for transport
  failures classified as RETRY_LATER.
- schedule_depth{shard} — pending matches in each shard's schedule,
  sampled on the dispatcher tick.
- cursor_min_ts_ns{shard} — per-shard min cursor timestamp; lag is
  derived as (now - min) by the scrape side.
- events_total{shard} — meta-log events the reader fed to the router.
- bootstrap_dispatch_total{bucket,kind} — bootstrap-walk dispatches.

Test asserts the dispatch counter increments for both DONE and
RPC_ERROR paths.

* fix(stats): purge lifecycle bucket label series in DeleteBucketMetrics

The two new bucket-labeled lifecycle counters
(S3LifecycleDispatchCounter, S3LifecycleBootstrapDispatchCounter)
weren't included in DeleteBucketMetrics, so explicit bucket teardown
left their label series behind — same cardinality leak the existing
counters above already avoid. Tack them onto the same DeletePartialMatch
chain.
2026-05-08 17:49:10 -07:00
Chris Lu
05d31a04b6 fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite

The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.

Three changes that work together:

- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
  the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
  the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
  tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
  Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
  land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
  -runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
  shell command runs the full pipeline (reader + scheduler + dispatcher) for the
  duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
  out for now — versioned-bucket expiration via the worker still needs its own
  pass.

Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (8b87ceb0d updated `mini_plugin_test.go` for the s3_lifecycle
plugin but missed this twin test).

Two retention-gate engine tests `t.Skip` under the s3tests build because they
rely on absolute lookback-vs-retention math the day-rescale collapses; the prod
build still covers them.

* review: harden lifecycle worker spawn + assert handler identity

- Workflow: aliveness check on the backgrounded `weed shell` (a bad command
  exits in <1s and the suite would otherwise just opaque-timeout); move
  worker/server teardown into a `trap cleanup EXIT` so failure paths still
  print the worker log and reap the data dir.
- worker_test: check the actual job-type set by name, not just the count.

* fix(shell): keep s3.lifecycle.run-shard alive when no rules exist yet

The s3-tests CI runs the worker BEFORE any test creates a bucket, so
LoadCompileInputs returns empty and the shell command was bailing out
with "no buckets with enabled lifecycle rules found" within ~1s. The
aliveness check then fired exit 1 before tox ever started.

Two changes:

- Don't early-exit on empty inputs. Compile against the empty set, log a
  one-liner, and let the pipeline run normally — the meta-log subscription
  is already up, so events for buckets created later DO arrive; they just
  need the engine to know about them when they do.
- Add `-refresh <duration>` (default 5m, 2s in s3tests CI) that
  periodically re-runs LoadCompileInputs + engine.Compile so rules added
  after startup land in the snapshot the dispatcher reads on its next
  tick. Production deployments keep the 5m default; only the CI workflow
  drops to 2s.

Workflow passes `-refresh 2s` in both basic and SQL blocks.

* fix(shell): backfill pre-rule entries via bootstrap walker

The reader-driven path only sees meta-log events created AFTER its
engine snapshot knows the rule. The s3-tests CI scenario PUTs objects
first, then PUTs the lifecycle config, so by the time the engine
refresh picks up the new bucket the object events have already been
seen-and-dropped (BucketActionKeys returned empty for the bucket).

Wire bootstrap.Walk into the shell command:

- bucketBootstrapper tracks buckets seen so far. kickOffNew spawns one
  loop goroutine per fresh bucket.
- Each goroutine re-walks the bucket every walkInterval (defaults to
  the same value as -refresh, i.e. 2s in s3tests CI, 5m in prod) and
  feeds each entry through bootstrap.Walk; due actions dispatch via a
  direct LifecycleDelete RPC. Not-yet-due entries are silently skipped
  and picked up on a later iteration once they age past their (rescaled
  or real) threshold.
- LifecycleDelete is called with no expected_identity; the server-side
  identityMatches treats nil as "skip CAS", which is the right call
  for bootstrap (the bootstrap entry doesn't carry chunk fid /
  extended hash anyway).

The dispatcher's pkg-private toProtoActionKind is duplicated in the
shell file rather than exported, since the shape is six lines and the
reverse import would pull a proto dep into the s3lifecycle root.

* refactor(s3/lifecycle): hoist bucket bootstrapper into scheduler pkg

The shell command got the backfill in the previous commit but the worker
plugin (weed/worker/tasks/s3_lifecycle/handler.go) drives Scheduler.Run
directly and missed it — same root cause: the reader-driven path only
sees events created after the rule lands, so a daily cron picking up a
freshly-PUT rule wouldn't expire any pre-rule object.

Move the looping bucket walker into scheduler.BucketBootstrapper:

- Scheduler.Run now constructs one and calls KickOffNew on every engine
  refresh. Per-bucket goroutines re-walk every BootstrapWalkInterval
  (defaults to RefreshInterval — 5m in prod, 2s under s3tests).
- The shell command consumes the same struct instead of its own copy
  so the two paths can't drift in semantics.

* refactor(s3/lifecycle): walk-once + schedule via event injection

Previous per-bucket walker re-listed every WalkInterval forever. For a
bucket with N objects under a long rule, the worker did O(N * runtime /
walkInterval) listings even when nothing was newly due — way too much
for production-scale buckets.

New approach: walk each bucket exactly once on first sight, synthesize
one *reader.Event per existing entry, push it onto Pipeline.events.
Router.Route builds a Match with DueTime=mtime+delay; future-due matches
sit in the per-shard Schedule and fire when their DueTime arrives.
Currently-due matches fire on the very next dispatch tick.

Wiring:

- dispatcher.Pipeline lifts its events channel into a struct field
  with sync.Once init, and exposes InjectEvent(ctx, ev). Reader no
  longer closes the channel — the dispatch goroutine exits on runCtx
  cancellation, which works the same as channel-close did.
- scheduler.BucketBootstrapper drops the WalkInterval ticker. KickOffNew
  spawns one walker goroutine per fresh bucket; the goroutine lists,
  synthesizes events, then exits.
- scheduler.Scheduler builds its pipelines up front and exposes a
  pipelineFanout (shard -> Pipeline) as the EventInjector, so a multi-
  worker scheduler routes each synthesized event to the pipeline that
  owns its shard.
- Shell command's single-pipeline path passes pipeline.InjectEvent
  directly.

Synthesized events carry TsNs=0; dispatcher.advance treats that as a
no-op so the reader's persisted cursor isn't ratcheted past unprocessed
meta-log events. Identity (HeadFid + ExtendedHash) is still computed
from the real filer entry, so the server's identity-CAS catches an
overwrite between bootstrap and dispatch.

* debug(s3tests): make lifecycle worker progress visible in CI logs

The previous CI failure dumped an empty $LC_LOG even though the worker
was running. Two reasons:

1. weed shell suppresses glog by default (logtostderr / alsologtostderr
   set to false). Pass `-debug` so the bootstrapper's V(0) lines reach
   stderr instead of disappearing into /tmp/weed.*.log.
2. cleanup used `kill -9` which skips Go's stdout flush. SIGTERM first
   with a 1s grace, then SIGKILL the holdout, then read the log.

While here: bump the bootstrap walker's two informational logs to V(0)
so the diagnosis from CI doesn't require -v=1 on the worker.

* fix(s3/lifecycle/dispatcher): refresh snap on every event

Pipeline.Run captured snap at startup and only refreshed it on the
dispatch tick. With bootstrap event injection, the walker pushes events
seconds after engine.Compile sees the bucket — typically WITHIN the
same dispatch interval. Routing against the cached (empty) snap then
silently dropped every match because BucketActionKeys returned nil for
the bucket-not-yet-in-snapshot case.

Re-fetch on each event. Engine.Snapshot is an atomic.Pointer.Load, so
the cost is negligible. The dispatch-tick branch keeps using a fresh
local read for its own loop, so its semantics are unchanged.
2026-05-08 17:29:47 -07:00
Chris Lu
159cfc97ce feat(s3/lifecycle): classify versioned events by storage path (Phase 5b/1) (#9373)
* feat(s3/lifecycle/router): classify versioned events by storage path

Phase 5b first slice. Pass the bucket's Versioned flag from the engine
snapshot into buildObjectInfo and:

- Recognize <key>.versions/<vid> events as noncurrent versions.
  IsLatest=false, info.Key strips the .versions/<vid> suffix so a
  rule's Filter.Prefix matches the user's logical key, and the
  AWS-visible version_id rides on Match.VersionID for the dispatcher
  to target a single version on the server.
- Read IsDeleteMarker from Extended unconditionally — the engine
  rejects ExpiredObjectDeleteMarker when NumVersions != 1, so without
  sibling listing the marker case stays correctly suppressed (a
  separate PR will add the listing).
- Non-versioned buckets keep the existing behavior even when an
  object literally named "*.versions/v1" exists; Versioned=false
  short-circuits the path classification.

Time-based NoncurrentDays now fires on noncurrent events.
NewerNoncurrent and ExpiredObjectDeleteMarker still need sibling
listing — left for a follow-up.

* fix(s3/lifecycle/router): require ExtVersionIdKey to confirm noncurrent

Path classification alone misclassifies a literal-key collision: a
versioned bucket holding an object with key "logs/backup.versions/2023"
would be flagged noncurrent and have its key stripped to "logs/backup",
losing the user's actual rule-prefix-matching path. SeaweedFS doesn't
reserve the .versions/ segment, so the path shape is necessary but
not sufficient.

Add an authoritative confirmation: the entry must declare the same
version_id via ExtVersionIdKey (the field SeaweedFS sets when storing
a tracked version). Also reject idx==0 paths so ".versions/<vid>"
can't yield an empty logical key.

Tests:
- collision: versioned bucket + .versions/ in literal key + no
  metadata (and the mismatched-vid variant) → still classified as a
  current-version object;
- root-versions: .versions/v1 (idx==0) → treated as a regular key;
- existing noncurrent test now sets ExtVersionIdKey to mirror the
  storage shape.

* fix(s3/lifecycle/router): skip versioned-bucket version-folder events

The previous attempt tried to classify <key>.versions/<vid> events as
noncurrent versions by storage path. That's broken on three counts:

- SeaweedFS stores version files as v_<id> (getVersionFileName), so
  comparing the path suffix to the raw ExtVersionIdKey never matches.
- The "current latest" version on a versioned bucket lives at the
  same .versions/v_<id> path shape as noncurrent versions; the latest
  pointer is on the parent .versions/ directory's
  Extended[ExtLatestVersionIdKey], which the router doesn't see.
- Even with a correct vid match, IsLatest=false plus the storage path
  as ObjectKey would have the dispatcher recompose <storagepath>.versions/v_<id>
  and no-op (or worse, target the wrong file).

Until we route from .versions/ directory pointer-transition events
(or supply IsLatest/SuccessorModTime/index from sibling listing),
skip every event under a *.versions/ folder. Bare-key events (null
versions) still route normally; bootstrap walking covers the
versioned-storage cases.

Tests assert the skip across tracked, literal-collision, and
bucket-root .versions paths.

* feat(s3api): refuse noncurrent-kind delete on the current latest version

Defense-in-depth for the noncurrent kinds: even when bootstrap (or a
future event-driven path) thinks a version is noncurrent, the server
must verify against the .versions/ directory's
Extended[ExtLatestVersionIdKey] before deleting. If the target version
matches the latest pointer the action is silently dropped as
NOOP_RESOLVED:VERSION_IS_LATEST instead of deleting the live data.

* refactor(s3/lifecycle): tidy versioning gates per review

- router: skip directory entries (other than MPU init) in
  buildObjectInfo so .versions/ folder events never become
  ObjectInfo. Subtest "versions dir itself" added.
- s3api: switch isCurrentLatestVersion's path split from
  filepath.Split (OS-dependent) to path.Split so filer paths
  always use '/'.
2026-05-08 14:15:32 -07:00
Lars Lehtonen
935fb42e1d chore(weed/util/chunk_cache): remove unused functions (#9372)
* chore(weed/util/chunk_cache): remove unused functions

* fix(chunk_cache): bound ReadAt buffer in readNeedleSliceAt

When the caller-provided buffer is larger than the remaining needle
bytes, ReadAt would spill into the next needle and trigger the
n != wanted error. Slice to data[:wanted] so the read stops at the
needle boundary.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-05-08 13:12:11 -07:00
Chris Lu
fd463155e4 fix(ec): planner treats each (server, disk_id) as a distinct target (#9369) (#9371)
* fix(ec): planner treats each (server, disk_id) as a distinct target (#9369)

master_pb.DataNodeInfo.DiskInfos is keyed by disk type, so a volume
server with multiple physical disks of the same type collapses into a
single DiskInfo. Per-disk attribution survives only inside the
VolumeInfos[].DiskId / EcShardInfos[].DiskId records, and the active
topology never put it back together. The EC planner saw N candidates
instead of N×disks, returned a short plan, and createECTargets
round-robined extra shards onto the same (server, disk_id) — colliding
with the #9185 disk_id-aware ReceiveFile.

Reconstruct per-physical-disk view in UpdateTopology by splitting each
DiskInfo into one entry per observed disk_id, and index volumes / EC
shards by their own DiskId so lookups stay aligned. Refuse to plan an
EC task when fewer than totalShards distinct disks are available rather
than packing shards onto the same disk.

Threads dataShards/parityShards through planECDestinations,
createECTargets and createECTaskParams so the helpers don't depend on
the OSS 10+4 constants — keeps enterprise merges clean.

* trim verbose comments

* align EC param signatures with enterprise

- dataShards/parityShards: uint32 → int (matches enterprise's ratio API)
- drop unused multiPlan from createECTaskParams
- minTotalDisks: total/parity+1 → ceil(total/parity), correct for non-default ratios

Reduces merge surface when this PR lands in seaweed-enterprise.
2026-05-08 12:59:02 -07:00
Chris Lu
194dce27bf fix(mount): preserve user-set mtime through async/periodic flush (#9363) (#9370)
* fix(mount): preserve user-set mtime through async/periodic flush (#9363)

flushMetadataToFiler and flushFileMetadata both stamped time.Now() onto
the entry before sending it to the filer, clobbering any mtime SetAttr
had stored from utimes()/touch -m -d. The reproducer hit this ~1s after
touch because the writebackCache deferred close from the prior write
ran flushMetadataToFiler after the user's utimes call.

Flush has no business inventing timestamps. Move the write-time stamp
into Write (where it always belonged for POSIX correctness) and let
flush persist whatever Write or SetAttr already put on the entry.

* test(mount): tighten mtime regression test, drop tautological one

- userMtime now has non-zero nanoseconds, so the *Ns assertions catch a
  regression that would zero the field.
- Add CtimeNs assertion (was missing).
- Drop TestWriteStampsEntryMtime: it duplicated the implementation it
  was supposed to test, so a regression in Write would not have failed
  it. Driving the real Write path needs a full PageWriter, which is out
  of scope for this fix; TestFlushFileMetadataPreservesUserMtime is the
  meaningful regression for #9363.
2026-05-08 12:37:23 -07:00
Chris Lu
89aab30821 feat(s3/lifecycle): wire AbortIncompleteMultipartUpload (Phase 5a) (#9368)
* feat(s3/lifecycle/router): emit ABORT_MPU events for .uploads/<id> init dirs

Detect a meta-log event at exactly .uploads/<upload_id> (a directory)
and build the ObjectInfo from its destination key (entry.Extended[key])
so a rule with Filter.Prefix=foo/ matches an MPU uploading to foo/bar.
Sub-events under .uploads/<id>/<part> ride a different mtime and would
over-fire the ABORT_MPU schedule, so they're rejected explicitly.

m.ObjectKey stays as ev.Key (.uploads/<upload_id>) — the dispatcher
needs the upload directory path, not the destination key, to actually
remove the in-flight upload.

* feat(s3api): wire LifecycleDelete ABORT_MPU to remove the upload dir

Replaces the retryLater stub. Validates the .uploads/<upload_id> shape
of req.ObjectPath (so a malformed event can't escalate to a wider rm),
then deletes the upload directory under <bucket>/.uploads/<id>. Maps
NotFound to NOOP_RESOLVED, transport errors to RETRY_LATER, success to
DONE.

* refactor(s3api): drop redundant exists check before lifecycle ABORT_MPU rm

s3a.rm already does a NotFound-returning lookup, so the pre-check just
adds a round-trip. Map filer_pb.ErrNotFound to NOOP_RESOLVED on rm,
keep transport errors as RETRY_LATER.

* refactor(s3/lifecycle/router): use s3_constants for MPU paths + Extended key

Drop the hardcoded ".uploads/" and "key" string literals; the symbols
already exist as s3_constants.MultipartUploadsFolder and
ExtMultipartObjectKey, and the server side reaches them through the
same constants. Keeping the test helpers tied to those names also makes
the negative-result tests meaningful — they'd otherwise still pass if
the lookup constant drifted.

* fix(s3api): close lifecycle ABORT_MPU traversal + NOT_FOUND gaps

Two issues with the recent ABORT_MPU plumbing:

- "." and ".." passed the no-slash check but resolve to the bucket root
  via util.JoinPath, so .uploads/.. could rm the wrong directory.
- filer.DeleteEntry suppresses ErrNotFound and returns success, so the
  rm path can't distinguish missing from deleted; the previous version
  reported DONE for an already-aborted upload instead of NOOP_RESOLVED.

Reject the two reserved names explicitly and restore the existence
pre-check so the outcome map stays correct. Add a table-test covering
the rejected paths.

* fix(s3/lifecycle/bootstrap): walk MPU init dirs by destination key

A real MPU init record is a directory under .uploads/<id> created by
mkdir; the bootstrap walker was skipping every directory entry, so an
MPU that existed before the meta-log subscription was never aborted.
Even with the skip relaxed, MatchPath used the .uploads/<id> path, so
a rule with Filter.Prefix=logs/ would never fire on an MPU uploading
to logs/foo.txt.

Add Entry.DestKey, let IsMPUInit directories through, and use DestKey
for both MatchPath and ObjectInfo.Key. A bare init directory with no
DestKey means metadata hasn't landed yet — skip rather than guess.

* fix(s3/lifecycle): gate (kind, info) shape so MPU init only fires ABORT_MPU

An MPU init record carries IsMPUInit=true and IsLatest=false. Without
gating, the router and bootstrap walker matched it against every active
ActionKey for the bucket, so NONCURRENT_DAYS / NEWER_NONCURRENT fired
(IsLatest=false reads as a noncurrent version). The dispatcher would
then BLOCK on empty version_id and freeze the cursor.

Add a shape gate at both call sites:
  - IsMPUInit + non-ABORT_MPU kind → continue
  - regular object + ABORT_MPU kind → continue

Plus a defense-in-depth check at the top of EvaluateAction so future
callers can't reintroduce the bug. Tests cover all three layers.

* test(s3/lifecycle): tighten dual-action coverage at the call sites

- Walk multi-action: replace the kinds-as-set check with an exact-shape
  DeepEqual on (path, kind) tuples. The set check would have missed an
  MPU init wrongly firing NONCURRENT_DAYS — exactly the regression the
  (kind, info) gate fixes.
- Router: add a converse case for the dual ExpirationDays +
  AbortIncompleteMultipartUpload rule. A regular current-version object
  must fire only EXPIRATION_DAYS; without the gate the dispatcher would
  also receive ABORT_MPU and rm the object via the MPU code path.
2026-05-08 12:12:42 -07:00
Chris Lu
8b87ceb0d1 refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration (Phase 4) (#9367)
* refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration

The handler used to walk every existing entry under the rule's prefix
and stamp entry.Attributes.TtlSec + the SeaweedFSExpiresS3 flag so that
the filer's compaction filter would expire them. With the event-driven
lifecycle worker live, that retroactive walk is redundant — the worker
drives expiration off the meta-log and a one-time bootstrap scan, so a
PUT lifecycle stays O(rules) instead of O(objects).

New writes still inherit TTL from the filer.conf location entry above;
that volume-routing path is unchanged here and will move to an explicit
operator command later (Phase 11).

Drops updateEntriesTTL + processDirectoryTTL + processTTLBatch +
updateEntryTTL from filer_util.go.

* fix(s3api): clear stale lifecycle TTL entries on PUT

PutBucketLifecycleConfiguration only ever appended/updated filer.conf
entries — it never cleared ones the operator removed, renamed-prefix on,
disabled, retagged with a tag filter, or bucket-versioned out of the
fast path. The stale day-TTL kept routing new writes (and would expire
old ones if any landed under the prefix) after the policy was updated.

Treat PUT as a full replacement: walk this bucket's existing day-TTL
entries, clear them, then add fresh entries from the new rule set.

* test(command): bump mini default plugin job-type count to 7

The s3_lifecycle plugin handler registered in #9362 is the seventh
default; the test still asserted six.

* fix(s3api): delete stale lifecycle PathConf instead of blanking Ttl

Just clearing pathConf.Ttl leaves the rule's Collection, Replication,
and VolumeGrowthCount in place, so new writes still match the stale
prefix and inherit outdated routing/placement. Use
fc.DeleteLocationConf so the lifecycle-owned PathConf goes away
entirely. Same fix in DeleteBucketLifecycleHandler, which had the
same bug.
2026-05-08 11:03:03 -07:00
Chris Lu
5d43f84df7 refactor(plugin): rename detection_interval_seconds → detection_interval_minutes (#9366)
Minutes is the natural granularity for detection cadence — every
production handler already set the seconds field to a 60-multiple
(17*60, 30*60, 3600, 24*60*60). Switching to minutes drops the *60
arithmetic and matches the unit conventions used elsewhere in the
plugin worker forms.

- Proto: AdminRuntimeDefaults + AdminRuntimeConfig.detection_interval_*
  field renamed.
- Helpers: durationFromMinutes / minutesFromDuration alongside the
  existing seconds variants in plugin_scheduler.go.
- Handlers: vacuum, ec_balance, balance, erasure_coding, iceberg,
  admin_script, s3_lifecycle now declare DetectionIntervalMinutes.
- Admin: scheduler_status + types + UI templ + plugin_api.go pass
  through the new field; UI label and table cells switch to "min".
2026-05-08 10:33:02 -07:00
Chris Lu
7f254e158e feat(worker/s3_lifecycle): plugin handler with admin UI config (#9362)
* feat(s3/lifecycle): scheduler — N pipelines over an even shard split

Scheduler.Run spawns Workers Pipeline goroutines plus one engine-refresh
ticker. Each worker owns a contiguous AssignShards(idx, total) slice of
[0, ShardCount) and runs Pipeline.Run with EventBudget bounding each
iteration; brief RetryBackoff between iterations avoids hot-loop on
errors. The refresh ticker rebuilds the engine snapshot from the filer's
bucket configs every RefreshInterval.

LoadCompileInputs / IsBucketVersioned / AllActivePriorStates are
exported from a configload.go sibling so the shell command can move to
this shared implementation in a follow-up.

* refactor(shell): reuse scheduler.LoadCompileInputs in run-shard

Drop the local copies of loadLifecycleCompileInputs / isBucketVersioned
/ allActivePriorStates / lifecycleParseError that the new
scheduler package now exports. Same behavior, one source of truth.

* feat(worker/s3_lifecycle): plugin handler with admin UI config

Registers a JobHandler for s3_lifecycle via pluginworker.RegisterHandler.
Admin pulls the descriptor over the worker plugin gRPC and renders the
AdminConfigForm + WorkerConfigForm in the existing UI:

  Admin form (cluster shape):
    - workers (1..16, default 1)
    - s3_grpc_endpoints (comma list)

  Worker form (operational tuning):
    - dispatch_tick_ms (default 5000)
    - checkpoint_tick_ms (default 30000)
    - refresh_interval_ms (default 300000)
    - event_budget (default 0 = unbounded)

Detect emits a single proposal whenever S3 endpoints + filer addresses
are configured. MaxExecutionConcurrency=1 so admin only ever runs one
lifecycle daemon per worker; a fresh proposal next cycle restarts it
if the prior Execute exits.

Execute dials the configured S3 endpoint + filer, builds a
scheduler.Scheduler with the parsed config, and runs it until
ctx cancellation. Reuses the existing scheduler / dispatcher /
reader / engine packages — the handler is the thin glue that
parses descriptor values and wires the long-running daemon.

* proto(plugin): add s3_grpc_addresses to ClusterContext

So workers can dial s3 servers discovered by the master rather than a
hand-typed list in the admin form.

* feat(admin): populate ClusterContext.s3_grpc_addresses from master

ListClusterNodes(S3Type) returns the live S3 servers; the plugin
scheduler now hands these to job handlers alongside filer/volume
addresses.

* feat(worker/s3_lifecycle): discover s3 endpoints from cluster context

Drop the s3_grpc_endpoints admin form field and read the master-supplied
ClusterContext.S3GrpcAddresses instead. Operators no longer maintain a
hand-typed list, and a stale entry self-heals when the master's view
updates.

* feat(worker/s3_lifecycle): time-based runtime cap, friendlier cadence units

- dispatch_tick_minutes (was *_ms): minutes is the natural granularity
  for a daily batch; default 1 minute.
- checkpoint_tick_seconds: seconds for the durable cursor write; default
  30 seconds.
- refresh_interval_minutes: minutes for the engine snapshot rebuild.
- max_runtime_minutes replaces event_budget. Each daily run is bounded
  by wall clock — typical run wraps in well under an hour because the
  cursor persists and the meta-log streams fast. Default 60 minutes.
- AdminRuntimeDefaults.DetectionIntervalSeconds = 86400 so the admin
  schedules one job per day.
2026-05-08 10:30:02 -07:00
Chris Lu
85abf3ca88 feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch

Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash)
shard, routes events through the compiled lifecycle engine, and dispatches
due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard
cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume.

Operator-runnable harness for end-to-end Phase 3 validation while the
plugin-worker auto-scheduler is still pending. EventBudget bounds a single
invocation; flags expose dispatch + checkpoint cadence.

Discovers buckets by walking the configured DirBuckets path and reading
each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml]
through lifecycle_xml.ParseCanonical. All compiled actions are seeded
BootstrapComplete=true so the run dispatches whatever fires immediately;
production bootstrap walks set this incrementally per bucket.

* test(s3/lifecycle): integration test driving the run-shard shell command

Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix,
PUTs the target object, then rewrites the entry's Mtime via filer
UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every
shard via 'weed shell' subprocess and asserts the backdated object is
deleted within 30s, and the in-prefix-but-recent object remains.

The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable.
Backdating via the filer's gRPC sidesteps that constraint while still
exercising the real Reader -> Router -> Schedule -> Dispatcher ->
LifecycleDelete RPC path end-to-end.

Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs
all 16 shards because ShardID(bucket, key) is hash-based and the test
shouldn't couple to that detail; running every shard keeps the test
independent of the hash function.

* fix(shell/s3.lifecycle.run-shard): address review findings

- Reject negative -events explicitly. Help text already defines 0 as
  unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background()
  so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList
  takes a single-RPC limit; the prior 4096 silently dropped buckets
  past that page on large clusters. Loop with startFrom until a page
  comes back short.
- Surface parse errors instead of swallowing them. Buckets with
  malformed lifecycle XML now print the first three errors verbatim
  and a count for the rest, so an operator running this command for
  diagnostics can find what's wrong.

* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription

Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads
the same model through Reader and Pipeline.

- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs;
  ShardID stays for the single-shard short form. Event carries the
  computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one
  Cursor + Schedule + Dispatcher per shard, opens one filer
  SubscribeMetadata stream with a predicate covering the whole set,
  and routes events into the matching shard's schedule from a single
  dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive),
  formats progress messages with a contiguous-range label when
  applicable, and validates against ShardCount.

Integration test now uses -shards 0-15 (one subprocess invocation)
instead of a 16-iteration loop.

* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil

The reader rejected the legitimate 'fresh subscription from epoch'
state when called from a fresh Pipeline.Run on a multi-shard worker
(no cursor file yet, all shards' MinTsNs=0). The downstream
SubscribeMetadata call handles SinceNs=0 fine; the up-front check
was over-defensive and broke the auto-scheduler completely (CI
showed 5-second-cadence retries with this exact error).

* fix(s3/lifecycle): schedule from ModTime not eventTime

A backdated or out-of-band entry update has eventTime ≈ now while
ModTime is far in the past; eventTime+Delay would push the dispatch
into the future even though the rule already fires. ModTime+Delay
is the correct fire moment. The dispatcher's identity-CAS still
catches drift between schedule and dispatch.

* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards

The CI integration test sets -events 200 expecting the subprocess to
return after 200 in-shard events. But -events counts only events that
pass the shard filter; the test produces ~5 such events (bucket
create, lifecycle PUT, two object PUTs, mtime backdate), so the
reader stays in stream.Recv forever and runShellCommand hangs the
test deadline.

- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag.
  When > 0, Pipeline.Run runs under context.WithTimeout(D); on
  expiry the reader/dispatcher drain cleanly and the cursor saves.
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat
  context.DeadlineExceeded the same as context.Canceled at exit
  (both are graceful shutdown signals).

* test(s3/lifecycle): pass -runtime 10s to run-shard

Pair with the new -runtime flag so the subprocess exits cleanly
after 10s instead of waiting for an event budget that never lands
on quiet shards.

* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg

The worker's router needs the same length-prefixed sha256 of the entry's
Extended map; pulling it out of the s3api private file lets both sides
import it.

* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS

Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil
while the live entry on the server has a non-nil hash, so every dispatch
returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.

* fix(s3/lifecycle): identity HeadFid via GetFileIdString

Meta-log events go through BeforeEntrySerialization, which clears
FileChunk.FileId and writes the Fid struct instead. Reading .FileId
directly returns "" on the worker side while the server's freshly
fetched entry still has a populated string, so the identity-CAS would
mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.

* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit

errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx
errors, so a subscription that ends because runCtx was canceled was
being logged as a fatal reader error. Check status.Code as well so the
shell's -runtime cap exits cleanly.

* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard

run-shard's -s3 flag dials the LifecycleDelete gRPC service, which
listens on s3.port + 10000. The integration test was passing the HTTP
port instead, so the dispatcher's RPC just timed out and the shell
command exited under -runtime with no work done.

* chore(test/s3/lifecycle): drop emoji from Makefile output

* docs(test/s3/lifecycle): correct '-shards 0-15' wording

* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run

The shell's parseShardsSpec already validates, but a programmatic caller
(scheduler, future worker config) shouldn't be able to silently produce
no-op states by passing -1 or 99.

* fix(s3/lifecycle): bound drain + final-save with their own timeouts

Shutdown was using context.Background, so a stuck dispatcher RPC or
filer save could keep Pipeline.Run from ever returning.

* fix(test/s3/lifecycle): drop self-killing pkill in stop-server

The pkill pattern \"weed mini -dir=...\" is also in the running shell's
argv (it's the recipe body), so pkill -f matches its own bash and the
recipe exits with Terminated. CI test job passed but the cleanup step
failed with exit 2. The PID file is sufficient on its own.

* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var
2026-05-08 09:59:10 -07:00
dependabot[bot]
c918660901 build(deps): bump io.netty:netty-transport-native-epoll from 4.1.132.Final to 4.2.13.Final in /test/java/spark (#9365) 2026-05-08 06:00:51 -07:00
dependabot[bot]
9cb103cd35 build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364) 2026-05-08 05:59:26 -07:00
Chris Lu
b7928637a0 refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml (#9360)
* refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml

The structs S3 PutBucketLifecycleConfiguration parses and the canonical
conversion to s3lifecycle.Rule lived in package s3api, which transitively
imports weed/server (which imports weed/shell). Any caller outside
weed/s3api — the shell, the future lifecycle worker — that wanted to
parse a bucket's lifecycle XML hit an import cycle.

Moves:
  weed/s3api/s3api_policy.go              -> lifecycle_xml/types.go
  weed/s3api/s3api_lifecycle_canonical.go -> lifecycle_xml/canonical.go
  s3api_lifecycle_canonical_test.go       -> lifecycle_xml/canonical_test.go
  s3api_policy_test.go                    -> lifecycle_xml/round_trip_test.go

Renames the public RuleStatus type (was unexported ruleStatus) and adds
small accessor methods (Set/Val/AndSet/TagSet) for fields the s3api
handler needs to read across the package boundary. Adds NewPrefix and
NewExpirationDays constructors so the GET handler can build response
rules without poking at unexported fields. Adds a Tag struct local to
the package so it has zero internal seaweed deps. Adds a one-shot
ParseCanonical(xml []byte) helper for non-server callers.

s3api_policy.go was misnamed — its content is lifecycle XML, not S3
bucket policy. The new package name reflects the actual scope.

* test(s3api/lifecycle_xml): exercise public API in tests

- canonical_test.go's parseLifecycle helper went through xml.Unmarshal
  directly; route it through the package's exported Parse so tests
  validate the public entrypoint.
- round_trip_test.go asserted internal flags (rule.Filter.tagSet,
  rule.Filter.andSet, Transition.set, NoncurrentVersionTransition.set);
  switch to TagSet(), AndSet(), Set() — exercises the public contract
  that downstream callers (s3api handler, future shell command) rely on.
2026-05-07 18:54:06 -07:00
Chris Lu
c567da7164 feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359)
Phase 2 added the LifecycleDelete handler on S3ApiServer but never
registered it on a running gRPC server, so workers had no endpoint to
dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on
S3ApiServer and register it on the s3 command's grpc server alongside
SeaweedS3IamCacheServer.
2026-05-07 18:19:42 -07:00
Chris Lu
35e3fe89bc feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358)
* feat(s3/lifecycle): filer-backed cursor Persister

FilerPersister persists per-shard cursor maps as JSON to
/etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler.
One file per shard keeps Save atomic — the filer writes the entry
in a single mutation, so a crash mid-write doesn't leak partial
state. Pipeline.Run loads on start; the periodic checkpoint and
graceful-shutdown save go through this implementation.

A small FilerStore interface wraps the SeaweedFilerClient surface
the persister needs, so tests inject an in-memory fake instead of
mocking the full gRPC client.

* refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block

A frozen cursor doesn't advance, so the durable cursor (FilerPersister)
encodes the blocked state on its own. On worker restart the reader
re-encounters the poison event at MinTsNs, the dispatcher walks the
same retry budget to BLOCKED, and the cursor freezes at the same
EventTs. Other in-flight events between freeze tsNs and prior cursor
positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the
underlying objects were already deleted on the prior pass.

Removed:
  - BlockerStore interface + InMemoryBlockerStore + BlockerRecord
  - Dispatcher.Blockers + Dispatcher.ReplayBlockers
  - the BlockerStore.Put call in handleBlocked
  - Pipeline.Blockers field + the ReplayBlockers call on startup

Added a TestDispatchRestartReFreezesNaturally that pins the
self-recovery property: a fresh Dispatcher with a fresh Cursor, fed
the same poison event, reaches the same frozen state at the same
EventTs without any durable blocker store.

Operator visibility: a cursor whose MinTsNs hasn't advanced is the
signal — surfaced via the durable cursor file.

* refactor(filer): SaveInsideFiler accepts ctx

ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background()
internally and silently dropped the caller's ctx. Symmetric API now;
cancellation/deadlines propagate through LookupEntry / CreateEntry /
UpdateEntry. Mechanical update of all callers — most pass
context.Background() since the existing call sites have no ctx in scope.

* fix(s3/lifecycle): deterministic order in cursor save

Iterating Go maps yields random order, so json.Encode produced a different
byte sequence on each save even when the state hadn't changed. Sort
entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk
file diffs cleanly. New test pins byte-identical output across two saves
of the same map.

* fix(s3/lifecycle): log reason when freezing cursor in handleBlocked

handleBlocked dropped the reason via _ = reason with a comment claiming
the caller logged it; none of the three callers do. A frozen cursor is
the only surface where the operator finds out something stuck, so the
reason has to land somewhere. glog.Warningf with shard, key, eventTs,
and the original reason — same shape the rest of the package uses.
2026-05-07 17:45:04 -07:00
Chris Lu
ec83a87d68 perf(s3/lifecycle): defer pool Put on ShardID hasher
defer guarantees the hasher returns to the pool even if h.Write or
h.Sum panic, preventing pool leak under unexpected failure modes.
2026-05-07 17:08:50 -07:00
Chris Lu
3a192c6c57 fix(s3/lifecycle): address Phase 3 post-merge review (#9354 #9355 #9356) (#9357)
* fix(s3/lifecycle): reader handles bare /buckets parent and pre-normalizes prefix

extractBucketKey accepted /buckets/ but rejected /buckets (no trailing
slash); some delete events emit the bare form, so bucket-root events
were silently dropped. Pre-normalize BucketsPath once on Run instead
of recomputing per event.

* perf(s3/lifecycle): pool sha256 hashers in ShardID

ShardID runs on every meta-log event before the shard filter; a fresh
sha256.New per call produces measurable allocator pressure under load.
sync.Pool reuses hashers across calls.

* fix(s3/lifecycle): router skips hard deletes and missing-attribute events

A hard delete carries no schedule-relevant state — Expiration would hit
NOOP_RESOLVED at dispatch and ExpiredObjectDeleteMarker fires from a
Create on the latest version. Skip rather than burn a schedule slot.

Missing Attributes leaves ModTime at year 0001, which makes
ExpirationDays fire immediately at dispatch. Skip the event instead.

Drop the unused 'versioned' parameter from buildObjectInfo; the
dispatcher's identity-CAS handles version drift in Phase 5.

* fix(s3/lifecycle): EntryIdentity.MtimeNs holds true nanoseconds

Both computeEntryIdentity (server) and buildIdentity (router) wrote
entry.Attributes.Mtime (seconds) into a field named MtimeNs. The CAS
worked because both sides agreed, but the encoding contradicted the
field name and would break if either side later started using true
nanoseconds. Combine Mtime*1e9 + the FuseAttributes.MtimeNs nanosecond
component on both sides; the test was updated to match.

* fix(s3/lifecycle): dispatcher distinguishes ctx cancel from transport errors

A canceled or deadline-exceeded RPC is shutdown, not a transport
failure: re-queue the Match at its original DueTime with no retry-budget
burn so a quick restart can't escalate it to BLOCKED.

* fix(s3/lifecycle): reader fallback prefix normalization mirrors Run

The fallback path that builds prefix from r.BucketsPath when
bucketsPathSlash is empty (test-only entry into extractBucketKey) was
appending an unconditional '/', producing '//' if BucketsPath already
ended with one. Use the same normalization Run does.

* fix(s3/lifecycle): ObjectInfo.ModTime carries the nanosecond component

ModTime dropped FuseAttributes.MtimeNs, leaving ExpirationDays one
nanosecond off relative to EntryIdentity.MtimeNs. Pass both to
time.Unix so the precision matches the CAS witness.
2026-05-07 16:54:24 -07:00
Chris Lu
5c991f38f5 feat(s3/lifecycle): dispatcher + per-shard pipeline (Phase 3 PR-D) (#9356)
feat(s3/lifecycle): dispatcher + blocker store + per-shard pipeline

Dispatcher consumes due Matches from the schedule, calls LifecycleDelete,
and routes outcomes:
  DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK -> Cursor.Advance
  RETRY_LATER (within budget)                 -> re-schedule with backoff
  RETRY_LATER (budget exhausted) / BLOCKED    -> BlockerStore.Put + Freeze

BlockerStore is a small interface with InMemoryBlockerStore for tests;
the filer-backed impl follows when the worker task registration lands.

Pipeline composes Reader + Router + Dispatcher into a single Run loop
keyed by shard. Cursor is restored on start, blockers are replayed as
freezes, checkpoints write at a configurable cadence, and a final save
fires on shutdown. The meta-log itself is the durable buffer for in-flight
schedule entries — restart re-derives them from the cursor's MinTsNs.
2026-05-07 15:44:09 -07:00
Chris Lu
8425c42858 feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355)
feat(s3/lifecycle): event router + DueTime schedule

Router consumes per-shard reader events, looks up matching ActionKeys via
the engine's BucketActionKeys index, and emits Matches with DueTime =
event_time + action.Delay. Evaluation runs at DueTime so the age gate
passes for fresh events; the dispatcher's identity-CAS catches drift.

Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles
the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor
added to engine.Snapshot.
2026-05-07 15:43:27 -07:00
Chris Lu
0f6c6b0524 feat(s3/lifecycle): shard-aware meta-log reader (Phase 3 PR-B) (#9354)
feat(s3/lifecycle): shard-aware meta-log reader

- ShardCount=16; ShardID(bucket,key)=top-4-bits of sha256(bucket||/||key)
- Reader subscribes via SubscribeMetadata starting at Cursor.MinTsNs(),
  filters events by shard, emits to caller-owned Events channel
- Cursor: per-(shard, ActionKey) position with monotonic Advance, Freeze
  for blocked actions, MinTsNs for subscription resume
- Persister interface with InMemoryPersister for tests; filer-backed
  impl lands with the worker integration
2026-05-07 15:42:37 -07:00
Chris Lu
3a76c0b027 feat(worker/proto): per-shard READ — add S3LifecycleParams.shard_id (#9353)
READ moves from a cluster-singleton to one task per (bucket, key-prefix-hash)
shard. Cluster has 16 shards; workers receive one READ task per owned shard.
Required for READ; ignored for BOOTSTRAP and DRAIN.
2026-05-07 15:29:12 -07:00
Chris Lu
4f79d8e358 feat(s3/lifecycle): bucket-level bootstrap walker (#9350)
* feat(worker): add TaskTypeS3Lifecycle constant

Single job type for the lifecycle worker; the S3LifecycleParams.Subtype
field (READ / BOOTSTRAP / DRAIN) dispatches inside the handler. The
"s3_lifecycle" string is already wired to LaneLifecycle in
admin/plugin/scheduler_lane.go so adding the constant doesn't change
runtime behavior — it lets future commits reference the type name
without sprinkling string literals.

* feat(s3/lifecycle): bucket-level bootstrap walker

Iterates entries in a bucket, evaluates every active ActionKey in the
engine snapshot against each entry, and dispatches inline-delete for
currently-due actions. Date-kind actions and pending_bootstrap actions
are skipped — the former are handled by their own SCAN_AT_DATE
bootstrap, the latter aren't IsActive() yet.

Walker is callback-driven so callers supply the listing source
(real filer_pb.SeaweedList or test fake) and the dispatcher (real
LifecycleDelete client or test fake). This keeps the walker free of
filer_pb dependencies and makes the per-action evaluation flow
unit-testable in isolation.

Checkpoint state (LastScannedPath, Completed) is returned to the
caller, who is responsible for persisting it under
/etc/s3/lifecycle/<bucket>/_bootstrap. Walk() honours opts.Resume so
a kill-resumed task picks up where the previous walker stopped.

Tests cover: prefix-mismatched skip, not-yet-due skip (reader's job),
date-kind skip, pending_bootstrap skip, multi-action rule (one rule
with three actions dispatches three times — the regression that
per-action keying fixes), dispatch error halts at last-successful
checkpoint, Resume skips entries up to and including the resume
path.

* test(s3/lifecycle): walker test uses bucket-scoped ActionKey

Mechanical follow-up to the bucket-scoped ActionKey on
lifecycle-engine: the bootstrap walker tests construct ActionKeys to
seed PriorStates and need the Bucket field to match what
engine.Compile keys against.

* fix(s3/lifecycle): walker quick wins

Two minor cleanups noted on review:

- Drop the redundant Resume re-filter inside the Walk callback.
  ListFunc's contract already promises "skip entries with Path <=
  start"; trusting that contract avoids divergence if the filter
  logic ever changes on one side and not the other.

- Hoist the ObjectInfo allocation out of the per-action loop in
  walkEntry. Multi-action rules previously allocated one ObjectInfo
  per (entry, kind) pair; now it's one per entry, reused across all
  matching kinds.

* fix(s3/lifecycle): walker Entry.NoncurrentIndex tracks ObjectInfo's *int

ObjectInfo.NoncurrentIndex is now *int so unset is unambiguous;
mirror that on bootstrap.Entry so the per-entry construction stays
type-clean. Phase 5 (versioned-bucket walks) is the first caller
that will populate the field.

* refactor(s3/lifecycle): trim narration from bootstrap walker

Drop the inline step-by-step on Walk and the multi-paragraph package
preamble; the function names already say it. Keep one-liner WHYs at
the SCAN_AT_DATE skip and the once-per-entry ObjectInfo build.

* fix(s3/lifecycle): walker skips directories and ModeDisabled actions

Two safety findings from review:

1. SeaweedFS directory entries can appear in the listing alongside
   objects; without an IsDirectory check the walker would treat a dir
   like any other entry and could dispatch a delete against it. Add
   IsDirectory to bootstrap.Entry and short-circuit it before
   walkEntry.

2. ModeDisabled is set by the operator (e.g. shell pause) independent
   of the XML rule's Status field. EvaluateAction gates on Status and
   would still fire for an operator-disabled action whose XML status
   is "Enabled". Skip ModeDisabled explicitly in walkEntry alongside
   the existing SCAN_AT_DATE skip.

Two regression tests pin both cases.

* perf(s3/lifecycle): reuse ObjectInfo across walker entries

Walker allocated one ObjectInfo struct per entry. For buckets with
millions of objects that's measurable GC pressure. Hoist the
allocation out of the per-entry callback (one per Walk) and reuse
via field assignment in walkEntry.

EvaluateAction reads ObjectInfo synchronously and doesn't retain a
reference, so the reuse is safe — the next iteration's overwrite
can't corrupt an in-flight evaluation.

* refactor(s3/lifecycle): trim narration on walker

Drop the multi-line Entry / ObjectInfo-reuse / SCAN_AT_DATE+DISABLED
explanations. The walker's structure is small enough that the
condition itself reads as the documentation.
2026-05-07 15:04:51 -07:00
Chris Lu
5ab3860005 feat(s3/lifecycle): LifecycleDelete RPC server (#9349)
* feat(s3/lifecycle): SeaweedS3LifecycleInternal.LifecycleDelete RPC

Adds the worker-to-S3 RPC the lifecycle worker calls to execute one
(rule, action) verdict against one entry. The S3 server is the only
component allowed to mutate filer state, so it gets the final word:
re-fetch, identity CAS, object-lock check, dispatch.

LifecycleDeleteRequest carries the routing tuple (bucket, object_path,
version_id, rule_hash, action_kind), the per-stream context echoed
into BlockerRecord on FATAL outcomes, and an EntryIdentity CAS witness
that lets the server detect mid-flight drift (mtime, size, head fid,
sorted-Extended hash).

LifecycleDeleteOutcome covers all five worker-actionable verdicts:
DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK / RETRY_LATER / BLOCKED.
Cursor-advance / pending-mutation rules per outcome are documented
inline.

Makefile updated to emit the gRPC stubs alongside the message types.

* feat(s3/lifecycle): LifecycleDelete server handler

Server-side handler for the worker-to-S3 RPC. Five-step flow:

1. Re-fetch live entry (NOT_FOUND -> NOOP_RESOLVED).
2. CAS on EntryIdentity (mtime / size / head fid / sorted-Extended
   sha256). Mismatch -> NOOP_RESOLVED with reason STALE_IDENTITY.
3. enforceObjectLockProtections with governanceBypassAllowed=false.
   Lifecycle never bypasses legal-hold or compliance retention; the
   safety scan re-attempts after the hold lifts. Lock refusal ->
   SKIPPED_OBJECT_LOCK (logged + counted, cursor advances).
4. Dispatch by action_kind:
   - EXPIRATION_DAYS / EXPIRATION_DATE: createDeleteMarker on
     versioned buckets, deleteUnversionedObjectWithClient otherwise.
   - NONCURRENT_DAYS / NEWER_NONCURRENT / EXPIRED_DELETE_MARKER:
     deleteSpecificObjectVersion (the marker is just a version).
   - ABORT_MPU: stub returning RETRY_LATER pending Phase 5 wiring.
5. Helper failures the server can't classify as transient -> BLOCKED
   with reason "FATAL_EVENT_ERROR: <detail>"; the worker writes a
   durable BlockerRecord and pauses the failing stream. Filer fetch
   transport errors -> RETRY_LATER (sustained transients eventually
   promote to BLOCKED via the worker's retry budget).

hashExtended uses length-prefixed encoding so a forged Extended
payload (e.g. one tag value containing the byte sequence of two real
tags) can't collide with a real two-tag map. Regression test pins
this.

Tests cover identity-comparison fields, hashExtended order/delimiter
stability, empty-request rejection. Full filer-integration tests come
in Layer 2.

* fix(s3/lifecycle): use stdlib bytes.Equal for ExtendedHash compare

Replaces a hand-rolled byte slice comparator with bytes.Equal —
idiomatic, slightly faster (the stdlib version is intrinsified on
amd64/arm64), and one fewer test surface to maintain.

* refactor(s3api): trim narration from lifecycle RPC + canonicalizer

Drop step-by-step doc-block narrations on LifecycleDelete and the
helpers that reproduced what the code already says. Keep WHY one-
liners at non-obvious spots: the http.Request=nil safety claim,
why hashExtended is length-prefixed, the safety-scan revisit
contract for SKIPPED_OBJECT_LOCK, the TODO marker for ABORT_MPU.

* refactor(s3/lifecycle): retryLater helper, drop inline literals

Adds retryLater() alongside done/noopResolved/blocked so the four
LifecycleDeleteOutcome constructions look the same. Replaces the two
inline RETRY_LATER literal returns (live-fetch transport error and
the ABORT_MPU stub) with the helper. No behavior change.

* fix(s3/lifecycle): default filer-error classification to RETRY_LATER

Versioning lookup, createDeleteMarker, deleteUnversionedObjectWithClient,
and deleteSpecificObjectVersion all do filer round-trips. Most failures
are transient (filer unavailable, slow, network blip), not deterministic
per-event errors. Returning BLOCKED for them was too aggressive: BLOCKED
halts the stream until an operator intervenes, which would surface
on every transient filer hiccup.

Default these paths to RETRY_LATER. The worker's retry budget already
promotes sustained transients to BLOCKED after a configured threshold,
which is the right place for that escalation. BLOCKED stays for the
truly deterministic error: the request-shape check (missing version_id
on noncurrent delete) and the unknown-action-kind dispatch fallback.

* fix(s3/lifecycle): honor versioning Suspended on current-version expiration

Treating Suspended-versioning buckets like never-versioned ones was
wrong: per AWS S3, current-version expiration on Suspended must
remove the existing null version and insert a new delete marker (the
same behavior a user DELETE produces). The previous code branched
only on isVersioningEnabled() — which returns false for Suspended —
and dropped through to the actual-delete path, losing the version
history that Suspended buckets are still expected to keep.

Switch to getVersioningState() and dispatch on three branches:

  Enabled    -> createDeleteMarker (current becomes non-current)
  Suspended  -> deleteSpecificObjectVersion("null"), createDeleteMarker
  Off / never configured -> deleteUnversionedObjectWithClient

The Suspended path's null-version deletion tolerates NotFound (the
null version may not exist when this is the first delete after a
versioning toggle); other errors classify as RETRY_LATER like the
rest of the dispatch.

* refactor(s3/lifecycle): trim narration on LifecycleDelete

Drop the 6-line versioning-state docblock and the inline
"transient -> RETRY_LATER" rationales; keep one-liners.

* fix(s3/lifecycle): bucket-not-found is NOOP; include ErrObjectNotFound

Three follow-ups from review (the rest were already addressed by
prior commits):

- Versioning lookup now distinguishes filer_pb.ErrNotFound
  (BUCKET_NOT_FOUND -> NOOP_RESOLVED) from genuinely transient
  failures (RETRY_LATER). A bucket deleted between live-fetch and
  versioning-lookup is benign, not transient.
- deleteUnversionedObjectWithClient and deleteSpecificObjectVersion
  now also recognise ErrObjectNotFound as a resolved-NOOP, matching
  the live-fetch step's classification.
2026-05-07 15:03:33 -07:00
Chris Lu
7f2b20d577 feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348)
* feat(s3/lifecycle): XML lifecycle config to canonical Rule

LifecycleToCanonical takes a parsed *Lifecycle and returns
[]*s3lifecycle.Rule, the flat shape the engine compiles against.
Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags +
size filters) flatten into the canonical Rule's individual fields;
single <Tag> filter populates FilterTags with one entry; <Prefix>
filter takes precedence over the rule's top-level <Prefix>.

Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on
the same XML <Rule>) populate every action field they declare.
RuleActionKinds expands the canonical rule into its compiled actions
downstream.

* feat(s3/lifecycle): engine snapshot skeleton + ActionKey type

Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's
primary identity, and adds the engine package's Snapshot type.
Snapshot is immutable after Compile (atomic-swapped on rebuild) and
holds the ActionKey-keyed routing indexes:

  - originalDelayGroups: map[time.Duration][]ActionKey
  - predicateActions:    []ActionKey
  - dateActions:         map[ActionKey]time.Time
  - actions:             map[ActionKey]*CompiledAction

CompiledAction.engineState is an atomic.Uint32 so MarkActive (called
after the durable bootstrap_complete + mode write commits) is visible
to in-flight reader passes without a recompile. The reader filters on
IsActive() before dispatching, so stale-snapshot dispatches are
prevented.

No callers yet; downstream commits add Compile, decideMode, and the
Match functions.

* feat(s3/lifecycle): decideMode + retention gate

decideMode picks the scheduling mode for one (rule, kind) compiled
action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE;
reader-driven kind whose eventLogHorizon + bootstrapLookbackMin
exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The
gate runs per (rule, kind), so a 90d ExpirationDays sibling can
degrade to scan_only while its 7d AbortMPU sibling stays active.

MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS
default (Phase 0 verified that meta-log files are written without
TtlSec by default), so the gate doesn't trip until an operator opts
in to volume-TTL pruning of /topics/.system/log/.

RuleMode is a Go-level enum here, separate from the wire-form
LifecycleState.RuleMode in the proto package; the worker maps between
them when reading/writing the durable state file.

* feat(s3/lifecycle): Compile builds the engine snapshot per-action

Compile produces a fresh Snapshot from per-bucket canonical rules.
Each input rule expands into N CompiledActions via RuleActionKinds;
mode comes from decideMode; activation requires both
bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN.

Routing indexes are populated by mode:
- SCAN_AT_DATE: always indexed in dateActions (detector schedules at
  rule.date regardless of bootstrap status; the action runs once on
  the date and is then done).
- EVENT_DRIVEN + active: indexed in originalDelayGroups (and in
  predicateActions when the rule has tag/size filters).
- SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan
  tick or operator action handle these.

snapshot_id is monotonic per process; pending writes stamp it. The
new snapshot replaces the engine's atomic pointer; in-flight reader
passes continue against their loaded snapshot.

Tests cover: single-action rule, multi-action expansion (one rule ->
three CompiledActions with three distinct delay groups), pending
bootstrap exclusion from indexes, retention gate, sibling actions
degrading independently under partial retention, ExpirationDate path,
disabled rule, MarkActive flipping IsActive(), Compile producing
monotonic snapshot ids.

* feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath

The reader feeds events through the engine's match functions to find
the active ActionKeys whose filter applies. The minimal Event shape
the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker,
IsMPUInit) keeps engine free of filer_pb dependencies; the reader
extracts these fields from the persisted *filer_pb.LogEntry payload
in Phase 3.

- MatchOriginalWrite: per-delay-group sweep entry. Filters on shape =
  EventShapeOriginalWrite, prefix, tag, size, then per-kind shape
  gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on
  IsLatest+IsDeleteMarker).
- MatchPredicateChange: single near-now sweep. Returns only the
  predicate-sensitive subset of active ActionKeys.
- MatchPath: bucket-level walker entry. Returns every active action
  whose filter matches; bootstrap iterates these per object and calls
  EvaluateAction per kind.

All filter on a.IsActive() at routing time so MarkActive flips become
visible without recompile.

* fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile

Three findings on the engine PR addressed:

1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind}
   only. Two buckets with rules whose XML is identical produce the same
   RuleHash; the second bucket's Compile would overwrite the first
   bucket's CompiledAction in snap.actions. Add Bucket to ActionKey
   so the engine's identity matches the on-disk path layout
   /etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression
   test pins it.

2. Major (immutability leak): OriginalDelayGroups, PredicateActions,
   DateActions returned the snapshot's internal maps/slices by
   reference, letting an external caller mutate routing state and
   break the documented immutability contract. Return defensive
   copies.

3. Minor (redundant condition): mode==EVENT_DRIVEN already implies
   kind != EXPIRATION_DATE because decideMode routes the date kind
   to SCAN_AT_DATE. Drop the redundant check.

Tests updated to construct ActionKey with the new Bucket field.

* fix(s3/lifecycle): drop size filters from rulePredicateSensitive

An object's size is immutable once written: any content change is a
fresh write that flows through the original-write stream, not the
predicate-change one. Tagging rules really can flip post-PUT
(operator adds/removes a tag without rewriting), so they belong; size
filters do not.

Including size filters here was adding rules to predicateActions for
no purpose — every predicate-change sweep would waste cycles
re-evaluating size predicates that physically can't have changed.

* perf(s3/lifecycle): pre-sort AllActions at Compile time

Snapshot is immutable after Compile (engineState bit-flips don't
change membership), so the (bucket, rule_hash, action_kind) ordering
is stable for the snapshot's lifetime. Build the sorted slice once
and serve every AllActions() call from it; drop the per-call
sort.Slice. The bootstrap walker is the primary caller and may
iterate this on every task entry.

* docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity

Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan>
explicitly excludes 0-byte objects, but with the int64 zero value as
the unset sentinel we can't distinguish that from omitted-and-default.
Document the limitation inline so a future deployment that needs the
distinction can switch to *int64 (or a paired set-bool) and update
the matchers / RuleHash accordingly. Not fixing now: the explicit-zero
configuration is unusual, the canonical Rule shape mirrors the same
zero-as-unset convention as s3api.Filter, and a structural fix
touches every filter-using site (evaluator, due_at, match, RuleHash).

* fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex *int

The previous int field had a zero-value collision: 0 is both "newest
non-current version" (a valid index) and "uninitialised by ObjectInfo{}
literal." A caller who built &ObjectInfo{IsLatest: false} without
explicitly setting NoncurrentIndex would have it implicitly read as
"newest non-current," and the count-based NewerNoncurrent retention
would use that bogus 0 to decide eligibility.

Switch to *int so nil is explicitly "not a non-current version /
index not yet computed." The evaluator's NoncurrentDays and
NewerNoncurrent paths conservatively return ActionNone when the
index is nil — the safety scan will revisit once the index is
supplied. This removes a class of latent footguns in test setup and
in any future code path that constructs ObjectInfo without a
versioning-aware builder.

idx() helper added in tests to keep the call sites a one-liner.

* refactor(s3/lifecycle): trim narration from engine + helpers

Drop "what" comments where well-named identifiers already say it
(IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph
"why" docs to one-liners where the design rationale is already in
the design doc. Keep WHY comments only at non-obvious load-bearing
spots: the routing-index activation predicate, the *int rationale on
NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay
horizon rule.

Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go,
min_trigger_age.go, event_log_horizon.go, engine/engine.go,
engine/compile.go, engine/match.go, engine/mode.go.

No behavior change; tests untouched and pass.

* fix(s3/lifecycle): durable PriorState.Mode wins over decideMode

PriorState.Mode was declared but never read; Compile recomputed mode
via decideMode and stored that on every CompiledAction. Effect: an
action durably persisted as SCAN_ONLY (lag fallback or operator
pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the
next engine rebuild as soon as decideMode's XML+retention predicate
said so. Defeats the durability of mode state.

Use prior.Mode when set; fall through to decideMode only for new
actions (no prior at all) and for legacy entries persisted before
Mode existed (zero value). Regression test pins both branches.

* fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key

MarkActive's documented contract was "flip visible without a
recompile," but the routing indexes (originalDelayGroups,
predicateActions) were only populated when active && mode ==
EVENT_DRIVEN at compile time. So a key compiled with
BootstrapComplete=false would never enter the indexes; a later
MarkActive flipped engineState but MatchOriginalWrite /
MatchPredicateChange iterated the indexes and never saw the key.
Only MatchPath (which walks bi.actionKeys) and DateActions worked.

Index every EVENT_DRIVEN key regardless of `active`. The runtime
IsActive() filter inside filterMatching already gates dispatch, so
inactive entries are matched-but-not-fired; flipping MarkActive
makes them routable without recompile, matching the documented
contract.

Tests updated: TestCompile_BootstrapPendingIndexedButInactive
asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable
asserts a MarkActive flip routes the next match.

* test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior

Two regression tests for the *int pointer migration: nil index
combined with NewerNoncurrent (either paired with NoncurrentDays or
standalone) must short-circuit to ActionNone rather than guess at
the version's position in the keep-N window.

* refactor(s3/lifecycle): trim follow-up narration on engine + helpers

Comments accumulated since the last sweep — the durable-Mode rationale,
the MarkActive routability note, the routing-index doc, the
NoncurrentIndex pointer rationale, and the EvaluateAction docblock.
Trimmed each to one or two terse lines; the underlying contracts live
in the design doc.

* docs(s3/lifecycle): note CompileInput one-per-bucket invariant
2026-05-07 15:00:49 -07:00
Chris Lu
b9bf45cb2e fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347)
* fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection

Closes #9345.

-volumeId only filtered which volume .idx files were pulled; the filer-side
BFS still walked from "/", printing every directory under -v and making it
look like the flag was ignored. When all requested volumes share a single
non-empty collection that maps to an existing <bucketsPath>/<collection>
directory, restrict the BFS root to that bucket. Empty-collection volumes
or multi-collection selections fall back to the full walk, since chunks
for those can live anywhere.

* trim comments

* address review: collapse getCollectFilerFilePath; unshadow receiver in loop
2026-05-07 10:04:01 -07:00
Chris Lu
4e10669221 docs(s3/lifecycle): event-driven redesign (#9346)
* docs(s3/lifecycle): event-driven redesign

Replaces the synchronous PUT-handler walk with an event-driven worker
model: meta-log reader subscribed to one filer, client-side heap merge
with per-filer-shard MessagePosition cursors, bucket-level bootstrap
with inline delete, blocked-cursor handling for fatal events, durable
retry budget for sustained-transient promotion, retention mode gate
that downgrades reader-driven rules to scan_only when log retention
falls below the rule's event-log horizon.

* docs(s3/lifecycle): record Phase 0 verified assumptions

ReadPersistedLogBuffer payload carries Extended (event marshaled via
SubscribeMetadataResponse → ToProtoEntry). Meta-log files at
topics/.system/log/<date>/<HH-MM>.<filerId> are written without TtlSec
in filer_notify_append.go; retention is unbounded by default and only
shrinks if an operator sets a filer.conf rule with a TTL on the
SystemLogDir prefix. .versions/ filenames are v_<16h-ts><16h-rand>
with old/new (inverted) format distinguished by threshold
0x4000000000000000; getVersionTimestamp / compareVersionIds give
format-agnostic ordering for successor-version discovery.

* feat(s3/lifecycle): rule evaluator and dueAt helper

Evaluate(rule, info, now) returns the EvalResult by object shape:
IsMPUInit -> AbortMultipartUpload, IsDeleteMarker -> ExpireDeleteMarker
when sole survivor, IsLatest -> DeleteObject (Days or Date), non-current
-> DeleteVersion (Days fallback to ModTime when SuccessorModTime is
zero; NewerNoncurrent retention enforced when both are set).

ComputeDueAt mirrors Evaluate's shape and returns the earliest eligible
wall-clock time for the same (rule, info), used by the reader/bootstrap
to decide pending vs inline-delete.

Adds StatusEnabled/Disabled and SmallDelay consts and an IsMPUInit
flag on ObjectInfo so .uploads/<id>/ entries route to AbortMPU without
overloading IsLatest.

No callers; package compiles standalone.

* feat(s3/lifecycle): MinTriggerAge for safety-scan cadence

Returns the smallest non-zero day threshold across the rule's actions.
Used as max(MinTriggerAge, kindFloor) for the per-kind cadence; date,
count-only, and delete-marker-only rules return 0 so callers fall
through to their kind-specific floor.

* feat(s3/lifecycle): EventLogHorizon for retention mode gate

Returns the maximum event age the reader needs for a rule. Days-based
kinds return their day threshold; pure NewerNoncurrent (count) and
ExpiredObjectDeleteMarker return SmallDelay. Date rules return 0 (the
gate skips them). Multi-action rules take the max — strictest horizon
wins.

Drives the Phase 2 mode gate: metaLogRetention < EventLogHorizon(rule)
+ bootstrapLookbackMin -> scan_only with RETENTION_BELOW_HORIZON.

* feat(s3/lifecycle): RuleHash for per-rule state CAS

sha256 over canonicalized form, first 8 bytes. Stable across tag-key
reorder, prefix trailing-slash variation, ID renames, and Status flips
(state continuity is preserved when an operator toggles
Enabled/Disabled). Different action shapes — different days, filter,
or action type — hash differently.

Used by the per-rule state directory layout
/etc/s3/lifecycle/<bucket>/<rule_hash_hex>/ and by the bootstrap
detector's reconcile-on-PUT.

* feat(s3/lifecycle): add s3_lifecycle.proto storage schema

Defines the durable types backing the lifecycle worker:
LifecycleState (per-rule mode + bootstrap_complete + degraded_reason
incl. RETENTION_BELOW_HORIZON / LOST_LOG), PendingItem, EntryIdentity,
BootstrapState, ReaderState (per-filer-shard cursors plus
tail_drained_streams marker), BlockerRecord (rule_hash optional for
pre-evaluation failures), RetryBudgetEntry with the four-shape
StreamKey oneof, and RetryTarget (no action / no expected_identity —
retry replays handler against current state).

No callers; schema only. Wired into the pb Makefile.

* feat(s3/lifecycle): add S3LifecycleParams to TaskParams

Wires lifecycle subroutines into the existing worker dispatch.
Subtype is READ (cluster-singleton meta-log reader), BOOTSTRAP
(per-bucket walker), or DRAIN (per-rule pending). bucket / rule_hash
populated for the latter two; ContinuationHint is an advisory resume
point for kill-resumable BOOTSTRAP / READ.

oneof tag = 14, after the existing ec_balance_params at 13.

* fix(s3/lifecycle): noncurrent delete markers honor NoncurrentDays

A non-current delete marker is just another version per AWS S3 spec
and is eligible under NoncurrentVersionExpirationDays. The
IsDeleteMarker special case is meant only for the *current* delete
marker (sole-survivor ExpiredObjectDeleteMarker action), so guard
that switch arm with IsLatest. Without IsLatest, the bootstrap walker
silently skipped pre-existing non-current delete markers under a
NoncurrentDays rule because ComputeDueAt returned zero.

Mirrors the same fix in evaluate.go so the runtime decision matches.

* fix(s3/lifecycle): preserve prefix trailing slash in RuleHash

"logs" and "logs/" match different object sets under literal
strings.HasPrefix semantics: "logs" matches "logsmore/x", "logs/" does
not. Collapsing them in the hash would let an XML edit silently bind
the new rule to the previous rule's durable state directory, causing
stale bootstrap_complete and stale pending entries against a rule
that now matches a different set.

* fix(s3/lifecycle): add UNSPECIFIED sentinels to enum zero values

Proto3 best practice: enum 0 should be an _UNSPECIFIED sentinel, not
an active value. Persisted state schemas care most: a partially
populated payload (or one written by an older binary that didn't set
the field) would otherwise silently default to a semantically active
value.

- LifecycleState.RuleKind: shifts EXPIRATION_DAYS..EXPIRED_DELETE_MARKER
  from 0..5 to 1..6, with RULE_KIND_UNSPECIFIED at 0.
- LifecycleState.RuleMode: shifts EVENT_DRIVEN..PENDING_BOOTSTRAP from
  0..4 to 1..5.
- LifecycleState.DegradedReason: replaces NONE=0 with
  DEGRADED_REASON_UNSPECIFIED=0 (operators treat both as healthy).
- StreamKind: shifts ORIGINAL..PENDING from 0..3 to 1..4.
- S3LifecycleParams.Subtype: shifts READ..DRAIN from 0..2 to 1..3, so
  an unset subtype no longer routes into the cluster-singleton READ task.

No on-disk state has been written yet; renumbering is safe.

* docs(s3/lifecycle): per-action state, not per-rule

A single AWS lifecycle XML <Rule> can declare multiple actions in
parallel (e.g. ExpirationDays=90 + AbortMPU=7 + NoncurrentDays=30).
Each must drive its own delay/horizon/mode/pending stream
independently. Modeling the rule as one compiled entry with one kind
collapses these — picking the smallest delay (7d MPU) means the 90d
expiration cursor advances past objects that aren't yet due, and the
90d action never re-fires.

Restructure storage to per-action: every XML rule expands into N
compiled actions; state lives at <bucket>/<rule_hash>/<action_kind>/.
The intermediate rule_hash directory keeps a rule's actions grouped
for operator listing. Each action has its own state file with its own
mode + bootstrap_complete + degraded_reason; sibling actions of the
same rule can degrade independently.

* fix(s3/lifecycle): per-action proto schema + safety-scan counters

Realigns the durable schema with the per-action storage model and the
safety-scan contract in the design doc.

- Promote LifecycleState.RuleKind to a top-level ActionKind enum and
  rename rule_kind -> action_kind. The same enum now also keys
  BootstrapKey, PendingKey, and BlockerRecord so per-action streams
  under one rule never collapse.
- LifecycleState now keyed by (rule_hash, action_kind). Added
  last_safety_scan_ts_ns / next_safety_scan_ts_ns and the four
  observability counters the design specifies (evaluated_total,
  expired_total, metadata_only_total, error_total). Dropped
  last_evaluated_ns, deleted_total, skipped_object_lock_total, and
  pending_size — subsumed or out-of-band.
- BlockerRecord.action_kind is OPTIONAL (UNSPECIFIED for
  pre-evaluation failures, just like rule_hash).

No on-disk state has been written yet; the renumbering / rename is
free.

* fix(s3/lifecycle): per-action MinTriggerAge / EventLogHorizon helpers

The retention mode gate and safety-scan cadence run on each compiled
action independently. Taking a "min/max across all actions in the
rule" — as the previous helpers did — was wrong: a 90d ExpirationDays
sibling alongside a 7d AbortMPU action would cause the 90d cursor to
advance at 7d (because MinTriggerAge picked the smallest), and the
ExpirationDays action would never re-fire on objects that aged past
the 7d sweep window before reaching 90d.

Both helpers now take an ActionKind and return the threshold for that
specific action only. Returns 0 if the rule does not declare the
requested kind, which makes the gate a no-op and the cadence fall
through to the kind floor.

Also adds RuleActionKinds(rule), the canonical expansion that the
engine uses at compile time to turn one XML rule into N compiled
actions. NewerNoncurrentVersions paired with NoncurrentDays is
subsumed into a single NONCURRENT_DAYS action (AWS-paired
conditions); only when NewerNoncurrent stands alone does it become
NEWER_NONCURRENT.

* fix(s3/lifecycle): length-prefix RuleHash to remove delimiter ambiguity

The previous encoding used "tag=K=V\n" lines, which is ambiguous: a
tag (a=b, c) serializes identically to (a, b=c). A prefix containing
"\nexp_days=99" could likewise forge an action field. Either could
silently bind two semantically different rules to the same per-rule
state directory.

Switch to a length-prefixed canonical form: each scalar is written as
<field-tag-byte> <uvarint-length> <bytes>. Field-tag bytes namespace
each scalar so ("a=b" tag-key) and ("a" tag-key with "=b" leakage)
can't collide. Three regression tests pin the resistance.

* docs(s3/lifecycle): ActionKey is the engine identity, not rule_hash

Finishes the per-action restructure across the engine pseudocode,
bootstrap completion, drain locks, detector paths, policy CAS, and
Phase 2 plan. Every per-action data structure — engine indexes,
target modes, newly-completed sets, bootstrap completion bits, drain
keys, locks, metrics, status — is now keyed by
ActionKey{rule_hash, action_kind}, not by rule_hash alone.

Without this, sibling actions under one XML rule still shared
scheduling state in the engine: originalDelayGroups holding
[]ruleHash means a rule's 7d AbortMPU and 90d ExpirationDays
collapse into one entry, the smaller delay wins for cursor advance,
and the larger sibling never re-fires.

Helper API tightened: EvaluateAction(rule, kind, info, now) and
ComputeDueAt(rule, kind, info) replace the aggregate-rule signatures
so a caller asks one specific action's eligibility against one
entry, never "any action of this rule." Drain task lock includes
action_kind so siblings have independent re-arm timers. Policy CAS
moves from rule_hash to ActionKey membership.

* fix(s3/lifecycle): kind-aware EvaluateAction / ComputeDueAt

Replaces the aggregate-rule signatures with per-action ones:
EvaluateAction(rule, kind, info, now) and ComputeDueAt(rule, kind,
info). The old Evaluate(rule, info, now) could return the verdict of
ANY action declared on the rule, which sat wrong with the per-action
engine indexing (each ActionKey has its own delay group, mode, and
pending stream).

Each helper now decides exactly one (rule, kind) compiled action's
fate against one entry. Asking for a kind the rule doesn't declare,
or asking against the wrong object shape for that kind, returns
ActionNone / zero — never silently routes to a sibling.

Multi-action regression test: a rule with ExpirationDays=90 and
AbortMPU=7 evaluates each kind independently for the same entry; the
7d window has no influence on the 90d eligibility decision.
2026-05-07 10:02:32 -07:00
Minsoo Kim
a1e5eb9dad Fix UI prefix url encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes

* Fix filer UI navigation for URL-sensitive object prefixes

* Clarify filer UI path escaping test name

Rename the legacy filer UI
  path test to describe the actual behavior being checked.

  The printpath helper preserves timestamp characters that are valid in URL path
  components, while the PR fix is focused on query-string escaping for path and cursor
  parameters.
2026-05-06 19:14:36 -07:00
Chris Lu
487b93eb49 fix(volume): don't panic on read when needle map is nil (#9342)
* fix(volume): don't panic on read when needle map is nil

A failed CommitCompact reload (and #9335's new error path for a
remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil
on a volume that's still in the store. readNeedle / readNeedleDataInto
dereferenced v.nm with no guard, so the next GET segfaulted the
http handler instead of returning an error the client could retry on
another replica.

Add the same v.nm == nil check the other Volume accessors already use,
including the slow-read inner loop where the lock is released between
iterations and a failed reload can race in.

Fixes #9339.

* match rust nm-nil read behavior; trim comments

seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset
already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound
on the Go side too instead of a generic 500-mapped error so both volume
servers respond identically when v.nm is nil.

* log once when reads hit nil needle map

ErrorNotFound alone hides the real cause: a half-loaded volume just
returns 404s and the operator has nothing to grep for. Add a once-per-
volume Errorf on the nil path, reset on successful load. Mirror the
same in seaweed-volume via nm_or_not_found().

* trim comments

* drop once-flag, log inline on every nil-nm read
2026-05-06 18:23:06 -07:00
Chris Lu
e96190d128 fix(mount): skip pressure-eviction of gappy page chunks (#9330) (#9334)
* fix(mount): skip pressure-eviction of gappy page chunks (#9330)

A page chunk whose written-interval list has an internal hole was being
sealed under buffer-pressure eviction, then SaveContent would emit one
volume chunk per maximal adjacent run with no chunk covering the hole;
reads then silently zero-fill the gap (filer/stream.go:177-186). On a
sequential cp through FUSE, that bakes in-flight 4 KiB writes into split
volume chunks and leaves chunk-sized blocks of zeros on the destination.

Filter the pressure-driven sealers (SaveDataAt's over-limit path,
EvictOneWritableChunk, ProactiveFlush) to only seal chunks whose written
intervals form one unbroken run. The flush-on-close path (FlushAll) is
unchanged: at close every gap is by definition a sparse-file write that
the app legitimately never made.

* fix(mount): also gate IsContiguouslyWritten on leading zero-offset

Tighten IsContiguouslyWritten to also reject empty lists and lists whose
first interval does not start at offset 0. The internal-gap and
leading-gap cases are symmetric for pressure-driven sealing: both put
in-flight FUSE writeback for the missing range at risk of being baked
into split volume chunks. The flush-on-close path is still unfiltered
(sparse writes are sealed legitimately at FlushAll).

Also align EvictOneWritableChunk's bestBytes initialization with
SaveDataAt (start at 0) so an empty chunk is never picked, matching
the new semantic.

Addresses gemini-code-assist review on PR #9334.

* fix(mount): preserve cap-pressure liveness in EvictOneWritableChunk

The previous version of this fix had EvictOneWritableChunk return false
whenever every dirty chunk was gappy. That broke the accountant's
Reserve loop: cond.Wait only wakes on Release, Release only fires on
upload completion, and refusing to seal anything means no upload starts
— the writer hangs at the -writeBufferSizeMB cap forever.

Two-pass selection: prefer the fullest gap-free chunk (issue #9330: this
is what protects sequential cp from racing FUSE writeback), fall back to
the oldest non-empty writer when nothing is gap-free. Oldest-first
maximizes the chance that FUSE writeback for the gap range has already
settled. The actual sealing path is unchanged — SaveContent still emits
one volume chunk per maximal adjacent run; pages that arrive after the
seal land in a fresh MemChunk for the same logicChunkIndex and are
sealed in turn, so coverage is reconstructed at read time by
readResolvedChunks.

Sequential cp at default settings always hits the strict pass (writes
arrive contiguous-from-0 within their logicChunkIndex), so the bug-fix
behavior is preserved; the fallback only runs under genuinely sparse
workloads or under FUSE writeback so backed up that no chunk has
settled, where forced progress is preferable to a hung mount.

* test(mount): pin ProactiveFlush gap-skip behavior (#9330)

Sibling regression test for the ProactiveFlush guard added in this
series: same 3-chunk setup as TestEvictOneWritableChunk_SkipsGappyChunks
(internal gap, leading gap, contiguous). Verifies ProactiveFlush picks
the contiguous chunk when staleness criteria are otherwise satisfied,
returns false when only gappy chunks remain (no liveness fallback like
EvictOneWritableChunk has — failing here is just a missed optimization),
and that filling the holes lets the chunks auto-seal via maybeMoveToSealed.

* style(mount): trim verbose comments on #9330 fix
2026-05-06 15:26:56 -07:00
Chris Lu
7b0b64db65 fix(admin/view): wrap plugin history URL with basePath (#9341)
Plugin tabs/sub-tabs use history.pushState/replaceState to keep the
URL bar in sync with the active view, but updateURL fed it the raw
output of buildPluginURL ("/plugin/lanes/<lane>/..."). Under a
urlPrefix deployment that strips the prefix, so reloading the page
hit /plugin/... directly and 404'd at the proxy.

Wrap with basePath() so the rewritten URL keeps the deployment
prefix.

Reported at #9240.
2026-05-06 15:25:06 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
12f283357f fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333)
* fix(iam): scope IAM-managed OIDC provider lookup by role account

Two account-scoped OIDC records sharing an issuer were collapsed into a
single map slot keyed only by the URL. The last-write-wins entry then
served every AssumeRoleWithWebIdentity, so a token destined for account
B's role could be validated by account A's record (its clientIDs and
thumbprints), defeating the per-account isolation the records exist for.

The role-account check in enforceProviderAccountScope still rejected
the cross-account assumption, but only after the wrong record's
audience and TLS pin had already accepted the token.

Refresh now keys IAM-managed records as (issuer, account), and
validation parses the requested role's account up front and matches
the record under that issuer in this order: exact account, global
(account-less), static-config fallback. An unknown account hint
deliberately skips account-scoped entries — picking one arbitrarily is
the bug this commit fixes — and falls through to global or static.

* fix(iam): route public AssumeRoleWithWebIdentity through IAMManager

handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity
directly, bypassing the IAMManager wrapper. The wrapper is where
enforceProviderAccountScope rejects cross-account assumption attempts
and capDurationByRole clamps to the role's MaxSessionDuration; both
silently became no-ops for any AWS-SDK caller hitting the public
endpoint.

Dispatch through the IAMManager (via the existing IAMManagerProvider
interface that other handlers in this file already use) when one is
wired. Embedded test setups without an IAM integration fall back to
the bare STS service unchanged.

* fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config

initOIDCProviderStore mirrored only URL and ClientIDs. Once
RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on
boot once the metadata-subscribe loop kicked in),
buildOIDCProviderFromRecord rebuilt the runtime provider from this
truncated record. Because IAM-managed entries take precedence over the
static-config map, the rebuild silently shadowed the bootstrap with a
weaker provider:

- Thumbprints: dropped, so TLS-pinned issuers fell back to the system
  trust store.
- AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped
  reaching the session.
- PolicyClaim: dropped, so claim-based policy mode stopped triggering.

Pull all three from the provider's static Config map at mirror time so
the stored record round-trips to a runtime provider equivalent to the
one the static config produced directly.

* fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler

Phase 3b advertises that RoleArn MAY be omitted in claim-based policy
mode — the STS service then derives the assumed-role ARN from the
configured policy claim. The HTTP handler still rejected empty RoleArn
up front with MissingParameter, so SDK callers using the documented
omitted-role flow never reached the STS layer.

Drop the pre-check; STS still validates that claim-based mode is
configured and that the IDP emits policies, returning a precise error
when either is missing. The existing error mapping below this point
surfaces those as InvalidParameterValue, matching what an AWS SDK
expects.

* test(iam): update missing-RoleArn STS integration test for the new contract

The previous commit drops the HTTP-layer RoleArn pre-check so claim-based
mode can derive the ARN from a JWT claim. The integration test still
asserted MissingParameter for the missing-RoleArn case, which now
reaches the STS layer and surfaces a JWT-parse error instead. Update
the assertion to match: missing RoleArn alone must no longer surface
as MissingParameter, but a bogus JWT must still be rejected.
2026-05-05 19:14:44 -07:00
Chris Lu
9af1b212d3 feat(iam): OIDC provider audit trail (Phase 3e) (#9325)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.

* feat(iam): opt-in session revocation via JTI blocklist

Add SessionRevocationStore (memory + filer implementations) and wire
it into the IAMManager.IsActionAllowed path so a revoked session is
rejected on the next signed request. Session JWTs now embed the
session id as the JTI claim, giving the blocklist a stable key
without requiring a second secret.

Operators who don't configure a store keep the existing fully-stateless
behaviour: every session stays valid until natural expiry. Operators
who do configure one accept one filer lookup per signed request in
exchange for being able to invalidate compromised tokens before
expiry. Revocation entries carry the original session expiry so the
blocklist self-trims via PurgeRevokedSessions.

* fix(iam): hash JTI filenames + paginate Purge with proper EOF handling

Three reviewer-flagged issues on the filer-backed revocation store:

1. Path traversal (security-medium): RevokeSession is exported and takes
   an arbitrary string. Using the JTI verbatim as a filename meant a
   caller could pass "../../etc/passwd" to write outside the basePath.
   SHA-1 hash the JTI to a fixed-width hex name; lookups still find the
   entry because Revoke and IsRevoked share the same hash function.

2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed
   the outer err and the loop just broke on any failure, so a mid-stream
   gRPC error returned (count, nil) and the caller had no idea the
   purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream
   and propagate everything else.

3. Purge had a hardcoded 10000-entry cap. Stream-paginate via
   StartFromFileName so the operator-cron can clean a backlog larger
   than that without losing rows.

* feat(iam): OIDC provider audit trail

Emit one structured event per IAM-managed OIDC provider lifecycle
mutation (Create, Delete, Add/Remove ClientID, UpdateThumbprints,
Tag, Untag). Three sinks ship in-tree:

- GlogAuditSink — default; events become structured log lines.
- MemoryAuditSink — in-process buffer for tests / inspection.
- FilerAuditSink — durable, one file per event under
  /etc/iam/audit/oidc-providers (operator-overridable basePath).

Audit emission is best-effort: a failing sink never blocks an IAM
mutation that has already succeeded. Use-events (per token validation)
are intentionally not emitted yet — too hot for unconditional sinks.

* fix(iam): collision-free audit filenames + correct file mtime

Two issues gemini flagged on FilerAuditSink.Emit:

1. Filename was %d-%s.json (UnixNano + Type). Two mutations at the
   same nano against different ARNs (think batch script touching
   several providers) collided on the same path and the second
   CreateEntry failed silently. Append a short ARN hash to make the
   name unique per-event without leaking the ARN.

2. File Mtime/Crtime were set to time.Now() at write time. For audit
   integrity those should reflect the event's occurrence time so
   filer-level "ls -lt" output matches the contents.

Addresses three medium-priority gemini reviews on PR #9325.
2026-05-05 13:37:24 -07:00
Chris Lu
9d6a699b94 feat(iam): opt-in session revocation via JTI blocklist (Phase 3d) (#9324)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.

* feat(iam): opt-in session revocation via JTI blocklist

Add SessionRevocationStore (memory + filer implementations) and wire
it into the IAMManager.IsActionAllowed path so a revoked session is
rejected on the next signed request. Session JWTs now embed the
session id as the JTI claim, giving the blocklist a stable key
without requiring a second secret.

Operators who don't configure a store keep the existing fully-stateless
behaviour: every session stays valid until natural expiry. Operators
who do configure one accept one filer lookup per signed request in
exchange for being able to invalidate compromised tokens before
expiry. Revocation entries carry the original session expiry so the
blocklist self-trims via PurgeRevokedSessions.

* fix(iam): hash JTI filenames + paginate Purge with proper EOF handling

Three reviewer-flagged issues on the filer-backed revocation store:

1. Path traversal (security-medium): RevokeSession is exported and takes
   an arbitrary string. Using the JTI verbatim as a filename meant a
   caller could pass "../../etc/passwd" to write outside the basePath.
   SHA-1 hash the JTI to a fixed-width hex name; lookups still find the
   entry because Revoke and IsRevoked share the same hash function.

2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed
   the outer err and the loop just broke on any failure, so a mid-stream
   gRPC error returned (count, nil) and the caller had no idea the
   purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream
   and propagate everything else.

3. Purge had a hardcoded 10000-entry cap. Stream-paginate via
   StartFromFileName so the operator-cron can clean a backlog larger
   than that without losing rows.
2026-05-05 13:25:22 -07:00
Chris Lu
6483583491 feat(iam): account-scoped OIDC providers (Phase 3c) (#9323)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.
2026-05-05 13:06:53 -07:00