Commit Graph

13727 Commits

Author SHA1 Message Date
dependabot[bot]
9cb103cd35 build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364) 2026-05-08 05:59:26 -07:00
Chris Lu
b7928637a0 refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml (#9360)
* refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml

The structs S3 PutBucketLifecycleConfiguration parses and the canonical
conversion to s3lifecycle.Rule lived in package s3api, which transitively
imports weed/server (which imports weed/shell). Any caller outside
weed/s3api — the shell, the future lifecycle worker — that wanted to
parse a bucket's lifecycle XML hit an import cycle.

Moves:
  weed/s3api/s3api_policy.go              -> lifecycle_xml/types.go
  weed/s3api/s3api_lifecycle_canonical.go -> lifecycle_xml/canonical.go
  s3api_lifecycle_canonical_test.go       -> lifecycle_xml/canonical_test.go
  s3api_policy_test.go                    -> lifecycle_xml/round_trip_test.go

Renames the public RuleStatus type (was unexported ruleStatus) and adds
small accessor methods (Set/Val/AndSet/TagSet) for fields the s3api
handler needs to read across the package boundary. Adds NewPrefix and
NewExpirationDays constructors so the GET handler can build response
rules without poking at unexported fields. Adds a Tag struct local to
the package so it has zero internal seaweed deps. Adds a one-shot
ParseCanonical(xml []byte) helper for non-server callers.

s3api_policy.go was misnamed — its content is lifecycle XML, not S3
bucket policy. The new package name reflects the actual scope.

* test(s3api/lifecycle_xml): exercise public API in tests

- canonical_test.go's parseLifecycle helper went through xml.Unmarshal
  directly; route it through the package's exported Parse so tests
  validate the public entrypoint.
- round_trip_test.go asserted internal flags (rule.Filter.tagSet,
  rule.Filter.andSet, Transition.set, NoncurrentVersionTransition.set);
  switch to TagSet(), AndSet(), Set() — exercises the public contract
  that downstream callers (s3api handler, future shell command) rely on.
2026-05-07 18:54:06 -07:00
Chris Lu
c567da7164 feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359)
Phase 2 added the LifecycleDelete handler on S3ApiServer but never
registered it on a running gRPC server, so workers had no endpoint to
dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on
S3ApiServer and register it on the s3 command's grpc server alongside
SeaweedS3IamCacheServer.
2026-05-07 18:19:42 -07:00
Chris Lu
35e3fe89bc feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358)
* feat(s3/lifecycle): filer-backed cursor Persister

FilerPersister persists per-shard cursor maps as JSON to
/etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler.
One file per shard keeps Save atomic — the filer writes the entry
in a single mutation, so a crash mid-write doesn't leak partial
state. Pipeline.Run loads on start; the periodic checkpoint and
graceful-shutdown save go through this implementation.

A small FilerStore interface wraps the SeaweedFilerClient surface
the persister needs, so tests inject an in-memory fake instead of
mocking the full gRPC client.

* refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block

A frozen cursor doesn't advance, so the durable cursor (FilerPersister)
encodes the blocked state on its own. On worker restart the reader
re-encounters the poison event at MinTsNs, the dispatcher walks the
same retry budget to BLOCKED, and the cursor freezes at the same
EventTs. Other in-flight events between freeze tsNs and prior cursor
positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the
underlying objects were already deleted on the prior pass.

Removed:
  - BlockerStore interface + InMemoryBlockerStore + BlockerRecord
  - Dispatcher.Blockers + Dispatcher.ReplayBlockers
  - the BlockerStore.Put call in handleBlocked
  - Pipeline.Blockers field + the ReplayBlockers call on startup

Added a TestDispatchRestartReFreezesNaturally that pins the
self-recovery property: a fresh Dispatcher with a fresh Cursor, fed
the same poison event, reaches the same frozen state at the same
EventTs without any durable blocker store.

Operator visibility: a cursor whose MinTsNs hasn't advanced is the
signal — surfaced via the durable cursor file.

* refactor(filer): SaveInsideFiler accepts ctx

ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background()
internally and silently dropped the caller's ctx. Symmetric API now;
cancellation/deadlines propagate through LookupEntry / CreateEntry /
UpdateEntry. Mechanical update of all callers — most pass
context.Background() since the existing call sites have no ctx in scope.

* fix(s3/lifecycle): deterministic order in cursor save

Iterating Go maps yields random order, so json.Encode produced a different
byte sequence on each save even when the state hadn't changed. Sort
entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk
file diffs cleanly. New test pins byte-identical output across two saves
of the same map.

* fix(s3/lifecycle): log reason when freezing cursor in handleBlocked

handleBlocked dropped the reason via _ = reason with a comment claiming
the caller logged it; none of the three callers do. A frozen cursor is
the only surface where the operator finds out something stuck, so the
reason has to land somewhere. glog.Warningf with shard, key, eventTs,
and the original reason — same shape the rest of the package uses.
2026-05-07 17:45:04 -07:00
Chris Lu
ec83a87d68 perf(s3/lifecycle): defer pool Put on ShardID hasher
defer guarantees the hasher returns to the pool even if h.Write or
h.Sum panic, preventing pool leak under unexpected failure modes.
2026-05-07 17:08:50 -07:00
Chris Lu
3a192c6c57 fix(s3/lifecycle): address Phase 3 post-merge review (#9354 #9355 #9356) (#9357)
* fix(s3/lifecycle): reader handles bare /buckets parent and pre-normalizes prefix

extractBucketKey accepted /buckets/ but rejected /buckets (no trailing
slash); some delete events emit the bare form, so bucket-root events
were silently dropped. Pre-normalize BucketsPath once on Run instead
of recomputing per event.

* perf(s3/lifecycle): pool sha256 hashers in ShardID

ShardID runs on every meta-log event before the shard filter; a fresh
sha256.New per call produces measurable allocator pressure under load.
sync.Pool reuses hashers across calls.

* fix(s3/lifecycle): router skips hard deletes and missing-attribute events

A hard delete carries no schedule-relevant state — Expiration would hit
NOOP_RESOLVED at dispatch and ExpiredObjectDeleteMarker fires from a
Create on the latest version. Skip rather than burn a schedule slot.

Missing Attributes leaves ModTime at year 0001, which makes
ExpirationDays fire immediately at dispatch. Skip the event instead.

Drop the unused 'versioned' parameter from buildObjectInfo; the
dispatcher's identity-CAS handles version drift in Phase 5.

* fix(s3/lifecycle): EntryIdentity.MtimeNs holds true nanoseconds

Both computeEntryIdentity (server) and buildIdentity (router) wrote
entry.Attributes.Mtime (seconds) into a field named MtimeNs. The CAS
worked because both sides agreed, but the encoding contradicted the
field name and would break if either side later started using true
nanoseconds. Combine Mtime*1e9 + the FuseAttributes.MtimeNs nanosecond
component on both sides; the test was updated to match.

* fix(s3/lifecycle): dispatcher distinguishes ctx cancel from transport errors

A canceled or deadline-exceeded RPC is shutdown, not a transport
failure: re-queue the Match at its original DueTime with no retry-budget
burn so a quick restart can't escalate it to BLOCKED.

* fix(s3/lifecycle): reader fallback prefix normalization mirrors Run

The fallback path that builds prefix from r.BucketsPath when
bucketsPathSlash is empty (test-only entry into extractBucketKey) was
appending an unconditional '/', producing '//' if BucketsPath already
ended with one. Use the same normalization Run does.

* fix(s3/lifecycle): ObjectInfo.ModTime carries the nanosecond component

ModTime dropped FuseAttributes.MtimeNs, leaving ExpirationDays one
nanosecond off relative to EntryIdentity.MtimeNs. Pass both to
time.Unix so the precision matches the CAS witness.
2026-05-07 16:54:24 -07:00
Chris Lu
5c991f38f5 feat(s3/lifecycle): dispatcher + per-shard pipeline (Phase 3 PR-D) (#9356)
feat(s3/lifecycle): dispatcher + blocker store + per-shard pipeline

Dispatcher consumes due Matches from the schedule, calls LifecycleDelete,
and routes outcomes:
  DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK -> Cursor.Advance
  RETRY_LATER (within budget)                 -> re-schedule with backoff
  RETRY_LATER (budget exhausted) / BLOCKED    -> BlockerStore.Put + Freeze

BlockerStore is a small interface with InMemoryBlockerStore for tests;
the filer-backed impl follows when the worker task registration lands.

Pipeline composes Reader + Router + Dispatcher into a single Run loop
keyed by shard. Cursor is restored on start, blockers are replayed as
freezes, checkpoints write at a configurable cadence, and a final save
fires on shutdown. The meta-log itself is the durable buffer for in-flight
schedule entries — restart re-derives them from the cursor's MinTsNs.
2026-05-07 15:44:09 -07:00
Chris Lu
8425c42858 feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355)
feat(s3/lifecycle): event router + DueTime schedule

Router consumes per-shard reader events, looks up matching ActionKeys via
the engine's BucketActionKeys index, and emits Matches with DueTime =
event_time + action.Delay. Evaluation runs at DueTime so the age gate
passes for fresh events; the dispatcher's identity-CAS catches drift.

Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles
the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor
added to engine.Snapshot.
2026-05-07 15:43:27 -07:00
Chris Lu
0f6c6b0524 feat(s3/lifecycle): shard-aware meta-log reader (Phase 3 PR-B) (#9354)
feat(s3/lifecycle): shard-aware meta-log reader

- ShardCount=16; ShardID(bucket,key)=top-4-bits of sha256(bucket||/||key)
- Reader subscribes via SubscribeMetadata starting at Cursor.MinTsNs(),
  filters events by shard, emits to caller-owned Events channel
- Cursor: per-(shard, ActionKey) position with monotonic Advance, Freeze
  for blocked actions, MinTsNs for subscription resume
- Persister interface with InMemoryPersister for tests; filer-backed
  impl lands with the worker integration
2026-05-07 15:42:37 -07:00
Chris Lu
3a76c0b027 feat(worker/proto): per-shard READ — add S3LifecycleParams.shard_id (#9353)
READ moves from a cluster-singleton to one task per (bucket, key-prefix-hash)
shard. Cluster has 16 shards; workers receive one READ task per owned shard.
Required for READ; ignored for BOOTSTRAP and DRAIN.
2026-05-07 15:29:12 -07:00
Chris Lu
4f79d8e358 feat(s3/lifecycle): bucket-level bootstrap walker (#9350)
* feat(worker): add TaskTypeS3Lifecycle constant

Single job type for the lifecycle worker; the S3LifecycleParams.Subtype
field (READ / BOOTSTRAP / DRAIN) dispatches inside the handler. The
"s3_lifecycle" string is already wired to LaneLifecycle in
admin/plugin/scheduler_lane.go so adding the constant doesn't change
runtime behavior — it lets future commits reference the type name
without sprinkling string literals.

* feat(s3/lifecycle): bucket-level bootstrap walker

Iterates entries in a bucket, evaluates every active ActionKey in the
engine snapshot against each entry, and dispatches inline-delete for
currently-due actions. Date-kind actions and pending_bootstrap actions
are skipped — the former are handled by their own SCAN_AT_DATE
bootstrap, the latter aren't IsActive() yet.

Walker is callback-driven so callers supply the listing source
(real filer_pb.SeaweedList or test fake) and the dispatcher (real
LifecycleDelete client or test fake). This keeps the walker free of
filer_pb dependencies and makes the per-action evaluation flow
unit-testable in isolation.

Checkpoint state (LastScannedPath, Completed) is returned to the
caller, who is responsible for persisting it under
/etc/s3/lifecycle/<bucket>/_bootstrap. Walk() honours opts.Resume so
a kill-resumed task picks up where the previous walker stopped.

Tests cover: prefix-mismatched skip, not-yet-due skip (reader's job),
date-kind skip, pending_bootstrap skip, multi-action rule (one rule
with three actions dispatches three times — the regression that
per-action keying fixes), dispatch error halts at last-successful
checkpoint, Resume skips entries up to and including the resume
path.

* test(s3/lifecycle): walker test uses bucket-scoped ActionKey

Mechanical follow-up to the bucket-scoped ActionKey on
lifecycle-engine: the bootstrap walker tests construct ActionKeys to
seed PriorStates and need the Bucket field to match what
engine.Compile keys against.

* fix(s3/lifecycle): walker quick wins

Two minor cleanups noted on review:

- Drop the redundant Resume re-filter inside the Walk callback.
  ListFunc's contract already promises "skip entries with Path <=
  start"; trusting that contract avoids divergence if the filter
  logic ever changes on one side and not the other.

- Hoist the ObjectInfo allocation out of the per-action loop in
  walkEntry. Multi-action rules previously allocated one ObjectInfo
  per (entry, kind) pair; now it's one per entry, reused across all
  matching kinds.

* fix(s3/lifecycle): walker Entry.NoncurrentIndex tracks ObjectInfo's *int

ObjectInfo.NoncurrentIndex is now *int so unset is unambiguous;
mirror that on bootstrap.Entry so the per-entry construction stays
type-clean. Phase 5 (versioned-bucket walks) is the first caller
that will populate the field.

* refactor(s3/lifecycle): trim narration from bootstrap walker

Drop the inline step-by-step on Walk and the multi-paragraph package
preamble; the function names already say it. Keep one-liner WHYs at
the SCAN_AT_DATE skip and the once-per-entry ObjectInfo build.

* fix(s3/lifecycle): walker skips directories and ModeDisabled actions

Two safety findings from review:

1. SeaweedFS directory entries can appear in the listing alongside
   objects; without an IsDirectory check the walker would treat a dir
   like any other entry and could dispatch a delete against it. Add
   IsDirectory to bootstrap.Entry and short-circuit it before
   walkEntry.

2. ModeDisabled is set by the operator (e.g. shell pause) independent
   of the XML rule's Status field. EvaluateAction gates on Status and
   would still fire for an operator-disabled action whose XML status
   is "Enabled". Skip ModeDisabled explicitly in walkEntry alongside
   the existing SCAN_AT_DATE skip.

Two regression tests pin both cases.

* perf(s3/lifecycle): reuse ObjectInfo across walker entries

Walker allocated one ObjectInfo struct per entry. For buckets with
millions of objects that's measurable GC pressure. Hoist the
allocation out of the per-entry callback (one per Walk) and reuse
via field assignment in walkEntry.

EvaluateAction reads ObjectInfo synchronously and doesn't retain a
reference, so the reuse is safe — the next iteration's overwrite
can't corrupt an in-flight evaluation.

* refactor(s3/lifecycle): trim narration on walker

Drop the multi-line Entry / ObjectInfo-reuse / SCAN_AT_DATE+DISABLED
explanations. The walker's structure is small enough that the
condition itself reads as the documentation.
2026-05-07 15:04:51 -07:00
Chris Lu
5ab3860005 feat(s3/lifecycle): LifecycleDelete RPC server (#9349)
* feat(s3/lifecycle): SeaweedS3LifecycleInternal.LifecycleDelete RPC

Adds the worker-to-S3 RPC the lifecycle worker calls to execute one
(rule, action) verdict against one entry. The S3 server is the only
component allowed to mutate filer state, so it gets the final word:
re-fetch, identity CAS, object-lock check, dispatch.

LifecycleDeleteRequest carries the routing tuple (bucket, object_path,
version_id, rule_hash, action_kind), the per-stream context echoed
into BlockerRecord on FATAL outcomes, and an EntryIdentity CAS witness
that lets the server detect mid-flight drift (mtime, size, head fid,
sorted-Extended hash).

LifecycleDeleteOutcome covers all five worker-actionable verdicts:
DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK / RETRY_LATER / BLOCKED.
Cursor-advance / pending-mutation rules per outcome are documented
inline.

Makefile updated to emit the gRPC stubs alongside the message types.

* feat(s3/lifecycle): LifecycleDelete server handler

Server-side handler for the worker-to-S3 RPC. Five-step flow:

1. Re-fetch live entry (NOT_FOUND -> NOOP_RESOLVED).
2. CAS on EntryIdentity (mtime / size / head fid / sorted-Extended
   sha256). Mismatch -> NOOP_RESOLVED with reason STALE_IDENTITY.
3. enforceObjectLockProtections with governanceBypassAllowed=false.
   Lifecycle never bypasses legal-hold or compliance retention; the
   safety scan re-attempts after the hold lifts. Lock refusal ->
   SKIPPED_OBJECT_LOCK (logged + counted, cursor advances).
4. Dispatch by action_kind:
   - EXPIRATION_DAYS / EXPIRATION_DATE: createDeleteMarker on
     versioned buckets, deleteUnversionedObjectWithClient otherwise.
   - NONCURRENT_DAYS / NEWER_NONCURRENT / EXPIRED_DELETE_MARKER:
     deleteSpecificObjectVersion (the marker is just a version).
   - ABORT_MPU: stub returning RETRY_LATER pending Phase 5 wiring.
5. Helper failures the server can't classify as transient -> BLOCKED
   with reason "FATAL_EVENT_ERROR: <detail>"; the worker writes a
   durable BlockerRecord and pauses the failing stream. Filer fetch
   transport errors -> RETRY_LATER (sustained transients eventually
   promote to BLOCKED via the worker's retry budget).

hashExtended uses length-prefixed encoding so a forged Extended
payload (e.g. one tag value containing the byte sequence of two real
tags) can't collide with a real two-tag map. Regression test pins
this.

Tests cover identity-comparison fields, hashExtended order/delimiter
stability, empty-request rejection. Full filer-integration tests come
in Layer 2.

* fix(s3/lifecycle): use stdlib bytes.Equal for ExtendedHash compare

Replaces a hand-rolled byte slice comparator with bytes.Equal —
idiomatic, slightly faster (the stdlib version is intrinsified on
amd64/arm64), and one fewer test surface to maintain.

* refactor(s3api): trim narration from lifecycle RPC + canonicalizer

Drop step-by-step doc-block narrations on LifecycleDelete and the
helpers that reproduced what the code already says. Keep WHY one-
liners at non-obvious spots: the http.Request=nil safety claim,
why hashExtended is length-prefixed, the safety-scan revisit
contract for SKIPPED_OBJECT_LOCK, the TODO marker for ABORT_MPU.

* refactor(s3/lifecycle): retryLater helper, drop inline literals

Adds retryLater() alongside done/noopResolved/blocked so the four
LifecycleDeleteOutcome constructions look the same. Replaces the two
inline RETRY_LATER literal returns (live-fetch transport error and
the ABORT_MPU stub) with the helper. No behavior change.

* fix(s3/lifecycle): default filer-error classification to RETRY_LATER

Versioning lookup, createDeleteMarker, deleteUnversionedObjectWithClient,
and deleteSpecificObjectVersion all do filer round-trips. Most failures
are transient (filer unavailable, slow, network blip), not deterministic
per-event errors. Returning BLOCKED for them was too aggressive: BLOCKED
halts the stream until an operator intervenes, which would surface
on every transient filer hiccup.

Default these paths to RETRY_LATER. The worker's retry budget already
promotes sustained transients to BLOCKED after a configured threshold,
which is the right place for that escalation. BLOCKED stays for the
truly deterministic error: the request-shape check (missing version_id
on noncurrent delete) and the unknown-action-kind dispatch fallback.

* fix(s3/lifecycle): honor versioning Suspended on current-version expiration

Treating Suspended-versioning buckets like never-versioned ones was
wrong: per AWS S3, current-version expiration on Suspended must
remove the existing null version and insert a new delete marker (the
same behavior a user DELETE produces). The previous code branched
only on isVersioningEnabled() — which returns false for Suspended —
and dropped through to the actual-delete path, losing the version
history that Suspended buckets are still expected to keep.

Switch to getVersioningState() and dispatch on three branches:

  Enabled    -> createDeleteMarker (current becomes non-current)
  Suspended  -> deleteSpecificObjectVersion("null"), createDeleteMarker
  Off / never configured -> deleteUnversionedObjectWithClient

The Suspended path's null-version deletion tolerates NotFound (the
null version may not exist when this is the first delete after a
versioning toggle); other errors classify as RETRY_LATER like the
rest of the dispatch.

* refactor(s3/lifecycle): trim narration on LifecycleDelete

Drop the 6-line versioning-state docblock and the inline
"transient -> RETRY_LATER" rationales; keep one-liners.

* fix(s3/lifecycle): bucket-not-found is NOOP; include ErrObjectNotFound

Three follow-ups from review (the rest were already addressed by
prior commits):

- Versioning lookup now distinguishes filer_pb.ErrNotFound
  (BUCKET_NOT_FOUND -> NOOP_RESOLVED) from genuinely transient
  failures (RETRY_LATER). A bucket deleted between live-fetch and
  versioning-lookup is benign, not transient.
- deleteUnversionedObjectWithClient and deleteSpecificObjectVersion
  now also recognise ErrObjectNotFound as a resolved-NOOP, matching
  the live-fetch step's classification.
2026-05-07 15:03:33 -07:00
Chris Lu
7f2b20d577 feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348)
* feat(s3/lifecycle): XML lifecycle config to canonical Rule

LifecycleToCanonical takes a parsed *Lifecycle and returns
[]*s3lifecycle.Rule, the flat shape the engine compiles against.
Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags +
size filters) flatten into the canonical Rule's individual fields;
single <Tag> filter populates FilterTags with one entry; <Prefix>
filter takes precedence over the rule's top-level <Prefix>.

Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on
the same XML <Rule>) populate every action field they declare.
RuleActionKinds expands the canonical rule into its compiled actions
downstream.

* feat(s3/lifecycle): engine snapshot skeleton + ActionKey type

Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's
primary identity, and adds the engine package's Snapshot type.
Snapshot is immutable after Compile (atomic-swapped on rebuild) and
holds the ActionKey-keyed routing indexes:

  - originalDelayGroups: map[time.Duration][]ActionKey
  - predicateActions:    []ActionKey
  - dateActions:         map[ActionKey]time.Time
  - actions:             map[ActionKey]*CompiledAction

CompiledAction.engineState is an atomic.Uint32 so MarkActive (called
after the durable bootstrap_complete + mode write commits) is visible
to in-flight reader passes without a recompile. The reader filters on
IsActive() before dispatching, so stale-snapshot dispatches are
prevented.

No callers yet; downstream commits add Compile, decideMode, and the
Match functions.

* feat(s3/lifecycle): decideMode + retention gate

decideMode picks the scheduling mode for one (rule, kind) compiled
action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE;
reader-driven kind whose eventLogHorizon + bootstrapLookbackMin
exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The
gate runs per (rule, kind), so a 90d ExpirationDays sibling can
degrade to scan_only while its 7d AbortMPU sibling stays active.

MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS
default (Phase 0 verified that meta-log files are written without
TtlSec by default), so the gate doesn't trip until an operator opts
in to volume-TTL pruning of /topics/.system/log/.

RuleMode is a Go-level enum here, separate from the wire-form
LifecycleState.RuleMode in the proto package; the worker maps between
them when reading/writing the durable state file.

* feat(s3/lifecycle): Compile builds the engine snapshot per-action

Compile produces a fresh Snapshot from per-bucket canonical rules.
Each input rule expands into N CompiledActions via RuleActionKinds;
mode comes from decideMode; activation requires both
bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN.

Routing indexes are populated by mode:
- SCAN_AT_DATE: always indexed in dateActions (detector schedules at
  rule.date regardless of bootstrap status; the action runs once on
  the date and is then done).
- EVENT_DRIVEN + active: indexed in originalDelayGroups (and in
  predicateActions when the rule has tag/size filters).
- SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan
  tick or operator action handle these.

snapshot_id is monotonic per process; pending writes stamp it. The
new snapshot replaces the engine's atomic pointer; in-flight reader
passes continue against their loaded snapshot.

Tests cover: single-action rule, multi-action expansion (one rule ->
three CompiledActions with three distinct delay groups), pending
bootstrap exclusion from indexes, retention gate, sibling actions
degrading independently under partial retention, ExpirationDate path,
disabled rule, MarkActive flipping IsActive(), Compile producing
monotonic snapshot ids.

* feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath

The reader feeds events through the engine's match functions to find
the active ActionKeys whose filter applies. The minimal Event shape
the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker,
IsMPUInit) keeps engine free of filer_pb dependencies; the reader
extracts these fields from the persisted *filer_pb.LogEntry payload
in Phase 3.

- MatchOriginalWrite: per-delay-group sweep entry. Filters on shape =
  EventShapeOriginalWrite, prefix, tag, size, then per-kind shape
  gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on
  IsLatest+IsDeleteMarker).
- MatchPredicateChange: single near-now sweep. Returns only the
  predicate-sensitive subset of active ActionKeys.
- MatchPath: bucket-level walker entry. Returns every active action
  whose filter matches; bootstrap iterates these per object and calls
  EvaluateAction per kind.

All filter on a.IsActive() at routing time so MarkActive flips become
visible without recompile.

* fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile

Three findings on the engine PR addressed:

1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind}
   only. Two buckets with rules whose XML is identical produce the same
   RuleHash; the second bucket's Compile would overwrite the first
   bucket's CompiledAction in snap.actions. Add Bucket to ActionKey
   so the engine's identity matches the on-disk path layout
   /etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression
   test pins it.

2. Major (immutability leak): OriginalDelayGroups, PredicateActions,
   DateActions returned the snapshot's internal maps/slices by
   reference, letting an external caller mutate routing state and
   break the documented immutability contract. Return defensive
   copies.

3. Minor (redundant condition): mode==EVENT_DRIVEN already implies
   kind != EXPIRATION_DATE because decideMode routes the date kind
   to SCAN_AT_DATE. Drop the redundant check.

Tests updated to construct ActionKey with the new Bucket field.

* fix(s3/lifecycle): drop size filters from rulePredicateSensitive

An object's size is immutable once written: any content change is a
fresh write that flows through the original-write stream, not the
predicate-change one. Tagging rules really can flip post-PUT
(operator adds/removes a tag without rewriting), so they belong; size
filters do not.

Including size filters here was adding rules to predicateActions for
no purpose — every predicate-change sweep would waste cycles
re-evaluating size predicates that physically can't have changed.

* perf(s3/lifecycle): pre-sort AllActions at Compile time

Snapshot is immutable after Compile (engineState bit-flips don't
change membership), so the (bucket, rule_hash, action_kind) ordering
is stable for the snapshot's lifetime. Build the sorted slice once
and serve every AllActions() call from it; drop the per-call
sort.Slice. The bootstrap walker is the primary caller and may
iterate this on every task entry.

* docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity

Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan>
explicitly excludes 0-byte objects, but with the int64 zero value as
the unset sentinel we can't distinguish that from omitted-and-default.
Document the limitation inline so a future deployment that needs the
distinction can switch to *int64 (or a paired set-bool) and update
the matchers / RuleHash accordingly. Not fixing now: the explicit-zero
configuration is unusual, the canonical Rule shape mirrors the same
zero-as-unset convention as s3api.Filter, and a structural fix
touches every filter-using site (evaluator, due_at, match, RuleHash).

* fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex *int

The previous int field had a zero-value collision: 0 is both "newest
non-current version" (a valid index) and "uninitialised by ObjectInfo{}
literal." A caller who built &ObjectInfo{IsLatest: false} without
explicitly setting NoncurrentIndex would have it implicitly read as
"newest non-current," and the count-based NewerNoncurrent retention
would use that bogus 0 to decide eligibility.

Switch to *int so nil is explicitly "not a non-current version /
index not yet computed." The evaluator's NoncurrentDays and
NewerNoncurrent paths conservatively return ActionNone when the
index is nil — the safety scan will revisit once the index is
supplied. This removes a class of latent footguns in test setup and
in any future code path that constructs ObjectInfo without a
versioning-aware builder.

idx() helper added in tests to keep the call sites a one-liner.

* refactor(s3/lifecycle): trim narration from engine + helpers

Drop "what" comments where well-named identifiers already say it
(IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph
"why" docs to one-liners where the design rationale is already in
the design doc. Keep WHY comments only at non-obvious load-bearing
spots: the routing-index activation predicate, the *int rationale on
NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay
horizon rule.

Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go,
min_trigger_age.go, event_log_horizon.go, engine/engine.go,
engine/compile.go, engine/match.go, engine/mode.go.

No behavior change; tests untouched and pass.

* fix(s3/lifecycle): durable PriorState.Mode wins over decideMode

PriorState.Mode was declared but never read; Compile recomputed mode
via decideMode and stored that on every CompiledAction. Effect: an
action durably persisted as SCAN_ONLY (lag fallback or operator
pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the
next engine rebuild as soon as decideMode's XML+retention predicate
said so. Defeats the durability of mode state.

Use prior.Mode when set; fall through to decideMode only for new
actions (no prior at all) and for legacy entries persisted before
Mode existed (zero value). Regression test pins both branches.

* fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key

MarkActive's documented contract was "flip visible without a
recompile," but the routing indexes (originalDelayGroups,
predicateActions) were only populated when active && mode ==
EVENT_DRIVEN at compile time. So a key compiled with
BootstrapComplete=false would never enter the indexes; a later
MarkActive flipped engineState but MatchOriginalWrite /
MatchPredicateChange iterated the indexes and never saw the key.
Only MatchPath (which walks bi.actionKeys) and DateActions worked.

Index every EVENT_DRIVEN key regardless of `active`. The runtime
IsActive() filter inside filterMatching already gates dispatch, so
inactive entries are matched-but-not-fired; flipping MarkActive
makes them routable without recompile, matching the documented
contract.

Tests updated: TestCompile_BootstrapPendingIndexedButInactive
asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable
asserts a MarkActive flip routes the next match.

* test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior

Two regression tests for the *int pointer migration: nil index
combined with NewerNoncurrent (either paired with NoncurrentDays or
standalone) must short-circuit to ActionNone rather than guess at
the version's position in the keep-N window.

* refactor(s3/lifecycle): trim follow-up narration on engine + helpers

Comments accumulated since the last sweep — the durable-Mode rationale,
the MarkActive routability note, the routing-index doc, the
NoncurrentIndex pointer rationale, and the EvaluateAction docblock.
Trimmed each to one or two terse lines; the underlying contracts live
in the design doc.

* docs(s3/lifecycle): note CompileInput one-per-bucket invariant
2026-05-07 15:00:49 -07:00
Chris Lu
b9bf45cb2e fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347)
* fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection

Closes #9345.

-volumeId only filtered which volume .idx files were pulled; the filer-side
BFS still walked from "/", printing every directory under -v and making it
look like the flag was ignored. When all requested volumes share a single
non-empty collection that maps to an existing <bucketsPath>/<collection>
directory, restrict the BFS root to that bucket. Empty-collection volumes
or multi-collection selections fall back to the full walk, since chunks
for those can live anywhere.

* trim comments

* address review: collapse getCollectFilerFilePath; unshadow receiver in loop
2026-05-07 10:04:01 -07:00
Chris Lu
4e10669221 docs(s3/lifecycle): event-driven redesign (#9346)
* docs(s3/lifecycle): event-driven redesign

Replaces the synchronous PUT-handler walk with an event-driven worker
model: meta-log reader subscribed to one filer, client-side heap merge
with per-filer-shard MessagePosition cursors, bucket-level bootstrap
with inline delete, blocked-cursor handling for fatal events, durable
retry budget for sustained-transient promotion, retention mode gate
that downgrades reader-driven rules to scan_only when log retention
falls below the rule's event-log horizon.

* docs(s3/lifecycle): record Phase 0 verified assumptions

ReadPersistedLogBuffer payload carries Extended (event marshaled via
SubscribeMetadataResponse → ToProtoEntry). Meta-log files at
topics/.system/log/<date>/<HH-MM>.<filerId> are written without TtlSec
in filer_notify_append.go; retention is unbounded by default and only
shrinks if an operator sets a filer.conf rule with a TTL on the
SystemLogDir prefix. .versions/ filenames are v_<16h-ts><16h-rand>
with old/new (inverted) format distinguished by threshold
0x4000000000000000; getVersionTimestamp / compareVersionIds give
format-agnostic ordering for successor-version discovery.

* feat(s3/lifecycle): rule evaluator and dueAt helper

Evaluate(rule, info, now) returns the EvalResult by object shape:
IsMPUInit -> AbortMultipartUpload, IsDeleteMarker -> ExpireDeleteMarker
when sole survivor, IsLatest -> DeleteObject (Days or Date), non-current
-> DeleteVersion (Days fallback to ModTime when SuccessorModTime is
zero; NewerNoncurrent retention enforced when both are set).

ComputeDueAt mirrors Evaluate's shape and returns the earliest eligible
wall-clock time for the same (rule, info), used by the reader/bootstrap
to decide pending vs inline-delete.

Adds StatusEnabled/Disabled and SmallDelay consts and an IsMPUInit
flag on ObjectInfo so .uploads/<id>/ entries route to AbortMPU without
overloading IsLatest.

No callers; package compiles standalone.

* feat(s3/lifecycle): MinTriggerAge for safety-scan cadence

Returns the smallest non-zero day threshold across the rule's actions.
Used as max(MinTriggerAge, kindFloor) for the per-kind cadence; date,
count-only, and delete-marker-only rules return 0 so callers fall
through to their kind-specific floor.

* feat(s3/lifecycle): EventLogHorizon for retention mode gate

Returns the maximum event age the reader needs for a rule. Days-based
kinds return their day threshold; pure NewerNoncurrent (count) and
ExpiredObjectDeleteMarker return SmallDelay. Date rules return 0 (the
gate skips them). Multi-action rules take the max — strictest horizon
wins.

Drives the Phase 2 mode gate: metaLogRetention < EventLogHorizon(rule)
+ bootstrapLookbackMin -> scan_only with RETENTION_BELOW_HORIZON.

* feat(s3/lifecycle): RuleHash for per-rule state CAS

sha256 over canonicalized form, first 8 bytes. Stable across tag-key
reorder, prefix trailing-slash variation, ID renames, and Status flips
(state continuity is preserved when an operator toggles
Enabled/Disabled). Different action shapes — different days, filter,
or action type — hash differently.

Used by the per-rule state directory layout
/etc/s3/lifecycle/<bucket>/<rule_hash_hex>/ and by the bootstrap
detector's reconcile-on-PUT.

* feat(s3/lifecycle): add s3_lifecycle.proto storage schema

Defines the durable types backing the lifecycle worker:
LifecycleState (per-rule mode + bootstrap_complete + degraded_reason
incl. RETENTION_BELOW_HORIZON / LOST_LOG), PendingItem, EntryIdentity,
BootstrapState, ReaderState (per-filer-shard cursors plus
tail_drained_streams marker), BlockerRecord (rule_hash optional for
pre-evaluation failures), RetryBudgetEntry with the four-shape
StreamKey oneof, and RetryTarget (no action / no expected_identity —
retry replays handler against current state).

No callers; schema only. Wired into the pb Makefile.

* feat(s3/lifecycle): add S3LifecycleParams to TaskParams

Wires lifecycle subroutines into the existing worker dispatch.
Subtype is READ (cluster-singleton meta-log reader), BOOTSTRAP
(per-bucket walker), or DRAIN (per-rule pending). bucket / rule_hash
populated for the latter two; ContinuationHint is an advisory resume
point for kill-resumable BOOTSTRAP / READ.

oneof tag = 14, after the existing ec_balance_params at 13.

* fix(s3/lifecycle): noncurrent delete markers honor NoncurrentDays

A non-current delete marker is just another version per AWS S3 spec
and is eligible under NoncurrentVersionExpirationDays. The
IsDeleteMarker special case is meant only for the *current* delete
marker (sole-survivor ExpiredObjectDeleteMarker action), so guard
that switch arm with IsLatest. Without IsLatest, the bootstrap walker
silently skipped pre-existing non-current delete markers under a
NoncurrentDays rule because ComputeDueAt returned zero.

Mirrors the same fix in evaluate.go so the runtime decision matches.

* fix(s3/lifecycle): preserve prefix trailing slash in RuleHash

"logs" and "logs/" match different object sets under literal
strings.HasPrefix semantics: "logs" matches "logsmore/x", "logs/" does
not. Collapsing them in the hash would let an XML edit silently bind
the new rule to the previous rule's durable state directory, causing
stale bootstrap_complete and stale pending entries against a rule
that now matches a different set.

* fix(s3/lifecycle): add UNSPECIFIED sentinels to enum zero values

Proto3 best practice: enum 0 should be an _UNSPECIFIED sentinel, not
an active value. Persisted state schemas care most: a partially
populated payload (or one written by an older binary that didn't set
the field) would otherwise silently default to a semantically active
value.

- LifecycleState.RuleKind: shifts EXPIRATION_DAYS..EXPIRED_DELETE_MARKER
  from 0..5 to 1..6, with RULE_KIND_UNSPECIFIED at 0.
- LifecycleState.RuleMode: shifts EVENT_DRIVEN..PENDING_BOOTSTRAP from
  0..4 to 1..5.
- LifecycleState.DegradedReason: replaces NONE=0 with
  DEGRADED_REASON_UNSPECIFIED=0 (operators treat both as healthy).
- StreamKind: shifts ORIGINAL..PENDING from 0..3 to 1..4.
- S3LifecycleParams.Subtype: shifts READ..DRAIN from 0..2 to 1..3, so
  an unset subtype no longer routes into the cluster-singleton READ task.

No on-disk state has been written yet; renumbering is safe.

* docs(s3/lifecycle): per-action state, not per-rule

A single AWS lifecycle XML <Rule> can declare multiple actions in
parallel (e.g. ExpirationDays=90 + AbortMPU=7 + NoncurrentDays=30).
Each must drive its own delay/horizon/mode/pending stream
independently. Modeling the rule as one compiled entry with one kind
collapses these — picking the smallest delay (7d MPU) means the 90d
expiration cursor advances past objects that aren't yet due, and the
90d action never re-fires.

Restructure storage to per-action: every XML rule expands into N
compiled actions; state lives at <bucket>/<rule_hash>/<action_kind>/.
The intermediate rule_hash directory keeps a rule's actions grouped
for operator listing. Each action has its own state file with its own
mode + bootstrap_complete + degraded_reason; sibling actions of the
same rule can degrade independently.

* fix(s3/lifecycle): per-action proto schema + safety-scan counters

Realigns the durable schema with the per-action storage model and the
safety-scan contract in the design doc.

- Promote LifecycleState.RuleKind to a top-level ActionKind enum and
  rename rule_kind -> action_kind. The same enum now also keys
  BootstrapKey, PendingKey, and BlockerRecord so per-action streams
  under one rule never collapse.
- LifecycleState now keyed by (rule_hash, action_kind). Added
  last_safety_scan_ts_ns / next_safety_scan_ts_ns and the four
  observability counters the design specifies (evaluated_total,
  expired_total, metadata_only_total, error_total). Dropped
  last_evaluated_ns, deleted_total, skipped_object_lock_total, and
  pending_size — subsumed or out-of-band.
- BlockerRecord.action_kind is OPTIONAL (UNSPECIFIED for
  pre-evaluation failures, just like rule_hash).

No on-disk state has been written yet; the renumbering / rename is
free.

* fix(s3/lifecycle): per-action MinTriggerAge / EventLogHorizon helpers

The retention mode gate and safety-scan cadence run on each compiled
action independently. Taking a "min/max across all actions in the
rule" — as the previous helpers did — was wrong: a 90d ExpirationDays
sibling alongside a 7d AbortMPU action would cause the 90d cursor to
advance at 7d (because MinTriggerAge picked the smallest), and the
ExpirationDays action would never re-fire on objects that aged past
the 7d sweep window before reaching 90d.

Both helpers now take an ActionKind and return the threshold for that
specific action only. Returns 0 if the rule does not declare the
requested kind, which makes the gate a no-op and the cadence fall
through to the kind floor.

Also adds RuleActionKinds(rule), the canonical expansion that the
engine uses at compile time to turn one XML rule into N compiled
actions. NewerNoncurrentVersions paired with NoncurrentDays is
subsumed into a single NONCURRENT_DAYS action (AWS-paired
conditions); only when NewerNoncurrent stands alone does it become
NEWER_NONCURRENT.

* fix(s3/lifecycle): length-prefix RuleHash to remove delimiter ambiguity

The previous encoding used "tag=K=V\n" lines, which is ambiguous: a
tag (a=b, c) serializes identically to (a, b=c). A prefix containing
"\nexp_days=99" could likewise forge an action field. Either could
silently bind two semantically different rules to the same per-rule
state directory.

Switch to a length-prefixed canonical form: each scalar is written as
<field-tag-byte> <uvarint-length> <bytes>. Field-tag bytes namespace
each scalar so ("a=b" tag-key) and ("a" tag-key with "=b" leakage)
can't collide. Three regression tests pin the resistance.

* docs(s3/lifecycle): ActionKey is the engine identity, not rule_hash

Finishes the per-action restructure across the engine pseudocode,
bootstrap completion, drain locks, detector paths, policy CAS, and
Phase 2 plan. Every per-action data structure — engine indexes,
target modes, newly-completed sets, bootstrap completion bits, drain
keys, locks, metrics, status — is now keyed by
ActionKey{rule_hash, action_kind}, not by rule_hash alone.

Without this, sibling actions under one XML rule still shared
scheduling state in the engine: originalDelayGroups holding
[]ruleHash means a rule's 7d AbortMPU and 90d ExpirationDays
collapse into one entry, the smaller delay wins for cursor advance,
and the larger sibling never re-fires.

Helper API tightened: EvaluateAction(rule, kind, info, now) and
ComputeDueAt(rule, kind, info) replace the aggregate-rule signatures
so a caller asks one specific action's eligibility against one
entry, never "any action of this rule." Drain task lock includes
action_kind so siblings have independent re-arm timers. Policy CAS
moves from rule_hash to ActionKey membership.

* fix(s3/lifecycle): kind-aware EvaluateAction / ComputeDueAt

Replaces the aggregate-rule signatures with per-action ones:
EvaluateAction(rule, kind, info, now) and ComputeDueAt(rule, kind,
info). The old Evaluate(rule, info, now) could return the verdict of
ANY action declared on the rule, which sat wrong with the per-action
engine indexing (each ActionKey has its own delay group, mode, and
pending stream).

Each helper now decides exactly one (rule, kind) compiled action's
fate against one entry. Asking for a kind the rule doesn't declare,
or asking against the wrong object shape for that kind, returns
ActionNone / zero — never silently routes to a sibling.

Multi-action regression test: a rule with ExpirationDays=90 and
AbortMPU=7 evaluates each kind independently for the same entry; the
7d window has no influence on the 90d eligibility decision.
2026-05-07 10:02:32 -07:00
Minsoo Kim
a1e5eb9dad Fix UI prefix url encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes

* Fix filer UI navigation for URL-sensitive object prefixes

* Clarify filer UI path escaping test name

Rename the legacy filer UI
  path test to describe the actual behavior being checked.

  The printpath helper preserves timestamp characters that are valid in URL path
  components, while the PR fix is focused on query-string escaping for path and cursor
  parameters.
2026-05-06 19:14:36 -07:00
Chris Lu
487b93eb49 fix(volume): don't panic on read when needle map is nil (#9342)
* fix(volume): don't panic on read when needle map is nil

A failed CommitCompact reload (and #9335's new error path for a
remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil
on a volume that's still in the store. readNeedle / readNeedleDataInto
dereferenced v.nm with no guard, so the next GET segfaulted the
http handler instead of returning an error the client could retry on
another replica.

Add the same v.nm == nil check the other Volume accessors already use,
including the slow-read inner loop where the lock is released between
iterations and a failed reload can race in.

Fixes #9339.

* match rust nm-nil read behavior; trim comments

seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset
already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound
on the Go side too instead of a generic 500-mapped error so both volume
servers respond identically when v.nm is nil.

* log once when reads hit nil needle map

ErrorNotFound alone hides the real cause: a half-loaded volume just
returns 404s and the operator has nothing to grep for. Add a once-per-
volume Errorf on the nil path, reset on successful load. Mirror the
same in seaweed-volume via nm_or_not_found().

* trim comments

* drop once-flag, log inline on every nil-nm read
2026-05-06 18:23:06 -07:00
Chris Lu
e96190d128 fix(mount): skip pressure-eviction of gappy page chunks (#9330) (#9334)
* fix(mount): skip pressure-eviction of gappy page chunks (#9330)

A page chunk whose written-interval list has an internal hole was being
sealed under buffer-pressure eviction, then SaveContent would emit one
volume chunk per maximal adjacent run with no chunk covering the hole;
reads then silently zero-fill the gap (filer/stream.go:177-186). On a
sequential cp through FUSE, that bakes in-flight 4 KiB writes into split
volume chunks and leaves chunk-sized blocks of zeros on the destination.

Filter the pressure-driven sealers (SaveDataAt's over-limit path,
EvictOneWritableChunk, ProactiveFlush) to only seal chunks whose written
intervals form one unbroken run. The flush-on-close path (FlushAll) is
unchanged: at close every gap is by definition a sparse-file write that
the app legitimately never made.

* fix(mount): also gate IsContiguouslyWritten on leading zero-offset

Tighten IsContiguouslyWritten to also reject empty lists and lists whose
first interval does not start at offset 0. The internal-gap and
leading-gap cases are symmetric for pressure-driven sealing: both put
in-flight FUSE writeback for the missing range at risk of being baked
into split volume chunks. The flush-on-close path is still unfiltered
(sparse writes are sealed legitimately at FlushAll).

Also align EvictOneWritableChunk's bestBytes initialization with
SaveDataAt (start at 0) so an empty chunk is never picked, matching
the new semantic.

Addresses gemini-code-assist review on PR #9334.

* fix(mount): preserve cap-pressure liveness in EvictOneWritableChunk

The previous version of this fix had EvictOneWritableChunk return false
whenever every dirty chunk was gappy. That broke the accountant's
Reserve loop: cond.Wait only wakes on Release, Release only fires on
upload completion, and refusing to seal anything means no upload starts
— the writer hangs at the -writeBufferSizeMB cap forever.

Two-pass selection: prefer the fullest gap-free chunk (issue #9330: this
is what protects sequential cp from racing FUSE writeback), fall back to
the oldest non-empty writer when nothing is gap-free. Oldest-first
maximizes the chance that FUSE writeback for the gap range has already
settled. The actual sealing path is unchanged — SaveContent still emits
one volume chunk per maximal adjacent run; pages that arrive after the
seal land in a fresh MemChunk for the same logicChunkIndex and are
sealed in turn, so coverage is reconstructed at read time by
readResolvedChunks.

Sequential cp at default settings always hits the strict pass (writes
arrive contiguous-from-0 within their logicChunkIndex), so the bug-fix
behavior is preserved; the fallback only runs under genuinely sparse
workloads or under FUSE writeback so backed up that no chunk has
settled, where forced progress is preferable to a hung mount.

* test(mount): pin ProactiveFlush gap-skip behavior (#9330)

Sibling regression test for the ProactiveFlush guard added in this
series: same 3-chunk setup as TestEvictOneWritableChunk_SkipsGappyChunks
(internal gap, leading gap, contiguous). Verifies ProactiveFlush picks
the contiguous chunk when staleness criteria are otherwise satisfied,
returns false when only gappy chunks remain (no liveness fallback like
EvictOneWritableChunk has — failing here is just a missed optimization),
and that filling the holes lets the chunks auto-seal via maybeMoveToSealed.

* style(mount): trim verbose comments on #9330 fix
2026-05-06 15:26:56 -07:00
Chris Lu
7b0b64db65 fix(admin/view): wrap plugin history URL with basePath (#9341)
Plugin tabs/sub-tabs use history.pushState/replaceState to keep the
URL bar in sync with the active view, but updateURL fed it the raw
output of buildPluginURL ("/plugin/lanes/<lane>/..."). Under a
urlPrefix deployment that strips the prefix, so reloading the page
hit /plugin/... directly and 404'd at the proxy.

Wrap with basePath() so the rewritten URL keeps the deployment
prefix.

Reported at #9240.
2026-05-06 15:25:06 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
12f283357f fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333)
* fix(iam): scope IAM-managed OIDC provider lookup by role account

Two account-scoped OIDC records sharing an issuer were collapsed into a
single map slot keyed only by the URL. The last-write-wins entry then
served every AssumeRoleWithWebIdentity, so a token destined for account
B's role could be validated by account A's record (its clientIDs and
thumbprints), defeating the per-account isolation the records exist for.

The role-account check in enforceProviderAccountScope still rejected
the cross-account assumption, but only after the wrong record's
audience and TLS pin had already accepted the token.

Refresh now keys IAM-managed records as (issuer, account), and
validation parses the requested role's account up front and matches
the record under that issuer in this order: exact account, global
(account-less), static-config fallback. An unknown account hint
deliberately skips account-scoped entries — picking one arbitrarily is
the bug this commit fixes — and falls through to global or static.

* fix(iam): route public AssumeRoleWithWebIdentity through IAMManager

handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity
directly, bypassing the IAMManager wrapper. The wrapper is where
enforceProviderAccountScope rejects cross-account assumption attempts
and capDurationByRole clamps to the role's MaxSessionDuration; both
silently became no-ops for any AWS-SDK caller hitting the public
endpoint.

Dispatch through the IAMManager (via the existing IAMManagerProvider
interface that other handlers in this file already use) when one is
wired. Embedded test setups without an IAM integration fall back to
the bare STS service unchanged.

* fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config

initOIDCProviderStore mirrored only URL and ClientIDs. Once
RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on
boot once the metadata-subscribe loop kicked in),
buildOIDCProviderFromRecord rebuilt the runtime provider from this
truncated record. Because IAM-managed entries take precedence over the
static-config map, the rebuild silently shadowed the bootstrap with a
weaker provider:

- Thumbprints: dropped, so TLS-pinned issuers fell back to the system
  trust store.
- AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped
  reaching the session.
- PolicyClaim: dropped, so claim-based policy mode stopped triggering.

Pull all three from the provider's static Config map at mirror time so
the stored record round-trips to a runtime provider equivalent to the
one the static config produced directly.

* fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler

Phase 3b advertises that RoleArn MAY be omitted in claim-based policy
mode — the STS service then derives the assumed-role ARN from the
configured policy claim. The HTTP handler still rejected empty RoleArn
up front with MissingParameter, so SDK callers using the documented
omitted-role flow never reached the STS layer.

Drop the pre-check; STS still validates that claim-based mode is
configured and that the IDP emits policies, returning a precise error
when either is missing. The existing error mapping below this point
surfaces those as InvalidParameterValue, matching what an AWS SDK
expects.

* test(iam): update missing-RoleArn STS integration test for the new contract

The previous commit drops the HTTP-layer RoleArn pre-check so claim-based
mode can derive the ARN from a JWT claim. The integration test still
asserted MissingParameter for the missing-RoleArn case, which now
reaches the STS layer and surfaces a JWT-parse error instead. Update
the assertion to match: missing RoleArn alone must no longer surface
as MissingParameter, but a bogus JWT must still be rejected.
2026-05-05 19:14:44 -07:00
Chris Lu
9af1b212d3 feat(iam): OIDC provider audit trail (Phase 3e) (#9325)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.

* feat(iam): opt-in session revocation via JTI blocklist

Add SessionRevocationStore (memory + filer implementations) and wire
it into the IAMManager.IsActionAllowed path so a revoked session is
rejected on the next signed request. Session JWTs now embed the
session id as the JTI claim, giving the blocklist a stable key
without requiring a second secret.

Operators who don't configure a store keep the existing fully-stateless
behaviour: every session stays valid until natural expiry. Operators
who do configure one accept one filer lookup per signed request in
exchange for being able to invalidate compromised tokens before
expiry. Revocation entries carry the original session expiry so the
blocklist self-trims via PurgeRevokedSessions.

* fix(iam): hash JTI filenames + paginate Purge with proper EOF handling

Three reviewer-flagged issues on the filer-backed revocation store:

1. Path traversal (security-medium): RevokeSession is exported and takes
   an arbitrary string. Using the JTI verbatim as a filename meant a
   caller could pass "../../etc/passwd" to write outside the basePath.
   SHA-1 hash the JTI to a fixed-width hex name; lookups still find the
   entry because Revoke and IsRevoked share the same hash function.

2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed
   the outer err and the loop just broke on any failure, so a mid-stream
   gRPC error returned (count, nil) and the caller had no idea the
   purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream
   and propagate everything else.

3. Purge had a hardcoded 10000-entry cap. Stream-paginate via
   StartFromFileName so the operator-cron can clean a backlog larger
   than that without losing rows.

* feat(iam): OIDC provider audit trail

Emit one structured event per IAM-managed OIDC provider lifecycle
mutation (Create, Delete, Add/Remove ClientID, UpdateThumbprints,
Tag, Untag). Three sinks ship in-tree:

- GlogAuditSink — default; events become structured log lines.
- MemoryAuditSink — in-process buffer for tests / inspection.
- FilerAuditSink — durable, one file per event under
  /etc/iam/audit/oidc-providers (operator-overridable basePath).

Audit emission is best-effort: a failing sink never blocks an IAM
mutation that has already succeeded. Use-events (per token validation)
are intentionally not emitted yet — too hot for unconditional sinks.

* fix(iam): collision-free audit filenames + correct file mtime

Two issues gemini flagged on FilerAuditSink.Emit:

1. Filename was %d-%s.json (UnixNano + Type). Two mutations at the
   same nano against different ARNs (think batch script touching
   several providers) collided on the same path and the second
   CreateEntry failed silently. Append a short ARN hash to make the
   name unique per-event without leaking the ARN.

2. File Mtime/Crtime were set to time.Now() at write time. For audit
   integrity those should reflect the event's occurrence time so
   filer-level "ls -lt" output matches the contents.

Addresses three medium-priority gemini reviews on PR #9325.
2026-05-05 13:37:24 -07:00
Chris Lu
9d6a699b94 feat(iam): opt-in session revocation via JTI blocklist (Phase 3d) (#9324)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.

* feat(iam): opt-in session revocation via JTI blocklist

Add SessionRevocationStore (memory + filer implementations) and wire
it into the IAMManager.IsActionAllowed path so a revoked session is
rejected on the next signed request. Session JWTs now embed the
session id as the JTI claim, giving the blocklist a stable key
without requiring a second secret.

Operators who don't configure a store keep the existing fully-stateless
behaviour: every session stays valid until natural expiry. Operators
who do configure one accept one filer lookup per signed request in
exchange for being able to invalidate compromised tokens before
expiry. Revocation entries carry the original session expiry so the
blocklist self-trims via PurgeRevokedSessions.

* fix(iam): hash JTI filenames + paginate Purge with proper EOF handling

Three reviewer-flagged issues on the filer-backed revocation store:

1. Path traversal (security-medium): RevokeSession is exported and takes
   an arbitrary string. Using the JTI verbatim as a filename meant a
   caller could pass "../../etc/passwd" to write outside the basePath.
   SHA-1 hash the JTI to a fixed-width hex name; lookups still find the
   entry because Revoke and IsRevoked share the same hash function.

2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed
   the outer err and the loop just broke on any failure, so a mid-stream
   gRPC error returned (count, nil) and the caller had no idea the
   purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream
   and propagate everything else.

3. Purge had a hardcoded 10000-entry cap. Stream-paginate via
   StartFromFileName so the operator-cron can clean a backlog larger
   than that without losing rows.
2026-05-05 13:25:22 -07:00
Chris Lu
6483583491 feat(iam): account-scoped OIDC providers (Phase 3c) (#9323)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.

* feat(iam): account-scoped OIDC providers

Add OIDCProviderRecord.AccountID enforcement: when a role lives in
account A, the OIDC provider validating the assume-role token must be
either global (AccountID="") or also live in account A. Cross-account
use is rejected at the IAM-manager layer before reaching the trust
policy validator.

OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in-
memory and filer-backed stores implement it. Static-config-only
deployments are unaffected since they don't populate the store.

* fix(iam): account-scoped lookup for cross-account check

enforceProviderAccountScope was calling GetProviderByIssuer, which
returns the first match arbitrarily when multiple providers share an
issuer (one global + one per tenant is the canonical setup). On a
two-record collision the wrong record could come back first and
falsely reject a valid same-account or global-provider request.

Use GetProviderByIssuerAndAccount as the primary lookup so the
filter happens in the store. On miss, fall back to
GetProviderByIssuer purely to distinguish "issuer entirely unknown"
(let the STS layer reject) from "issuer registered in a different
account" (surface a precise cross-account error).

Addresses gemini high-priority review on PR #9323.
2026-05-05 13:06:53 -07:00
Chris Lu
bc1d458fe6 fix(iam): reject empty issuer in ComputeParentUser (#9326)
Without iss, the same `sub` from two different IDPs would collapse to
the same parent_user hash. Short-circuit to empty when either input
is missing so callers see "no identity" instead of a colliding hash.

Addresses coderabbit review on PR #9318.
2026-05-05 13:01:33 -07:00
Chris Lu
1d3454ca5c feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity (Phase 3b) (#9322)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity

When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based
(or omits it entirely) and the matched OIDC provider has policyClaim set,
mint a session whose effective policies come from that JWT claim instead
of from a server-side role mapping. Accepts string, comma-separated
string, or array shapes — MinIO-compatible behaviour for IDPs that
already attach policies to the user.

Trust-policy validation is skipped in claim-mode: the IDP is the sole
authority for both authentication and authorization, mirroring the
contract MinIO documents for its DummyRoleARN flow. Concrete-role mode
is unchanged and still requires the role definition + trust policy.

* fix(iam): trim policy-claim array elements + clean up stale comments

Three medium-priority cleanups gemini flagged on the claim-based path:

- extractClaimPolicies's array branch was leaving whitespace on each
  element while the string/comma-separated branch trimmed via
  splitPolicyClaimString. An IDP that emits ["readonly", " billing "]
  would create a "billing" policy lookup that didn't match the stored
  name. Trim every array element, drop empties.
- The "synthetic ARN keyed on the session name" comment was wrong —
  effectiveRoleArn here is the literal sentinel; it's the assumed-role
  ARN generated downstream that's session-keyed. Reword.
- The empty if/else block at the start of
  validateAssumeRoleWithWebIdentityRequest existed only to host a
  comment about deferred validation; the comment now lives in the
  function godoc and the empty branch is gone.

Addresses three gemini medium reviews on PR #9322.
2026-05-05 12:21:55 -07:00
Chris Lu
6554ab7928 feat(iam): principal session tags from OIDC tokens (Phase 3a) (#9321)
* feat(iam): principal session tags from OIDC tokens

Extract the AWS principal-tags namespace claim
(`https://aws.amazon.com/tags/principal_tags`) from validated OIDC
tokens, filter through a per-provider AllowedPrincipalTagKeys allowlist,
and surface as `aws:PrincipalTag/<key>` in the STS session request
context. Empty allowlist means "no tags surfaced" — operators must opt
keys in explicitly so a misconfigured IDP can't pollute policy
evaluation.

Policy engine now accepts `aws:PrincipalTag/...` and
`aws:RequestTag/...` as substitutable variables so resource-level
ABAC policies can reference them.

* fix(iam): case-insensitive principal-tag allowlist + sharper comment

filterPrincipalTags compared keys case-sensitively, but AWS IAM
session tag keys are case-insensitive (the docs are explicit). An
IDP whose claim casing drifts from the operator-configured allowlist
string would silently filter the value out — surprising failure mode.
Lowercase both sides during the lookup; the original key casing is
preserved on the output so policy variables still match what the
caller sees.

Also reword the "anything the IDP signs is acceptable" comment in
sts_service.go: it predates the per-provider allowlist that's
already filtering before this point. The reality is now "everything
reaching here is on the operator's opt-in list, dropped entirely if
the allowlist is empty."

Addresses two gemini medium reviews on PR #9321.
2026-05-05 11:43:05 -07:00
Chris Lu
f8973b3ed6 feat(iam): OIDC provider mutations + multi-client + TLS thumbprints (Phase 2b) (#9320)
* feat(iam): OIDC provider mutations + multi-client + TLS thumbprints

- Mutating IAM actions: CreateOpenIDConnectProvider,
  DeleteOpenIDConnectProvider, AddClientIDToOpenIDConnectProvider,
  RemoveClientIDFromOpenIDConnectProvider,
  UpdateOpenIDConnectProviderThumbprint, TagOpenIDConnectProvider,
  UntagOpenIDConnectProvider. Each enforces AWS-shape input bounds and
  the read-only mode rejects all mutations.
- Multiple client_ids per provider in OIDCConfig (clientIds list, plural)
  with full backward compatibility — singular clientId still works and is
  merged into the audience allowlist. Provider factory accepts both.
- AWS-compatible TLS thumbprint pinning: when OIDCConfig.Thumbprints is
  non-empty, JWKS fetches enforce that the negotiated TLS chain contains
  a certificate whose SHA-1 hex matches the allowlist. Empty list keeps
  the existing system-trust path.

* fix(iam): factor Tags.member.N parser into a helper

CreateOpenIDConnectProvider and TagOpenIDConnectProvider were both
walking the AWS Tags.member.N.Key / Tags.member.N.Value query-string
convention with copy-pasted loops. Factor into extractTags so the
parsing rules and the "no tags present" semantics live in one place.

Addresses gemini medium review on PR #9320.

* fix(iam): sentinel errors for OIDC provider not-found / already-exists

The s3api dispatcher was using strings.Contains(err.Error(), "not found")
and "already exists" to map IAM-manager errors back to AWS error codes.
Substring matching on a formatted message couples the API error code
to the exact wording of the upstream message — touching the message
silently changes the IAM API contract.

Define ErrOIDCProviderNotFound and ErrOIDCProviderAlreadyExists in the
integration package, fmt.Errorf("%w: ...") them at the four return
sites in iam_manager.go and oidc_provider_store.go, and use errors.Is
at the s3api call sites. Same control flow, no string-match fragility.

Addresses gemini medium review on PR #9320.

* fix(iam): surface non-NotFound errors from CreateOIDCProvider lookup

Previously CreateOIDCProvider only treated GetProviderByARN's success path
as "exists" and silently fell through on any error, including transient
backend failures. That hid real problems and still attempted a write.
Distinguish ErrOIDCProviderNotFound (the only "safe to create" case) from
other errors so we don't mask filer outages or partition issues.

* fix(iam): enforce 100-client-ID cap on AddClientIDToOIDCProvider

CreateOIDCProvider and the implicit update path through validateOIDC-
ProviderRecord both reject lists with more than 100 client IDs, but
AddClientIDToOIDCProvider could grow the list past that bound one
ID at a time. Refuse the add when the list is already at the cap so
the invariant holds across every mutation entry point.

* feat(iam): IAM-managed OIDC provider live view in STS service

Add a separate, mutex-guarded map of admin-managed OIDC providers on
the STS service. The map can be atomically replaced via
SetIAMManagedOIDCProvidersByIssuer; AssumeRoleWithWebIdentity lookups
consult it first and fall back to the existing static-config map, so
records persisted through the IAM API can shadow bootstrap entries
without a restart.

This is the runtime hook the IAM API and the metadata-subscribe path
will both call when the OIDCProviderStore changes (next two commits).

* feat(iam): refresh STS service runtime view after OIDC mutations

Add IAMManager.RefreshOIDCProvidersFromStore: lists every persisted
OIDCProviderRecord, builds a runtime OIDCProvider for each, and atomically
publishes the issuer-keyed map into the STS service. Each mutating IAM API
call (Create / Delete / AddClientID / RemoveClientID / UpdateThumbprints)
now triggers this refresh inline so the local instance picks up the change
without waiting for a metadata-subscribe round trip. Tag mutations skip
the refresh because tags do not affect token validation.

Refresh failures only log; the persisted write has already succeeded by
that point, so a transient list error must not surface to the API caller.
The peer-instance update path (filer metadata subscription) is added in a
follow-up commit.

* feat(iam): subscribe to OIDC provider changes on the filer

Watch /etc/iam/oidc-providers under the existing s3 metadata-subscribe
loop and call RefreshOIDCProvidersFromStore on any create / update /
delete / rename. This is the cross-instance update path: S3 server A
writes via the IAM API, the filer fans out the metadata change, and S3
servers B..N pick up the new runtime view without a restart.

Mirrors the existing onIamConfigChange / onCircuitBreakerConfigChange
pattern. The handler short-circuits when the path is unrelated, and
when no IAMManager is wired in (static-only configurations).
2026-05-05 11:26:08 -07:00
Chris Lu
6141222ab0 fix(test/s3/policy): allocate fresh admin port per subtest (#9332)
* fix(test/s3/policy): allocate fresh admin port per subtest

startMiniCluster ran weed mini in-process and explicitly assigned
master/volume/filer/s3 ports allocated by MustAllocatePorts, but it
left -admin.port and -admin.port.grpc unset, so each subtest reused
the hardcoded defaults 23646 / 33646.

The package's subtests run sequentially within the same go test
process. The previous subtest's admin goroutine is still bound to
23646 by the time the next subtest spins up its own mini, so the
new admin can never bind, mini.go's waitForAdminServerReady hits
its 240-attempt cap, and glog.Fatalf kills the test binary. This
has been the dominant cause of "admin server did not become ready"
flakes across recent IAM PRs.

Allocate two extra ports for admin and pass them through. The other
subprocess-based tests (s3tables/*) are not affected because each
launches weed mini in a fresh OS process.

* fix(mini): make admin readiness wait context-aware

waitForAdminServerReady polled for 240 attempts × 500ms regardless of
whether the surrounding mini context was cancelled. When mini is run
in-process from a test harness (test/s3/policy/...) and the test calls
its cancel func, the leftover wait keeps spinning for the full two
minutes and then glog.Fatalf's, terminating the entire test binary —
including any sibling subtest that has since started its own mini.

Thread the existing miniClientsCtx through the wait so a Stop / cancel
returns context.Canceled immediately. The caller (startMiniAdminWithWorker)
treats a context-cancelled outcome as a graceful shutdown signal and
logs+returns instead of fataling.
2026-05-05 11:24:43 -07:00
Chris Lu
95560076e6 fix(mini): raise admin readiness timeout to 2 minutes (#9329)
The 30-second ceiling on waitForAdminServerReady was too tight on busy
CI runners. master + filer + volume + admin all start in parallel on a
shared worker, and S3 Policy Shell Integration Tests has been flaking
across multiple PRs with "admin server did not become ready... after
60 attempts" even though the server still comes up within a minute or
two. Two minutes (240 attempts at 500ms) leaves headroom for runner
contention without being absurd in a local-dev run.
2026-05-05 07:59:25 -07:00
Chris Lu
22ebe9feb0 ci(e2e): switch FUSE Mount build to Azure Ubuntu mirror, persist buildx cache
archive.ubuntu.com from GitHub-hosted runners has been Ign:/retrying for
~60s per package, eating the Start SeaweedFS step's 10-min budget before
apt-get install finishes. The host already uses azure.archive.ubuntu.com;
do the same inside Dockerfile.e2e and drop the Retries=5 amplifier.

Also rotate /tmp/.buildx-cache-new over /tmp/.buildx-cache so the apt
layer actually survives across runs, and bump the step to 15 minutes
as a safety margin.
2026-05-05 00:22:59 -07:00
Chris Lu
4ded97a321 feat(iam): OIDC provider store + read-only IAM API (Phase 2a) (#9319)
* feat(iam): STS web-identity AWS-fidelity polish

- OIDC discovery via .well-known/openid-configuration; falls back to
  /.well-known/jwks.json when discovery is absent. Reject discovery docs
  whose issuer claim does not match the configured issuer to defend
  against issuer-substitution.
- ComputeParentUser derives a stable per-identity hash from (sub, iss).
  Surface as aws:userid in the request context and as a parent_user
  claim in the session JWT so per-user state survives token rotation.
- Per-role MaxSessionDuration (3600..43200) clamps requested
  DurationSeconds before the STS service applies its own caps.
- Tighten RoleSessionName to the AWS contract: 2..64 chars from
  [\w+=,.@-].
- Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity /
  AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte
  inline session policy budget.

* fix(iam): leave omitted DurationSeconds nil so STS default applies

capDurationByRole was substituting the role's MaxSessionDuration
when the caller omitted DurationSeconds entirely. AWS returns the
configured default (typically 1 hour) in that case, not the role's
upper bound — a 12h MaxSessionDuration shouldn't silently make every
no-duration assume-role mint a 12h session.

Return nil when requested is nil; let the downstream
calculateSessionDuration in the STS service apply its TokenDuration
default. The role-max upper bound still clamps when the request
arrives with a concrete value above the cap.

Addresses gemini high-priority review on PR #9318.

* fix(iam): synchronize OIDCProvider JWKS cache fields

jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are
mutated lazily on the first token-validate call and refreshed
afterwards on TTL expiry. Multiple S3 requests can land here in
parallel, so the writes were racing against subsequent reads on
every other goroutine. resolvedJWKSUri/discoveryFailed inherited
the same un-protected pattern when discovery shipped.

Add sync.RWMutex; getPublicKey takes the read lock for the
common cache-hit path and promotes to the write lock for misses
+ refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the
write lock is held by the caller; fetchJWKS keeps the
test-friendly entry point that acquires the lock itself.

Addresses gemini high-priority review on PR #9318.

* fix(iam): trim trailing slash + retry discovery after transient failure

Two OIDC discovery edge cases reviewers flagged:

1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri
   trims them when building the discovery URL, but the doc.Issuer ↔
   p.config.Issuer check did not, so an IDP whose issuer claim drops or
   adds the slash relative to the configured value would be falsely
   rejected. Trim a single trailing slash on each side before comparing.

2. discoveryFailed flipped to true on any error and stayed there for the
   process lifetime. A transient 5xx at startup permanently locked the
   provider into the /.well-known/jwks.json fallback. Reset the flag at
   the top of fetchJWKSLocked when no URI has been cached yet, so each
   JWKS refresh (typically once per TTL = 1h) reattempts discovery.
   Successful discovery remains cached via resolvedJWKSUri so we don't
   pay the discovery RTT on every refresh.

Addresses gemini security-medium + medium reviews on PR #9318.

* fix(iam): require non-empty issuer in OIDC discovery doc

The previous "doc.Issuer != "" && ..." guard let a discovery document
that omitted the issuer field bypass the issuer-mismatch check
entirely, letting the doc steer fetchJWKS at any URL it provided.
OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a
hard failure same as mismatched. Trailing-slash equivalence still
applies.

Adds TestDiscoveryRejectsMissingIssuer alongside the existing
TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer
toggle on fakeIDP.

* feat(iam): OIDC provider store + read-only IAM API

Add OIDCProviderRecord — the persisted, IAM-managed view of an OIDC
identity provider — and an OIDCProviderStore interface with memory and
filer implementations mirroring the existing role-store pattern.

The store is hydrated at boot from the static STS.Providers list so the
new IAM API surfaces the same set the STS service already validates
against. Two read-only actions land now:

- ListOpenIDConnectProviders -> ARN-only list, AWS-shape XML.
- GetOpenIDConnectProvider   -> URL, ClientIDList, ThumbprintList,
                                Tags, CreateDate.

Mutations (Create/Delete/Add-Remove ClientID/Update Thumbprint), multiple
client_ids per provider, and TLS thumbprint pinning come in Phase 2b.

* fix(iam): preserve CreatedAt across boots + paginate ListProviders

Two medium-priority issues gemini flagged on the read-only IAM API:

1. The static-config bootstrap was setting CreatedAt = time.Now() on
   every server start, so the IAM GetOpenIDConnectProvider response's
   CreateDate shifted on each restart even when backed by a persistent
   store. Look up the existing record via GetProviderByARN first and
   preserve its CreatedAt; only the UpdatedAt advances.

2. FilerOIDCProviderStore.ListProviders had a hardcoded Limit: 1000
   that silently truncated above that. Stream-paginate via
   StartFromFileName, returning io.EOF naturally and surfacing all
   other errors instead of swallowing them.

Addresses two gemini medium reviews on PR #9319.
2026-05-04 22:15:03 -07:00
Chris Lu
d951a8df5a feat(iam): STS web-identity AWS-fidelity polish (Phase 1) (#9318)
* feat(iam): STS web-identity AWS-fidelity polish

- OIDC discovery via .well-known/openid-configuration; falls back to
  /.well-known/jwks.json when discovery is absent. Reject discovery docs
  whose issuer claim does not match the configured issuer to defend
  against issuer-substitution.
- ComputeParentUser derives a stable per-identity hash from (sub, iss).
  Surface as aws:userid in the request context and as a parent_user
  claim in the session JWT so per-user state survives token rotation.
- Per-role MaxSessionDuration (3600..43200) clamps requested
  DurationSeconds before the STS service applies its own caps.
- Tighten RoleSessionName to the AWS contract: 2..64 chars from
  [\w+=,.@-].
- Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity /
  AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte
  inline session policy budget.

* fix(iam): leave omitted DurationSeconds nil so STS default applies

capDurationByRole was substituting the role's MaxSessionDuration
when the caller omitted DurationSeconds entirely. AWS returns the
configured default (typically 1 hour) in that case, not the role's
upper bound — a 12h MaxSessionDuration shouldn't silently make every
no-duration assume-role mint a 12h session.

Return nil when requested is nil; let the downstream
calculateSessionDuration in the STS service apply its TokenDuration
default. The role-max upper bound still clamps when the request
arrives with a concrete value above the cap.

Addresses gemini high-priority review on PR #9318.

* fix(iam): synchronize OIDCProvider JWKS cache fields

jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are
mutated lazily on the first token-validate call and refreshed
afterwards on TTL expiry. Multiple S3 requests can land here in
parallel, so the writes were racing against subsequent reads on
every other goroutine. resolvedJWKSUri/discoveryFailed inherited
the same un-protected pattern when discovery shipped.

Add sync.RWMutex; getPublicKey takes the read lock for the
common cache-hit path and promotes to the write lock for misses
+ refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the
write lock is held by the caller; fetchJWKS keeps the
test-friendly entry point that acquires the lock itself.

Addresses gemini high-priority review on PR #9318.

* fix(iam): trim trailing slash + retry discovery after transient failure

Two OIDC discovery edge cases reviewers flagged:

1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri
   trims them when building the discovery URL, but the doc.Issuer ↔
   p.config.Issuer check did not, so an IDP whose issuer claim drops or
   adds the slash relative to the configured value would be falsely
   rejected. Trim a single trailing slash on each side before comparing.

2. discoveryFailed flipped to true on any error and stayed there for the
   process lifetime. A transient 5xx at startup permanently locked the
   provider into the /.well-known/jwks.json fallback. Reset the flag at
   the top of fetchJWKSLocked when no URI has been cached yet, so each
   JWKS refresh (typically once per TTL = 1h) reattempts discovery.
   Successful discovery remains cached via resolvedJWKSUri so we don't
   pay the discovery RTT on every refresh.

Addresses gemini security-medium + medium reviews on PR #9318.

* fix(iam): require non-empty issuer in OIDC discovery doc

The previous "doc.Issuer != "" && ..." guard let a discovery document
that omitted the issuer field bypass the issuer-mismatch check
entirely, letting the doc steer fetchJWKS at any URL it provided.
OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a
hard failure same as mismatched. Trailing-slash equivalence still
applies.

Adds TestDiscoveryRejectsMissingIssuer alongside the existing
TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer
toggle on fakeIDP.
2026-05-04 22:10:49 -07:00
Chris Lu
2417ba0354 fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints

Three destructive VolumeServer gRPC endpoints (DeleteCollection,
VolumeDelete, VolumeServerLeave) had no authentication checks, unlike
their HTTP counterparts which are protected by the Guard whitelist.

Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth
helper on VolumeServer that extracts the peer IP from gRPC context and
validates it against the guard whitelist. Gate all three endpoints
behind this check.

* fix(volume): tolerate unparseable gRPC peer address in admin auth check

S3 Filer Group integration tests were failing with
PermissionDenied "bad peer address: address @: missing port in address"
when DeleteCollection ran across the in-process gRPC connection
between filer and volume server — the peer addr surfaces as "@" there
and net.SplitHostPort can't parse it. The check rejected before
IsWhiteListed could exercise its allow-all path for empty-whitelist
deployments.

Hand the raw peer string to IsWhiteListed when SplitHostPort fails.
With no whitelist configured (the test environment's mode) it accepts;
with a whitelist configured the unparseable host won't match anything
and the call still gets denied as it should.

Adds three regression tests for IsWhiteListed pinning the empty-config
allow-all, populated-list reject-unknown, and signing-key-only allow-
all branches that the gRPC admin helper relies on.

* refactor(security): dedup checkWhiteList through IsWhiteListed

The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the
same lookup logic in two places; future drift was just a matter of
time. Have checkWhiteList delegate so the membership semantics live
in exactly one function.

Behaviour is unchanged: the new path still returns nil for
isEmptyWhiteList (signing-key-only mode) and still rejects unknown
hosts when a whitelist is configured.

Addresses gemini medium review on PR #8876.

* fix(volume): protect remaining state-altering gRPC admin endpoints

DeleteCollection, VolumeDelete, and VolumeServerLeave were the
truly-destructive endpoints, but AllocateVolume, VolumeMount,
VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and
VolumeMarkWritable also modify server state and should sit behind
the same whitelist gate. Read-only endpoints (VolumeStatus,
VolumeServerStatus, VolumeNeedleStatus, Ping) stay open.

The check is a no-op when no whitelist is configured (the default),
so existing deployments keep working; operators who lock down their
volume servers via guard.white_list now get consistent coverage.

Addresses gemini security-high review on PR #8876.

* fix(volume): typed peer addr + audit log for gRPC admin auth

Prefer a typed *net.TCPAddr when extracting the peer IP — string
parsing was already a fallback for the in-process case but using the
typed form first is cleaner and skips an unnecessary parse on the
common path. Log failed authorization attempts at V(0) so an operator
running with a whitelist sees the host that was rejected (and the
raw remote address in case the IP lookup itself was the failure
mode), matching what the HTTP Guard already does.

Addresses gemini medium review on PR #8876.

* fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints

Five more master/admin-driven destructive operations live outside
volume_grpc_admin.go and were missing the same whitelist gate:

- VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup
- ScrubVolume
- VolumeEcShardsDelete

VacuumVolumeCheck stays open (read-only). BatchDelete also stays
open: it's the data-plane multi-object delete called from the S3 API
and filer, not an admin operation; gating it would break ordinary S3
DeleteObjects calls.

Addresses gemini security-high review on PR #8876.

* fix(volume): simplify no-peer-info branch in gRPC admin auth

The IsWhiteListed("") fallback was defending against a scenario
that doesn't actually arise — real gRPC connections always populate
peer info. Drop the branch and just deny when peer info is missing,
which is the safer default and matches "if we don't know who the
caller is, refuse".

* fix(volume-rust): mirror gRPC admin auth on the rust volume server

The rust volume server has the same set of destructive admin
endpoints as the Go side and the same Guard infrastructure, but
nothing was wired together — every endpoint accepted unauthenticated
calls regardless of guard configuration. Same vulnerability class
the Go fix on this PR closes; this commit closes it on the rust
side too so the two stacks stay aligned.

Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer
SocketAddr off the tonic Request and runs Guard::check_whitelist on
its IP, then applies the helper to the same set the Go side covers:
DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount,
VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable,
VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit,
VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume,
VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete
stays open as a data-plane multi-object delete.
2026-05-04 21:14:55 -07:00
Chris Lu
a769c938ec test(s3tables): Unity Catalog OSS integration tests against SeaweedFS (#9308)
* test(s3tables): add Unity Catalog OSS integration test against SeaweedFS

Mirrors the configuration used by the upstream playground at
data-engineering-helpers/mds-in-a-box/unitycatalog-playground.

Three test variants under test/s3tables/unity_catalog:

- TestUnityCatalogDeltaIntegration: aws.masterRoleArn empty / static
  keys; catalog/schema/EXTERNAL Delta CRUD + temporary-table-credentials
  S3 round-trip (the playground's working configuration).
- TestUnityCatalogMasterRoleIntegration: aws.masterRoleArn set to a
  SeaweedFS-side role with a permissive trust policy; UC's StsClient
  is pinned at SeaweedFS via AWS_ENDPOINT_URL_STS, and the test asserts
  the vended creds carry a session_token and a non-static access key,
  proving the role-vended path the playground notes as not-yet-working
  actually does work today.
- TestUnityCatalogDeltaRsRoundTrip: writes/reads a real Delta table at
  the registered storage_location using delta-rs in a slim Python
  container, with temporary credentials fetched from UC.

All three self-skip without Docker or a weed binary, matching the
sibling lakekeeper / polaris tests.

* test(s3tables): tighten Unity Catalog tests against actual UC OSS behavior

After running the suite locally, ground the assertions in what the
upstream UC OSS Docker image actually does against SeaweedFS today.

- Static-key playground configuration
  (TestUnityCatalogDeltaIntegration): catalog/schema/EXTERNAL Delta CRUD
  pass against the SeaweedFS-backed warehouse. The temporary-table-
  credentials subtest is renamed and inverted to assert the failure mode
  the playground reports -- UC's AwsCredentialVendor falls through to an
  internal StsClient.assumeRole when masterRoleArn and sessionToken are
  both empty, which has no real STS to talk to. Bucket path is also
  fixed to match UC's getStorageBase() lookup (s3://lakehouse vs the
  playground's s3://lakehouse/warehouse, which the upstream code never
  matches).

- Master-role variant (TestUnityCatalogMasterRoleIntegration): split
  into two passing slices. Slice 1 proves SeaweedFS' STS endpoint
  vending UnityCatalogVendedRole works via the Go AWS SDK and the vended
  creds round-trip on S3. Slice 2 boots UC with aws.masterRoleArn set
  and verifies catalog/schema/Delta CRUD. The third hop -- UC's Java
  StsClient actually reaching SeaweedFS' STS handler during
  /temporary-table-credentials -- is logged but not asserted, since the
  AWS Java SDK's STS request currently lands on a SeaweedFS S3 path
  rather than the STS handler.

- Delta-RS round-trip (TestUnityCatalogDeltaRsRoundTrip): gated on
  UC_DELTA_RS_RUN=1 since it depends on the master-role STS handoff
  above. The Dockerfile / writer script stay in tree so the test runs
  end-to-end the moment that hop is fixed.

README rewritten to be explicit about what each test validates today
and what is still pending.

Result: `go test -run TestUnityCatalog ./test/s3tables/unity_catalog/...`
passes cleanly with weed + Docker available, and self-skips otherwise.

* test(s3tables): exercise unity catalog integrations

* ci: run Unity Catalog integration tests on PRs

Adds a unity-catalog-integration-tests job to s3-tables-tests.yml,
modeled on the existing lakekeeper / dremio jobs. Pre-pulls the UC
image and python:3.11-slim (used by the delta-rs writer container) and
runs `go test ./test/s3tables/unity_catalog`.

Format-check and go-vet jobs already recurse into ./test/s3tables/...
so the new package is covered there too.

* test/ci: address PR review

Tighten the UC readiness probe to require 200, not <500, so a
401/403/404 during startup surfaces immediately instead of being
treated as ready (CodeRabbit).

Pin the UC image to v0.4.0 in both the workflow and the test default,
matching the pinned-tag convention the rest of s3-tables-tests.yml
uses (CodeRabbit). Use UC_IMAGE=unitycatalog/unitycatalog:main to
re-test against current upstream.

* docs: separate UC static-key vs master-role failure modes

The README mixed the two together. Static-key empty-sessionToken
short-circuits with "S3 bucket configuration not found." before UC
even fires an STS call; the AccessDenied I described is what happens
in the master-role variant where UC's Java StsClient actually reaches
SeaweedFS. Cross-link the playground PR that fixes the static-key
vending side.

Also drop the "what most playground users actually run" hand-wave
under MANAGED tables.

* docs: trim README

Drop the playground cross-reference and the "two layers fail
independently" framing.

* docs: pin down what's actually pending

Investigated the master-role STS handoff with a sniffer in front of
SeaweedFS' STS port. UC's StsClient is constructed without an
endpointOverride and never reads aws.endpoint or AWS_ENDPOINT_URL_STS;
verified by pointing AWS_ENDPOINT_URL_STS at port 1 and seeing the
same real-AWS InvalidClientTokenId 403 with zero traffic to SeaweedFS.

The fix is upstream in UC. Updated the README and the master-role
test's t.Logf to say so precisely, and dropped the stale "Spark client"
bullet (delta-rs covers that path).

* test(s3tables): use BaseEndpoint instead of deprecated resolver

EndpointResolverWithOptions is deprecated in aws-sdk-go-v2; the
supported way to override a service endpoint is via the per-service
Options.BaseEndpoint. Switch the assume-role helper to that pattern so
the test stops compiling against deprecated API and the resolver
boilerplate disappears.

Addresses gemini review on PR #9308.

* test(s3tables): drop unused splitS3URI helper

Helper had no callers; gemini caught it on PR #9308. Easy to bring
back from git history if needed.

* test(s3tables): extract last token of docker run output as container ID

docker run -d may prefix the container ID with image-pull progress
when the image isn't cached locally. strings.TrimSpace on the whole
output then gave a multi-line string, not the ID. Take the last
whitespace-separated token so the ID survives a fresh CI runner.

Addresses gemini review on PR #9308.

* test(s3tables): cap Unity Catalog response body reads at 10 MiB

io.ReadAll without a limit could OOM the test runner if the UC
container hands back an unexpectedly large body. 10 MiB is well
above any well-formed catalog response and turns a misbehaving
server into a test failure instead of a runner crash.

Addresses gemini review on PR #9308.

* docs: link UC fix PR and call out UC's mocked-Sts test pattern

UC's own credential-vending tests substitute StsClient with an in-process
EchoAwsStsClient (BaseCRUDTestWithMockCredentials) or Mockito.mockStatic
(CloudCredentialVendorTest), so the wire path between UC's Java SDK and
a real STS server is untested -- which is why the missing endpointOverride
slipped through upstream. Linked the upstream fix at
unitycatalog/unitycatalog#1532.
2026-05-04 21:14:22 -07:00
Chris Lu
6d95a5592a test(s3/policy): stop racing t.TempDir cleanup against mini shutdown
The mini cluster's admin/plugin worker keeps creating files under
admin/plugin/job_types/ for ~1s after subtests finish, while the
previous Stop() only cancelled an unobserved context and slept 500ms.
t.TempDir()'s registered RemoveAll then raced the worker and
intermittently failed with "directory not empty" (CI run 25352039081).

Manage the data dir manually so it is removed only after the mini
goroutine has exited, and wire MiniClusterCtx so cancel actually drains
master/volume/filer/admin/s3/webdav.
2026-05-04 20:16:53 -07:00
Chris Lu
0a91b57f16 fix(s3): encrypt SSE-S3 KEK at rest with AES-GCM wrapping (#8880)
* fix(s3): encrypt SSE-S3 KEK at rest using passphrase-derived wrapping key

* fix(s3): surface KEK migration failures instead of silently dropping them

The legacy-plaintext -> encrypted-at-rest path used to swallow both
wrapKEK and updateKEKContent errors. An operator who configured a
passphrase but had a filer permission issue (or a wrap failure) saw
nothing in the logs and the KEK stayed on disk in plaintext, with the
migration retried on every restart and silently failing every time.

Log each failure path explicitly so the unmigrated state is visible.
The server still starts with the in-memory key loaded — refusing to
boot here would be worse than the warning.

Addresses gemini and coderabbit reviews on PR #8880.

* fix(s3): use a per-installation random salt for KEK wrapping HKDF

The original implementation hardcoded
"seaweedfs-sse-s3-kek-wrapping-v1" as the HKDF salt for the KEK
wrapping key. Two SeaweedFS installations using the same passphrase
therefore produced byte-identical wrapping keys, opening a
precomputation/rainbow-table angle against weaker passphrases.

Generate a random 32-byte salt every time wrapKEK runs and embed it
in the on-disk payload alongside the AES-GCM ciphertext. The new
format is base64(magic("SWv2") || salt || nonce || ciphertext+tag);
unwrapKEK detects the magic and reads the salt out of the payload.
KEKs wrapped under the legacy fixed-salt format still unwrap cleanly
and are opportunistically re-wrapped into v2 on the next load so
operators get the stronger format without manual migration.

Addresses gemini review on PR #8880.

* fix(s3): plumb KEK passphrase from env into the global key manager

InitializeGlobalSSES3KeyManager used to ignore the kekPassphrase
field entirely — the global manager was always constructed with the
empty constructor, so the encrypted-at-rest code path never engaged
in production. Read the passphrase from WEED_S3_SSE_KEK_PASSPHRASE
and apply it before InitializeWithFiler so the load path picks up the
encrypted format. Log a warning when the env var is unset to make the
plaintext fallback visible to operators upgrading from earlier builds.

Adds SetKEKPassphrase as the public seam used by the global init and
by tests, plus regression tests for the wrap/unwrap round-trip,
random-salt independence across managers sharing a passphrase, and
the no-passphrase fallback that preserves the legacy hex-decode path.

Addresses coderabbit review on PR #8880.

* fix(s3): drop redundant base64 decode in KEK migration check

unwrapKEK already does the base64 decode and the magic-prefix check;
isV2WrappedKEK was repeating both passes purely so the migration
branch in loadSuperKeyFromFiler could ask "was this v1 or v2". Have
unwrapKEK return the version flag directly and delete the redundant
helper. Single decode pass per load.

Addresses gemini review on PR #8880.

* fix(s3): updateKEKContent must overwrite, not create

filer_pb.MkFile maps to CreateEntry, which is O_EXCL: it fails with
ErrEntryAlreadyExists when the file already exists. Both KEK migration
paths (legacy v1→v2 rewrap and plaintext→encrypted) call
updateKEKContent against the entry they just read, so MkFile errored
out every time and the migrations only ran in memory while the on-disk
KEK stayed in its old format. The previous commit logged the failure
loudly but the result was the same: a pre-existing deployment never
got migrated.

Switch updateKEKContent to LookupDirectoryEntry + UpdateEntry so the
overwrite actually persists. Surface the lookup/update errors so
the caller's existing "migration failed; KEK still on disk in old
form" warnings fire on the right cases.

Addresses coderabbit critical review on PR #8880.

* chore(s3): drop unused generateAndSaveSuperKeyToFiler

Master removed this as dead code in #8913 and I reintroduced it during
the merge resolution thinking the migration paths still needed it.
On second look it has no callers in this branch — every KEK-creation
path on PR #8880 goes through the existing reader code that handles
"file not present, generate" inline. Drop the duplicate.

Addresses gemini medium review on PR #8880.

* fix(s3): updateKEKContent honours km.kekPath instead of hardcoded path

The migration write path was always pointing at /etc/s3/sse_kek even
when the manager was configured with an operator-overridden kekPath.
Split km.kekPath at the last "/" so the lookup + UpdateEntry land on
the same file the read path used. Defaults match defaultKEKPath when
kekPath is unset.

Addresses gemini medium review on PR #8880.

* fix(s3): KEK passphrase via Viper key, with env-var fallback

The KEK passphrase was read straight from os.Getenv, but every other
SSE-S3 secret (s3.sse.kek, s3.sse.key) goes through Viper so an
operator can set them in security.toml or via WEED_ env vars
interchangeably. Add s3.sse.kek.passphrase to the same path; the
existing SSES3KEKPassphraseEnv lookup stays as a fallback so
deployments wired before this commit keep working.

Addresses gemini medium review on PR #8880.
2026-05-04 19:21:41 -07:00
Chris Lu
e1d5e3899f fix(s3): add HMAC-SHA256 key commitment to SSE-S3 and SSE-KMS (#8879)
* fix(s3): add HMAC-SHA256 key commitment to SSE encryption modes

* fix(s3): validate SSE-S3 IV length before cipher.NewCTR

CreateSSES3DecryptedReader took the IV directly from object metadata
and passed it to cipher.NewCTR. The standard library panics when the
IV length differs from the block size, so a tampered metadata field
turned what should be a decryption error into a crash. Validate via
the existing ValidateIV helper first.

Addresses coderabbit review on PR #8879.

* fix(s3): centralize SSE-KMS key commitment in createSSEKMSKey

KeyCommitment used to be set explicitly in only one of the four
encryption entry points; the multipart-aware variants and the bucket-
bound CreateSSEKMSEncryptedReaderForBucket left it empty, so writes
through those paths landed with metadata that VerifyKeyCommitment
treats as legacy and accepts unconditionally — exactly the downgrade
gap the gemini review flagged.

Move the HMAC computation into createSSEKMSKey so every helper-driven
path picks it up automatically, and add the same call to the bucket-
scoped path that builds its own struct literal.

Addresses gemini review on PR #8879.

* fix(s3): store base IV (not derived IV) for offset-aware SSE-KMS chunks

CreateSSEKMSEncryptedReaderWithBaseIVAndOffset used to store the
already-offset-derived IV plus the chunk offset in metadata. The
decrypt path then applied calculateIVWithOffset a second time to
that stored IV, producing the wrong CTR keystream and a decryption
mismatch. Centralizing the key commitment made the bug visible as a
commitment failure, but the underlying issue predates that change.

Pass the base IV through to createSSEKMSKey so sseKey.IV is the
unmodified base on disk; the decrypt path's offset application then
recovers the correct chunk-level IV. The HMAC commitment binds the
base IV — same value the verify call at decrypt time hashes — so the
new commitment path stays consistent.

Addresses gemini security-high review on PR #8879.

* fix(s3): opt-in strict-commitment mode to close the downgrade vector

WEED_S3_REQUIRE_KEY_COMMITMENT=true flips VerifyKeyCommitment from
accept-when-missing (the AWS-compatible default needed for objects
written before commitments shipped) to reject-when-missing. With
the env var set, an attacker who strips the commitment field from
object metadata can no longer bypass integrity verification — every
object must carry a commitment that hashes to the right value.

Default stays false so existing legacy objects keep decrypting; the
warning the gemini review raised about the silent-downgrade vector
is closed for operators who explicitly opt in once their bucket is
fully migrated. SetRequireKeyCommitment exposes a runtime seam for
tests and future config-reload paths.

Addresses the security-medium review on PR #8879.

* test(s3): fix mislabelled IV literal in strict-commitment test

The "strict-mode-iv-16" literal was actually 17 bytes — the trailing
"16" was meant as a comment but counted as content. The IV is fed
into HMAC, not AES, so the length didn't matter for behaviour, but
the discrepancy was confusing. Tighten to a real 16-byte literal and
explain the choice in a comment.

Addresses coderabbit minor review on PR #8879.

* fix(s3): store KMS-resolved KeyID in SSE-KMS metadata, not the request

CreateSSEKMSEncryptedReaderForBucket built its SSEKMSKey with the
caller-supplied keyID, but CreateSSEKMSDecryptedReader later compares
decryptResp.KeyID against sseKey.KeyID. A request that used an alias
would resolve to a different ARN in the response; storing the request
form would then trip the mismatch check at decrypt time and surface
as a "KMS key ID mismatch" error against the operator's own object.
The helper-driven encryption paths already do the right thing via
createSSEKMSKey; this is the bucket-bound path catching up.

Addresses coderabbit review on PR #8879.

* test(s3): cover key commitment rejection paths under tampering

Adds the negative-path tests coderabbit flagged as missing: a
tampered key, IV, algorithm, or commitment field must fail
VerifyKeyCommitment, otherwise a regression in the rejection logic
could land silently. The HMAC binds all three inputs plus the
commitment itself, so any single mutation is enough.

Addresses coderabbit nitpick on PR #8879.
2026-05-04 19:14:41 -07:00
dependabot[bot]
3ee147dc4d build(deps): bump cloud.google.com/go/kms from 1.26.0 to 1.30.0 (#9311)
Bumps [cloud.google.com/go/kms](https://github.com/googleapis/google-cloud-go) from 1.26.0 to 1.30.0.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md)
- [Commits](https://github.com/googleapis/google-cloud-go/compare/kms/v1.26.0...dlp/v1.30.0)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/kms
  dependency-version: 1.30.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2026-05-04 18:20:18 -07:00
Chris Lu
66d9b89cd2 fix(iam): deny IAM users with no policies instead of granting full access (#9317)
* fix(iam): deny IAM users with zero policies instead of falling through to DefaultEffect=Allow

A user created via the S3 IAM API with no policies attached was inheriting
full S3 access. With `weed s3 -iam` and no explicit IAM config, the policy
engine's DefaultEffect defaults to Allow for the in-memory zero-config path.
The "no matching statement" guard in IsActionAllowed only triggered when the
user already had at least one policy, so a fresh user with PolicyNames=[]
slipped through and got allow-all.

Track hasManagedSubject whenever the principal resolves to a registered user
or role (or PolicyNames are supplied directly) and deny on no-match. The
DefaultEffect=Allow fallback now only applies to truly unmanaged callers.

* test(iam): cover non-matching attached-policy case for managed-subject deny

* test(iam): cover role-with-empty-AttachedPolicies deny path

Sibling case to TestIsActionAllowed_RegisteredUserWithoutPoliciesIsDenied:
a managed role that resolves but has zero AttachedPolicies must also
fall through to deny under DefaultEffect=Allow, not inherit full access.
The fix already handles this branch via the hasManagedSubject flag;
this test pins the regression class so we don't lose coverage on
either side of the user/role split.

Addresses coderabbit nitpick on PR #9317.
2026-05-04 17:56:10 -07:00
Chris Lu
57bb7b2f39 quiet noisy 'shard X not found' log when EC shard lives on another server (#9316)
quiet "shard X not found" log when EC shard lives on another server

When EC shards are spread across multiple volume servers, every read
that targets a shard not present locally was logging at V(0) with the
same wording as a real read failure. Under rclone-style traffic this
floods the logs and makes a healthy EC cluster look broken (issue #9310).

Distinguish the two cases with an errShardNotLocal sentinel: the
expected fall-through-to-remote path now logs at V(4); genuine local
read failures still log at V(0).
2026-05-04 15:30:22 -07:00
Chris Lu
6aa353716a test(kafka): snapshot consumer-group state mid-attempt for resumption flake
The onRetry hook in TestOffsetManagement/ConsumerGroupResumption only fires
after defer reader.Close(), so every dump shows the post-LeaveGroup Empty
state — useless for diagnosing why the second consumer hangs in the join
cycle. Add an onTick callback fired every 1.5s while the reader is still
joined so we can see PreparingRebalance / CompletingRebalance churn,
leader, and member assignments during the 20s attempt window.
2026-05-04 15:29:33 -07:00
dependabot[bot]
3efd1e8974 build(deps): bump github.com/a-h/templ from 0.3.977 to 0.3.1001 (#9312)
Bumps [github.com/a-h/templ](https://github.com/a-h/templ) from 0.3.977 to 0.3.1001.
- [Release notes](https://github.com/a-h/templ/releases)
- [Commits](https://github.com/a-h/templ/compare/v0.3.977...v0.3.1001)

---
updated-dependencies:
- dependency-name: github.com/a-h/templ
  dependency-version: 0.3.1001
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-04 15:02:29 -07:00
dependabot[bot]
39cf3cf719 build(deps): bump cloud.google.com/go/storage from 1.60.0 to 1.62.1 (#9313)
Bumps [cloud.google.com/go/storage](https://github.com/googleapis/google-cloud-go) from 1.60.0 to 1.62.1.
- [Release notes](https://github.com/googleapis/google-cloud-go/releases)
- [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md)
- [Commits](https://github.com/googleapis/google-cloud-go/compare/compute/v1.60.0...storage/v1.62.1)

---
updated-dependencies:
- dependency-name: cloud.google.com/go/storage
  dependency-version: 1.62.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-04 15:02:18 -07:00
dependabot[bot]
dee1e12bcd build(deps): bump github.com/aws/smithy-go from 1.25.0 to 1.25.1 (#9314)
Bumps [github.com/aws/smithy-go](https://github.com/aws/smithy-go) from 1.25.0 to 1.25.1.
- [Release notes](https://github.com/aws/smithy-go/releases)
- [Changelog](https://github.com/aws/smithy-go/blob/main/CHANGELOG.md)
- [Commits](https://github.com/aws/smithy-go/compare/v1.25.0...v1.25.1)

---
updated-dependencies:
- dependency-name: github.com/aws/smithy-go
  dependency-version: 1.25.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-04 15:02:06 -07:00
dependabot[bot]
bfcbd5aa0f build(deps): bump github.com/klauspost/reedsolomon from 1.13.3 to 1.14.0 (#9315)
Bumps [github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon) from 1.13.3 to 1.14.0.
- [Release notes](https://github.com/klauspost/reedsolomon/releases)
- [Commits](https://github.com/klauspost/reedsolomon/compare/v1.13.3...v1.14.0)

---
updated-dependencies:
- dependency-name: github.com/klauspost/reedsolomon
  dependency-version: 1.14.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-04 15:01:57 -07:00
Chris Lu
1de741737d test(s3tables): add Apache Doris Iceberg catalog integration test (#9307)
* test(s3tables): add Apache Doris Iceberg catalog integration test

Adds an end-to-end smoke test that boots the apache/doris all-in-one
container, registers SeaweedFS as an external Iceberg REST catalog
(OAuth2 client_credentials), and validates metadata visibility plus the
parquet read path against tables seeded via the Iceberg REST API and a
PyIceberg writer container, mirroring the existing Trino, Spark, and
Dremio coverage. Wires the test into a new s3-tables-tests workflow job.

* test(s3tables): document weed shell -master flag format and fill in helper docstrings

Restores the explanatory comment on createTableBucket about the
host:port.grpcPort ServerAddress format used by `weed shell -master`
(produced by pb.NewServerAddress) so the dot separator isn't mistaken
for a typo, and adds doc comments for createIcebergNamespace,
createIcebergTable, doIcebergJSONRequest, requireDorisRuntime, and
hasDocker.
2026-05-04 00:28:30 -07:00
Chris Lu
90b706984e docs(readme): align Docker quick start with weed mini defaults
Replace the old "server -s3" form with a parallel docker run that
uses the same env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
S3_BUCKET) as the binary weed mini quick start. Drop the explicit
"mini" subcommand since it is the default CMD in the image.
2026-05-04 00:09:13 -07:00
Chris Lu
7f1ac8cf1a docs(readme): dedup Quick Start, drop "weed server" section
Remove the "Quick Start with Single Binary" section which duplicated
the binary-download instructions and pushed users toward "weed server".
Fold the go install alternative and the volume-scaling tip into the
weed mini section so there is one canonical Quick Start path.
2026-05-03 23:56:24 -07:00
Chris Lu
4e22d9533b docs(readme): highlight built-in Iceberg catalog and lakehouse use case
Surface the table-buckets / Iceberg REST Catalog story in the
introduction and add a Data Lakehouse Features section so readers
landing on the README see that SeaweedFS replaces both the object
store and the metastore in an Iceberg stack.
2026-05-03 23:46:08 -07:00