seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-07-26 10:03:13 +00:00

Author	SHA1	Message	Date
dependabot[bot]andGitHub	9cb103cd35	build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364 )	2026-05-08 05:59:26 -07:00
Chris LuandGitHub	b7928637a0	refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml (#9360 ) * refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml The structs S3 PutBucketLifecycleConfiguration parses and the canonical conversion to s3lifecycle.Rule lived in package s3api, which transitively imports weed/server (which imports weed/shell). Any caller outside weed/s3api — the shell, the future lifecycle worker — that wanted to parse a bucket's lifecycle XML hit an import cycle. Moves: weed/s3api/s3api_policy.go -> lifecycle_xml/types.go weed/s3api/s3api_lifecycle_canonical.go -> lifecycle_xml/canonical.go s3api_lifecycle_canonical_test.go -> lifecycle_xml/canonical_test.go s3api_policy_test.go -> lifecycle_xml/round_trip_test.go Renames the public RuleStatus type (was unexported ruleStatus) and adds small accessor methods (Set/Val/AndSet/TagSet) for fields the s3api handler needs to read across the package boundary. Adds NewPrefix and NewExpirationDays constructors so the GET handler can build response rules without poking at unexported fields. Adds a Tag struct local to the package so it has zero internal seaweed deps. Adds a one-shot ParseCanonical(xml []byte) helper for non-server callers. s3api_policy.go was misnamed — its content is lifecycle XML, not S3 bucket policy. The new package name reflects the actual scope. * test(s3api/lifecycle_xml): exercise public API in tests - canonical_test.go's parseLifecycle helper went through xml.Unmarshal directly; route it through the package's exported Parse so tests validate the public entrypoint. - round_trip_test.go asserted internal flags (rule.Filter.tagSet, rule.Filter.andSet, Transition.set, NoncurrentVersionTransition.set); switch to TagSet(), AndSet(), Set() — exercises the public contract that downstream callers (s3api handler, future shell command) rely on.	2026-05-07 18:54:06 -07:00
Chris LuandGitHub	c567da7164	feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359 ) Phase 2 added the LifecycleDelete handler on S3ApiServer but never registered it on a running gRPC server, so workers had no endpoint to dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on S3ApiServer and register it on the s3 command's grpc server alongside SeaweedS3IamCacheServer.	2026-05-07 18:19:42 -07:00
Chris LuandGitHub	35e3fe89bc	feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358 ) * feat(s3/lifecycle): filer-backed cursor Persister FilerPersister persists per-shard cursor maps as JSON to /etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler. One file per shard keeps Save atomic — the filer writes the entry in a single mutation, so a crash mid-write doesn't leak partial state. Pipeline.Run loads on start; the periodic checkpoint and graceful-shutdown save go through this implementation. A small FilerStore interface wraps the SeaweedFilerClient surface the persister needs, so tests inject an in-memory fake instead of mocking the full gRPC client. * refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block A frozen cursor doesn't advance, so the durable cursor (FilerPersister) encodes the blocked state on its own. On worker restart the reader re-encounters the poison event at MinTsNs, the dispatcher walks the same retry budget to BLOCKED, and the cursor freezes at the same EventTs. Other in-flight events between freeze tsNs and prior cursor positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the underlying objects were already deleted on the prior pass. Removed: - BlockerStore interface + InMemoryBlockerStore + BlockerRecord - Dispatcher.Blockers + Dispatcher.ReplayBlockers - the BlockerStore.Put call in handleBlocked - Pipeline.Blockers field + the ReplayBlockers call on startup Added a TestDispatchRestartReFreezesNaturally that pins the self-recovery property: a fresh Dispatcher with a fresh Cursor, fed the same poison event, reaches the same frozen state at the same EventTs without any durable blocker store. Operator visibility: a cursor whose MinTsNs hasn't advanced is the signal — surfaced via the durable cursor file. * refactor(filer): SaveInsideFiler accepts ctx ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background() internally and silently dropped the caller's ctx. Symmetric API now; cancellation/deadlines propagate through LookupEntry / CreateEntry / UpdateEntry. Mechanical update of all callers — most pass context.Background() since the existing call sites have no ctx in scope. * fix(s3/lifecycle): deterministic order in cursor save Iterating Go maps yields random order, so json.Encode produced a different byte sequence on each save even when the state hadn't changed. Sort entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk file diffs cleanly. New test pins byte-identical output across two saves of the same map. * fix(s3/lifecycle): log reason when freezing cursor in handleBlocked handleBlocked dropped the reason via _ = reason with a comment claiming the caller logged it; none of the three callers do. A frozen cursor is the only surface where the operator finds out something stuck, so the reason has to land somewhere. glog.Warningf with shard, key, eventTs, and the original reason — same shape the rest of the package uses.	2026-05-07 17:45:04 -07:00
Chris Lu	ec83a87d68	perf(s3/lifecycle): defer pool Put on ShardID hasher defer guarantees the hasher returns to the pool even if h.Write or h.Sum panic, preventing pool leak under unexpected failure modes.	2026-05-07 17:08:50 -07:00
Chris LuandGitHub	3a192c6c57	fix(s3/lifecycle): address Phase 3 post-merge review (#9354 #9355 #9356 ) (#9357 ) * fix(s3/lifecycle): reader handles bare /buckets parent and pre-normalizes prefix extractBucketKey accepted /buckets/ but rejected /buckets (no trailing slash); some delete events emit the bare form, so bucket-root events were silently dropped. Pre-normalize BucketsPath once on Run instead of recomputing per event. * perf(s3/lifecycle): pool sha256 hashers in ShardID ShardID runs on every meta-log event before the shard filter; a fresh sha256.New per call produces measurable allocator pressure under load. sync.Pool reuses hashers across calls. * fix(s3/lifecycle): router skips hard deletes and missing-attribute events A hard delete carries no schedule-relevant state — Expiration would hit NOOP_RESOLVED at dispatch and ExpiredObjectDeleteMarker fires from a Create on the latest version. Skip rather than burn a schedule slot. Missing Attributes leaves ModTime at year 0001, which makes ExpirationDays fire immediately at dispatch. Skip the event instead. Drop the unused 'versioned' parameter from buildObjectInfo; the dispatcher's identity-CAS handles version drift in Phase 5. * fix(s3/lifecycle): EntryIdentity.MtimeNs holds true nanoseconds Both computeEntryIdentity (server) and buildIdentity (router) wrote entry.Attributes.Mtime (seconds) into a field named MtimeNs. The CAS worked because both sides agreed, but the encoding contradicted the field name and would break if either side later started using true nanoseconds. Combine Mtime1e9 + the FuseAttributes.MtimeNs nanosecond component on both sides; the test was updated to match. fix(s3/lifecycle): dispatcher distinguishes ctx cancel from transport errors A canceled or deadline-exceeded RPC is shutdown, not a transport failure: re-queue the Match at its original DueTime with no retry-budget burn so a quick restart can't escalate it to BLOCKED. * fix(s3/lifecycle): reader fallback prefix normalization mirrors Run The fallback path that builds prefix from r.BucketsPath when bucketsPathSlash is empty (test-only entry into extractBucketKey) was appending an unconditional '/', producing '//' if BucketsPath already ended with one. Use the same normalization Run does. * fix(s3/lifecycle): ObjectInfo.ModTime carries the nanosecond component ModTime dropped FuseAttributes.MtimeNs, leaving ExpirationDays one nanosecond off relative to EntryIdentity.MtimeNs. Pass both to time.Unix so the precision matches the CAS witness.	2026-05-07 16:54:24 -07:00
Chris LuandGitHub	5c991f38f5	feat(s3/lifecycle): dispatcher + per-shard pipeline (Phase 3 PR-D) (#9356 ) feat(s3/lifecycle): dispatcher + blocker store + per-shard pipeline Dispatcher consumes due Matches from the schedule, calls LifecycleDelete, and routes outcomes: DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK -> Cursor.Advance RETRY_LATER (within budget) -> re-schedule with backoff RETRY_LATER (budget exhausted) / BLOCKED -> BlockerStore.Put + Freeze BlockerStore is a small interface with InMemoryBlockerStore for tests; the filer-backed impl follows when the worker task registration lands. Pipeline composes Reader + Router + Dispatcher into a single Run loop keyed by shard. Cursor is restored on start, blockers are replayed as freezes, checkpoints write at a configurable cadence, and a final save fires on shutdown. The meta-log itself is the durable buffer for in-flight schedule entries — restart re-derives them from the cursor's MinTsNs.	2026-05-07 15:44:09 -07:00
Chris LuandGitHub	8425c42858	feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355 ) feat(s3/lifecycle): event router + DueTime schedule Router consumes per-shard reader events, looks up matching ActionKeys via the engine's BucketActionKeys index, and emits Matches with DueTime = event_time + action.Delay. Evaluation runs at DueTime so the age gate passes for fresh events; the dispatcher's identity-CAS catches drift. Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor added to engine.Snapshot.	2026-05-07 15:43:27 -07:00
Chris LuandGitHub	0f6c6b0524	feat(s3/lifecycle): shard-aware meta-log reader (Phase 3 PR-B) (#9354 ) feat(s3/lifecycle): shard-aware meta-log reader - ShardCount=16; ShardID(bucket,key)=top-4-bits of sha256(bucket\|\|/\|\|key) - Reader subscribes via SubscribeMetadata starting at Cursor.MinTsNs(), filters events by shard, emits to caller-owned Events channel - Cursor: per-(shard, ActionKey) position with monotonic Advance, Freeze for blocked actions, MinTsNs for subscription resume - Persister interface with InMemoryPersister for tests; filer-backed impl lands with the worker integration	2026-05-07 15:42:37 -07:00
Chris LuandGitHub	3a76c0b027	feat(worker/proto): per-shard READ — add S3LifecycleParams.shard_id (#9353 ) READ moves from a cluster-singleton to one task per (bucket, key-prefix-hash) shard. Cluster has 16 shards; workers receive one READ task per owned shard. Required for READ; ignored for BOOTSTRAP and DRAIN.	2026-05-07 15:29:12 -07:00
Chris LuandGitHub	4f79d8e358	feat(s3/lifecycle): bucket-level bootstrap walker (#9350 ) * feat(worker): add TaskTypeS3Lifecycle constant Single job type for the lifecycle worker; the S3LifecycleParams.Subtype field (READ / BOOTSTRAP / DRAIN) dispatches inside the handler. The "s3_lifecycle" string is already wired to LaneLifecycle in admin/plugin/scheduler_lane.go so adding the constant doesn't change runtime behavior — it lets future commits reference the type name without sprinkling string literals. * feat(s3/lifecycle): bucket-level bootstrap walker Iterates entries in a bucket, evaluates every active ActionKey in the engine snapshot against each entry, and dispatches inline-delete for currently-due actions. Date-kind actions and pending_bootstrap actions are skipped — the former are handled by their own SCAN_AT_DATE bootstrap, the latter aren't IsActive() yet. Walker is callback-driven so callers supply the listing source (real filer_pb.SeaweedList or test fake) and the dispatcher (real LifecycleDelete client or test fake). This keeps the walker free of filer_pb dependencies and makes the per-action evaluation flow unit-testable in isolation. Checkpoint state (LastScannedPath, Completed) is returned to the caller, who is responsible for persisting it under /etc/s3/lifecycle/<bucket>/_bootstrap. Walk() honours opts.Resume so a kill-resumed task picks up where the previous walker stopped. Tests cover: prefix-mismatched skip, not-yet-due skip (reader's job), date-kind skip, pending_bootstrap skip, multi-action rule (one rule with three actions dispatches three times — the regression that per-action keying fixes), dispatch error halts at last-successful checkpoint, Resume skips entries up to and including the resume path. * test(s3/lifecycle): walker test uses bucket-scoped ActionKey Mechanical follow-up to the bucket-scoped ActionKey on lifecycle-engine: the bootstrap walker tests construct ActionKeys to seed PriorStates and need the Bucket field to match what engine.Compile keys against. * fix(s3/lifecycle): walker quick wins Two minor cleanups noted on review: - Drop the redundant Resume re-filter inside the Walk callback. ListFunc's contract already promises "skip entries with Path <= start"; trusting that contract avoids divergence if the filter logic ever changes on one side and not the other. - Hoist the ObjectInfo allocation out of the per-action loop in walkEntry. Multi-action rules previously allocated one ObjectInfo per (entry, kind) pair; now it's one per entry, reused across all matching kinds. * fix(s3/lifecycle): walker Entry.NoncurrentIndex tracks ObjectInfo's int ObjectInfo.NoncurrentIndex is now int so unset is unambiguous; mirror that on bootstrap.Entry so the per-entry construction stays type-clean. Phase 5 (versioned-bucket walks) is the first caller that will populate the field. * refactor(s3/lifecycle): trim narration from bootstrap walker Drop the inline step-by-step on Walk and the multi-paragraph package preamble; the function names already say it. Keep one-liner WHYs at the SCAN_AT_DATE skip and the once-per-entry ObjectInfo build. * fix(s3/lifecycle): walker skips directories and ModeDisabled actions Two safety findings from review: 1. SeaweedFS directory entries can appear in the listing alongside objects; without an IsDirectory check the walker would treat a dir like any other entry and could dispatch a delete against it. Add IsDirectory to bootstrap.Entry and short-circuit it before walkEntry. 2. ModeDisabled is set by the operator (e.g. shell pause) independent of the XML rule's Status field. EvaluateAction gates on Status and would still fire for an operator-disabled action whose XML status is "Enabled". Skip ModeDisabled explicitly in walkEntry alongside the existing SCAN_AT_DATE skip. Two regression tests pin both cases. * perf(s3/lifecycle): reuse ObjectInfo across walker entries Walker allocated one ObjectInfo struct per entry. For buckets with millions of objects that's measurable GC pressure. Hoist the allocation out of the per-entry callback (one per Walk) and reuse via field assignment in walkEntry. EvaluateAction reads ObjectInfo synchronously and doesn't retain a reference, so the reuse is safe — the next iteration's overwrite can't corrupt an in-flight evaluation. * refactor(s3/lifecycle): trim narration on walker Drop the multi-line Entry / ObjectInfo-reuse / SCAN_AT_DATE+DISABLED explanations. The walker's structure is small enough that the condition itself reads as the documentation.	2026-05-07 15:04:51 -07:00
Chris LuandGitHub	5ab3860005	feat(s3/lifecycle): LifecycleDelete RPC server (#9349 ) * feat(s3/lifecycle): SeaweedS3LifecycleInternal.LifecycleDelete RPC Adds the worker-to-S3 RPC the lifecycle worker calls to execute one (rule, action) verdict against one entry. The S3 server is the only component allowed to mutate filer state, so it gets the final word: re-fetch, identity CAS, object-lock check, dispatch. LifecycleDeleteRequest carries the routing tuple (bucket, object_path, version_id, rule_hash, action_kind), the per-stream context echoed into BlockerRecord on FATAL outcomes, and an EntryIdentity CAS witness that lets the server detect mid-flight drift (mtime, size, head fid, sorted-Extended hash). LifecycleDeleteOutcome covers all five worker-actionable verdicts: DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK / RETRY_LATER / BLOCKED. Cursor-advance / pending-mutation rules per outcome are documented inline. Makefile updated to emit the gRPC stubs alongside the message types. * feat(s3/lifecycle): LifecycleDelete server handler Server-side handler for the worker-to-S3 RPC. Five-step flow: 1. Re-fetch live entry (NOT_FOUND -> NOOP_RESOLVED). 2. CAS on EntryIdentity (mtime / size / head fid / sorted-Extended sha256). Mismatch -> NOOP_RESOLVED with reason STALE_IDENTITY. 3. enforceObjectLockProtections with governanceBypassAllowed=false. Lifecycle never bypasses legal-hold or compliance retention; the safety scan re-attempts after the hold lifts. Lock refusal -> SKIPPED_OBJECT_LOCK (logged + counted, cursor advances). 4. Dispatch by action_kind: - EXPIRATION_DAYS / EXPIRATION_DATE: createDeleteMarker on versioned buckets, deleteUnversionedObjectWithClient otherwise. - NONCURRENT_DAYS / NEWER_NONCURRENT / EXPIRED_DELETE_MARKER: deleteSpecificObjectVersion (the marker is just a version). - ABORT_MPU: stub returning RETRY_LATER pending Phase 5 wiring. 5. Helper failures the server can't classify as transient -> BLOCKED with reason "FATAL_EVENT_ERROR: <detail>"; the worker writes a durable BlockerRecord and pauses the failing stream. Filer fetch transport errors -> RETRY_LATER (sustained transients eventually promote to BLOCKED via the worker's retry budget). hashExtended uses length-prefixed encoding so a forged Extended payload (e.g. one tag value containing the byte sequence of two real tags) can't collide with a real two-tag map. Regression test pins this. Tests cover identity-comparison fields, hashExtended order/delimiter stability, empty-request rejection. Full filer-integration tests come in Layer 2. * fix(s3/lifecycle): use stdlib bytes.Equal for ExtendedHash compare Replaces a hand-rolled byte slice comparator with bytes.Equal — idiomatic, slightly faster (the stdlib version is intrinsified on amd64/arm64), and one fewer test surface to maintain. * refactor(s3api): trim narration from lifecycle RPC + canonicalizer Drop step-by-step doc-block narrations on LifecycleDelete and the helpers that reproduced what the code already says. Keep WHY one- liners at non-obvious spots: the http.Request=nil safety claim, why hashExtended is length-prefixed, the safety-scan revisit contract for SKIPPED_OBJECT_LOCK, the TODO marker for ABORT_MPU. * refactor(s3/lifecycle): retryLater helper, drop inline literals Adds retryLater() alongside done/noopResolved/blocked so the four LifecycleDeleteOutcome constructions look the same. Replaces the two inline RETRY_LATER literal returns (live-fetch transport error and the ABORT_MPU stub) with the helper. No behavior change. * fix(s3/lifecycle): default filer-error classification to RETRY_LATER Versioning lookup, createDeleteMarker, deleteUnversionedObjectWithClient, and deleteSpecificObjectVersion all do filer round-trips. Most failures are transient (filer unavailable, slow, network blip), not deterministic per-event errors. Returning BLOCKED for them was too aggressive: BLOCKED halts the stream until an operator intervenes, which would surface on every transient filer hiccup. Default these paths to RETRY_LATER. The worker's retry budget already promotes sustained transients to BLOCKED after a configured threshold, which is the right place for that escalation. BLOCKED stays for the truly deterministic error: the request-shape check (missing version_id on noncurrent delete) and the unknown-action-kind dispatch fallback. * fix(s3/lifecycle): honor versioning Suspended on current-version expiration Treating Suspended-versioning buckets like never-versioned ones was wrong: per AWS S3, current-version expiration on Suspended must remove the existing null version and insert a new delete marker (the same behavior a user DELETE produces). The previous code branched only on isVersioningEnabled() — which returns false for Suspended — and dropped through to the actual-delete path, losing the version history that Suspended buckets are still expected to keep. Switch to getVersioningState() and dispatch on three branches: Enabled -> createDeleteMarker (current becomes non-current) Suspended -> deleteSpecificObjectVersion("null"), createDeleteMarker Off / never configured -> deleteUnversionedObjectWithClient The Suspended path's null-version deletion tolerates NotFound (the null version may not exist when this is the first delete after a versioning toggle); other errors classify as RETRY_LATER like the rest of the dispatch. * refactor(s3/lifecycle): trim narration on LifecycleDelete Drop the 6-line versioning-state docblock and the inline "transient -> RETRY_LATER" rationales; keep one-liners. * fix(s3/lifecycle): bucket-not-found is NOOP; include ErrObjectNotFound Three follow-ups from review (the rest were already addressed by prior commits): - Versioning lookup now distinguishes filer_pb.ErrNotFound (BUCKET_NOT_FOUND -> NOOP_RESOLVED) from genuinely transient failures (RETRY_LATER). A bucket deleted between live-fetch and versioning-lookup is benign, not transient. - deleteUnversionedObjectWithClient and deleteSpecificObjectVersion now also recognise ErrObjectNotFound as a resolved-NOOP, matching the live-fetch step's classification.	2026-05-07 15:03:33 -07:00
Chris LuandGitHub	7f2b20d577	feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348 ) * feat(s3/lifecycle): XML lifecycle config to canonical Rule LifecycleToCanonical takes a parsed Lifecycle and returns []s3lifecycle.Rule, the flat shape the engine compiles against. Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags + size filters) flatten into the canonical Rule's individual fields; single <Tag> filter populates FilterTags with one entry; <Prefix> filter takes precedence over the rule's top-level <Prefix>. Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on the same XML <Rule>) populate every action field they declare. RuleActionKinds expands the canonical rule into its compiled actions downstream. * feat(s3/lifecycle): engine snapshot skeleton + ActionKey type Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's primary identity, and adds the engine package's Snapshot type. Snapshot is immutable after Compile (atomic-swapped on rebuild) and holds the ActionKey-keyed routing indexes: - originalDelayGroups: map[time.Duration][]ActionKey - predicateActions: []ActionKey - dateActions: map[ActionKey]time.Time - actions: map[ActionKey]CompiledAction CompiledAction.engineState is an atomic.Uint32 so MarkActive (called after the durable bootstrap_complete + mode write commits) is visible to in-flight reader passes without a recompile. The reader filters on IsActive() before dispatching, so stale-snapshot dispatches are prevented. No callers yet; downstream commits add Compile, decideMode, and the Match functions. feat(s3/lifecycle): decideMode + retention gate decideMode picks the scheduling mode for one (rule, kind) compiled action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE; reader-driven kind whose eventLogHorizon + bootstrapLookbackMin exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The gate runs per (rule, kind), so a 90d ExpirationDays sibling can degrade to scan_only while its 7d AbortMPU sibling stays active. MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS default (Phase 0 verified that meta-log files are written without TtlSec by default), so the gate doesn't trip until an operator opts in to volume-TTL pruning of /topics/.system/log/. RuleMode is a Go-level enum here, separate from the wire-form LifecycleState.RuleMode in the proto package; the worker maps between them when reading/writing the durable state file. * feat(s3/lifecycle): Compile builds the engine snapshot per-action Compile produces a fresh Snapshot from per-bucket canonical rules. Each input rule expands into N CompiledActions via RuleActionKinds; mode comes from decideMode; activation requires both bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN. Routing indexes are populated by mode: - SCAN_AT_DATE: always indexed in dateActions (detector schedules at rule.date regardless of bootstrap status; the action runs once on the date and is then done). - EVENT_DRIVEN + active: indexed in originalDelayGroups (and in predicateActions when the rule has tag/size filters). - SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan tick or operator action handle these. snapshot_id is monotonic per process; pending writes stamp it. The new snapshot replaces the engine's atomic pointer; in-flight reader passes continue against their loaded snapshot. Tests cover: single-action rule, multi-action expansion (one rule -> three CompiledActions with three distinct delay groups), pending bootstrap exclusion from indexes, retention gate, sibling actions degrading independently under partial retention, ExpirationDate path, disabled rule, MarkActive flipping IsActive(), Compile producing monotonic snapshot ids. * feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath The reader feeds events through the engine's match functions to find the active ActionKeys whose filter applies. The minimal Event shape the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker, IsMPUInit) keeps engine free of filer_pb dependencies; the reader extracts these fields from the persisted filer_pb.LogEntry payload in Phase 3. - MatchOriginalWrite: per-delay-group sweep entry. Filters on shape = EventShapeOriginalWrite, prefix, tag, size, then per-kind shape gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on IsLatest+IsDeleteMarker). - MatchPredicateChange: single near-now sweep. Returns only the predicate-sensitive subset of active ActionKeys. - MatchPath: bucket-level walker entry. Returns every active action whose filter matches; bootstrap iterates these per object and calls EvaluateAction per kind. All filter on a.IsActive() at routing time so MarkActive flips become visible without recompile. fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile Three findings on the engine PR addressed: 1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind} only. Two buckets with rules whose XML is identical produce the same RuleHash; the second bucket's Compile would overwrite the first bucket's CompiledAction in snap.actions. Add Bucket to ActionKey so the engine's identity matches the on-disk path layout /etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression test pins it. 2. Major (immutability leak): OriginalDelayGroups, PredicateActions, DateActions returned the snapshot's internal maps/slices by reference, letting an external caller mutate routing state and break the documented immutability contract. Return defensive copies. 3. Minor (redundant condition): mode==EVENT_DRIVEN already implies kind != EXPIRATION_DATE because decideMode routes the date kind to SCAN_AT_DATE. Drop the redundant check. Tests updated to construct ActionKey with the new Bucket field. * fix(s3/lifecycle): drop size filters from rulePredicateSensitive An object's size is immutable once written: any content change is a fresh write that flows through the original-write stream, not the predicate-change one. Tagging rules really can flip post-PUT (operator adds/removes a tag without rewriting), so they belong; size filters do not. Including size filters here was adding rules to predicateActions for no purpose — every predicate-change sweep would waste cycles re-evaluating size predicates that physically can't have changed. * perf(s3/lifecycle): pre-sort AllActions at Compile time Snapshot is immutable after Compile (engineState bit-flips don't change membership), so the (bucket, rule_hash, action_kind) ordering is stable for the snapshot's lifetime. Build the sorted slice once and serve every AllActions() call from it; drop the per-call sort.Slice. The bootstrap walker is the primary caller and may iterate this on every task entry. * docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan> explicitly excludes 0-byte objects, but with the int64 zero value as the unset sentinel we can't distinguish that from omitted-and-default. Document the limitation inline so a future deployment that needs the distinction can switch to int64 (or a paired set-bool) and update the matchers / RuleHash accordingly. Not fixing now: the explicit-zero configuration is unusual, the canonical Rule shape mirrors the same zero-as-unset convention as s3api.Filter, and a structural fix touches every filter-using site (evaluator, due_at, match, RuleHash). fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex int The previous int field had a zero-value collision: 0 is both "newest non-current version" (a valid index) and "uninitialised by ObjectInfo{} literal." A caller who built &ObjectInfo{IsLatest: false} without explicitly setting NoncurrentIndex would have it implicitly read as "newest non-current," and the count-based NewerNoncurrent retention would use that bogus 0 to decide eligibility. Switch to int so nil is explicitly "not a non-current version / index not yet computed." The evaluator's NoncurrentDays and NewerNoncurrent paths conservatively return ActionNone when the index is nil — the safety scan will revisit once the index is supplied. This removes a class of latent footguns in test setup and in any future code path that constructs ObjectInfo without a versioning-aware builder. idx() helper added in tests to keep the call sites a one-liner. * refactor(s3/lifecycle): trim narration from engine + helpers Drop "what" comments where well-named identifiers already say it (IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph "why" docs to one-liners where the design rationale is already in the design doc. Keep WHY comments only at non-obvious load-bearing spots: the routing-index activation predicate, the int rationale on NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay horizon rule. Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go, min_trigger_age.go, event_log_horizon.go, engine/engine.go, engine/compile.go, engine/match.go, engine/mode.go. No behavior change; tests untouched and pass. fix(s3/lifecycle): durable PriorState.Mode wins over decideMode PriorState.Mode was declared but never read; Compile recomputed mode via decideMode and stored that on every CompiledAction. Effect: an action durably persisted as SCAN_ONLY (lag fallback or operator pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the next engine rebuild as soon as decideMode's XML+retention predicate said so. Defeats the durability of mode state. Use prior.Mode when set; fall through to decideMode only for new actions (no prior at all) and for legacy entries persisted before Mode existed (zero value). Regression test pins both branches. * fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key MarkActive's documented contract was "flip visible without a recompile," but the routing indexes (originalDelayGroups, predicateActions) were only populated when active && mode == EVENT_DRIVEN at compile time. So a key compiled with BootstrapComplete=false would never enter the indexes; a later MarkActive flipped engineState but MatchOriginalWrite / MatchPredicateChange iterated the indexes and never saw the key. Only MatchPath (which walks bi.actionKeys) and DateActions worked. Index every EVENT_DRIVEN key regardless of `active`. The runtime IsActive() filter inside filterMatching already gates dispatch, so inactive entries are matched-but-not-fired; flipping MarkActive makes them routable without recompile, matching the documented contract. Tests updated: TestCompile_BootstrapPendingIndexedButInactive asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable asserts a MarkActive flip routes the next match. * test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior Two regression tests for the int pointer migration: nil index combined with NewerNoncurrent (either paired with NoncurrentDays or standalone) must short-circuit to ActionNone rather than guess at the version's position in the keep-N window. refactor(s3/lifecycle): trim follow-up narration on engine + helpers Comments accumulated since the last sweep — the durable-Mode rationale, the MarkActive routability note, the routing-index doc, the NoncurrentIndex pointer rationale, and the EvaluateAction docblock. Trimmed each to one or two terse lines; the underlying contracts live in the design doc. * docs(s3/lifecycle): note CompileInput one-per-bucket invariant	2026-05-07 15:00:49 -07:00
Chris LuandGitHub	b9bf45cb2e	fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347 ) * fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection Closes #9345. -volumeId only filtered which volume .idx files were pulled; the filer-side BFS still walked from "/", printing every directory under -v and making it look like the flag was ignored. When all requested volumes share a single non-empty collection that maps to an existing <bucketsPath>/<collection> directory, restrict the BFS root to that bucket. Empty-collection volumes or multi-collection selections fall back to the full walk, since chunks for those can live anywhere. * trim comments * address review: collapse getCollectFilerFilePath; unshadow receiver in loop	2026-05-07 10:04:01 -07:00
Chris LuandGitHub	4e10669221	docs(s3/lifecycle): event-driven redesign (#9346 ) * docs(s3/lifecycle): event-driven redesign Replaces the synchronous PUT-handler walk with an event-driven worker model: meta-log reader subscribed to one filer, client-side heap merge with per-filer-shard MessagePosition cursors, bucket-level bootstrap with inline delete, blocked-cursor handling for fatal events, durable retry budget for sustained-transient promotion, retention mode gate that downgrades reader-driven rules to scan_only when log retention falls below the rule's event-log horizon. * docs(s3/lifecycle): record Phase 0 verified assumptions ReadPersistedLogBuffer payload carries Extended (event marshaled via SubscribeMetadataResponse → ToProtoEntry). Meta-log files at topics/.system/log/<date>/<HH-MM>.<filerId> are written without TtlSec in filer_notify_append.go; retention is unbounded by default and only shrinks if an operator sets a filer.conf rule with a TTL on the SystemLogDir prefix. .versions/ filenames are v_<16h-ts><16h-rand> with old/new (inverted) format distinguished by threshold 0x4000000000000000; getVersionTimestamp / compareVersionIds give format-agnostic ordering for successor-version discovery. * feat(s3/lifecycle): rule evaluator and dueAt helper Evaluate(rule, info, now) returns the EvalResult by object shape: IsMPUInit -> AbortMultipartUpload, IsDeleteMarker -> ExpireDeleteMarker when sole survivor, IsLatest -> DeleteObject (Days or Date), non-current -> DeleteVersion (Days fallback to ModTime when SuccessorModTime is zero; NewerNoncurrent retention enforced when both are set). ComputeDueAt mirrors Evaluate's shape and returns the earliest eligible wall-clock time for the same (rule, info), used by the reader/bootstrap to decide pending vs inline-delete. Adds StatusEnabled/Disabled and SmallDelay consts and an IsMPUInit flag on ObjectInfo so .uploads/<id>/ entries route to AbortMPU without overloading IsLatest. No callers; package compiles standalone. * feat(s3/lifecycle): MinTriggerAge for safety-scan cadence Returns the smallest non-zero day threshold across the rule's actions. Used as max(MinTriggerAge, kindFloor) for the per-kind cadence; date, count-only, and delete-marker-only rules return 0 so callers fall through to their kind-specific floor. * feat(s3/lifecycle): EventLogHorizon for retention mode gate Returns the maximum event age the reader needs for a rule. Days-based kinds return their day threshold; pure NewerNoncurrent (count) and ExpiredObjectDeleteMarker return SmallDelay. Date rules return 0 (the gate skips them). Multi-action rules take the max — strictest horizon wins. Drives the Phase 2 mode gate: metaLogRetention < EventLogHorizon(rule) + bootstrapLookbackMin -> scan_only with RETENTION_BELOW_HORIZON. * feat(s3/lifecycle): RuleHash for per-rule state CAS sha256 over canonicalized form, first 8 bytes. Stable across tag-key reorder, prefix trailing-slash variation, ID renames, and Status flips (state continuity is preserved when an operator toggles Enabled/Disabled). Different action shapes — different days, filter, or action type — hash differently. Used by the per-rule state directory layout /etc/s3/lifecycle/<bucket>/<rule_hash_hex>/ and by the bootstrap detector's reconcile-on-PUT. * feat(s3/lifecycle): add s3_lifecycle.proto storage schema Defines the durable types backing the lifecycle worker: LifecycleState (per-rule mode + bootstrap_complete + degraded_reason incl. RETENTION_BELOW_HORIZON / LOST_LOG), PendingItem, EntryIdentity, BootstrapState, ReaderState (per-filer-shard cursors plus tail_drained_streams marker), BlockerRecord (rule_hash optional for pre-evaluation failures), RetryBudgetEntry with the four-shape StreamKey oneof, and RetryTarget (no action / no expected_identity — retry replays handler against current state). No callers; schema only. Wired into the pb Makefile. * feat(s3/lifecycle): add S3LifecycleParams to TaskParams Wires lifecycle subroutines into the existing worker dispatch. Subtype is READ (cluster-singleton meta-log reader), BOOTSTRAP (per-bucket walker), or DRAIN (per-rule pending). bucket / rule_hash populated for the latter two; ContinuationHint is an advisory resume point for kill-resumable BOOTSTRAP / READ. oneof tag = 14, after the existing ec_balance_params at 13. * fix(s3/lifecycle): noncurrent delete markers honor NoncurrentDays A non-current delete marker is just another version per AWS S3 spec and is eligible under NoncurrentVersionExpirationDays. The IsDeleteMarker special case is meant only for the current delete marker (sole-survivor ExpiredObjectDeleteMarker action), so guard that switch arm with IsLatest. Without IsLatest, the bootstrap walker silently skipped pre-existing non-current delete markers under a NoncurrentDays rule because ComputeDueAt returned zero. Mirrors the same fix in evaluate.go so the runtime decision matches. * fix(s3/lifecycle): preserve prefix trailing slash in RuleHash "logs" and "logs/" match different object sets under literal strings.HasPrefix semantics: "logs" matches "logsmore/x", "logs/" does not. Collapsing them in the hash would let an XML edit silently bind the new rule to the previous rule's durable state directory, causing stale bootstrap_complete and stale pending entries against a rule that now matches a different set. * fix(s3/lifecycle): add UNSPECIFIED sentinels to enum zero values Proto3 best practice: enum 0 should be an _UNSPECIFIED sentinel, not an active value. Persisted state schemas care most: a partially populated payload (or one written by an older binary that didn't set the field) would otherwise silently default to a semantically active value. - LifecycleState.RuleKind: shifts EXPIRATION_DAYS..EXPIRED_DELETE_MARKER from 0..5 to 1..6, with RULE_KIND_UNSPECIFIED at 0. - LifecycleState.RuleMode: shifts EVENT_DRIVEN..PENDING_BOOTSTRAP from 0..4 to 1..5. - LifecycleState.DegradedReason: replaces NONE=0 with DEGRADED_REASON_UNSPECIFIED=0 (operators treat both as healthy). - StreamKind: shifts ORIGINAL..PENDING from 0..3 to 1..4. - S3LifecycleParams.Subtype: shifts READ..DRAIN from 0..2 to 1..3, so an unset subtype no longer routes into the cluster-singleton READ task. No on-disk state has been written yet; renumbering is safe. * docs(s3/lifecycle): per-action state, not per-rule A single AWS lifecycle XML <Rule> can declare multiple actions in parallel (e.g. ExpirationDays=90 + AbortMPU=7 + NoncurrentDays=30). Each must drive its own delay/horizon/mode/pending stream independently. Modeling the rule as one compiled entry with one kind collapses these — picking the smallest delay (7d MPU) means the 90d expiration cursor advances past objects that aren't yet due, and the 90d action never re-fires. Restructure storage to per-action: every XML rule expands into N compiled actions; state lives at <bucket>/<rule_hash>/<action_kind>/. The intermediate rule_hash directory keeps a rule's actions grouped for operator listing. Each action has its own state file with its own mode + bootstrap_complete + degraded_reason; sibling actions of the same rule can degrade independently. * fix(s3/lifecycle): per-action proto schema + safety-scan counters Realigns the durable schema with the per-action storage model and the safety-scan contract in the design doc. - Promote LifecycleState.RuleKind to a top-level ActionKind enum and rename rule_kind -> action_kind. The same enum now also keys BootstrapKey, PendingKey, and BlockerRecord so per-action streams under one rule never collapse. - LifecycleState now keyed by (rule_hash, action_kind). Added last_safety_scan_ts_ns / next_safety_scan_ts_ns and the four observability counters the design specifies (evaluated_total, expired_total, metadata_only_total, error_total). Dropped last_evaluated_ns, deleted_total, skipped_object_lock_total, and pending_size — subsumed or out-of-band. - BlockerRecord.action_kind is OPTIONAL (UNSPECIFIED for pre-evaluation failures, just like rule_hash). No on-disk state has been written yet; the renumbering / rename is free. * fix(s3/lifecycle): per-action MinTriggerAge / EventLogHorizon helpers The retention mode gate and safety-scan cadence run on each compiled action independently. Taking a "min/max across all actions in the rule" — as the previous helpers did — was wrong: a 90d ExpirationDays sibling alongside a 7d AbortMPU action would cause the 90d cursor to advance at 7d (because MinTriggerAge picked the smallest), and the ExpirationDays action would never re-fire on objects that aged past the 7d sweep window before reaching 90d. Both helpers now take an ActionKind and return the threshold for that specific action only. Returns 0 if the rule does not declare the requested kind, which makes the gate a no-op and the cadence fall through to the kind floor. Also adds RuleActionKinds(rule), the canonical expansion that the engine uses at compile time to turn one XML rule into N compiled actions. NewerNoncurrentVersions paired with NoncurrentDays is subsumed into a single NONCURRENT_DAYS action (AWS-paired conditions); only when NewerNoncurrent stands alone does it become NEWER_NONCURRENT. * fix(s3/lifecycle): length-prefix RuleHash to remove delimiter ambiguity The previous encoding used "tag=K=V\n" lines, which is ambiguous: a tag (a=b, c) serializes identically to (a, b=c). A prefix containing "\nexp_days=99" could likewise forge an action field. Either could silently bind two semantically different rules to the same per-rule state directory. Switch to a length-prefixed canonical form: each scalar is written as <field-tag-byte> <uvarint-length> <bytes>. Field-tag bytes namespace each scalar so ("a=b" tag-key) and ("a" tag-key with "=b" leakage) can't collide. Three regression tests pin the resistance. * docs(s3/lifecycle): ActionKey is the engine identity, not rule_hash Finishes the per-action restructure across the engine pseudocode, bootstrap completion, drain locks, detector paths, policy CAS, and Phase 2 plan. Every per-action data structure — engine indexes, target modes, newly-completed sets, bootstrap completion bits, drain keys, locks, metrics, status — is now keyed by ActionKey{rule_hash, action_kind}, not by rule_hash alone. Without this, sibling actions under one XML rule still shared scheduling state in the engine: originalDelayGroups holding []ruleHash means a rule's 7d AbortMPU and 90d ExpirationDays collapse into one entry, the smaller delay wins for cursor advance, and the larger sibling never re-fires. Helper API tightened: EvaluateAction(rule, kind, info, now) and ComputeDueAt(rule, kind, info) replace the aggregate-rule signatures so a caller asks one specific action's eligibility against one entry, never "any action of this rule." Drain task lock includes action_kind so siblings have independent re-arm timers. Policy CAS moves from rule_hash to ActionKey membership. * fix(s3/lifecycle): kind-aware EvaluateAction / ComputeDueAt Replaces the aggregate-rule signatures with per-action ones: EvaluateAction(rule, kind, info, now) and ComputeDueAt(rule, kind, info). The old Evaluate(rule, info, now) could return the verdict of ANY action declared on the rule, which sat wrong with the per-action engine indexing (each ActionKey has its own delay group, mode, and pending stream). Each helper now decides exactly one (rule, kind) compiled action's fate against one entry. Asking for a kind the rule doesn't declare, or asking against the wrong object shape for that kind, returns ActionNone / zero — never silently routes to a sibling. Multi-action regression test: a rule with ExpirationDays=90 and AbortMPU=7 evaluates each kind independently for the same entry; the 7d window has no influence on the 90d eligibility decision.	2026-05-07 10:02:32 -07:00
Minsoo KimandGitHub	a1e5eb9dad	Fix UI prefix url encoding (#9344 ) * Fix filer UI navigation for URL-sensitive object prefixes * Fix filer UI navigation for URL-sensitive object prefixes * Clarify filer UI path escaping test name Rename the legacy filer UI path test to describe the actual behavior being checked. The printpath helper preserves timestamp characters that are valid in URL path components, while the PR fix is focused on query-string escaping for path and cursor parameters.	2026-05-06 19:14:36 -07:00
Chris LuandGitHub	487b93eb49	fix(volume): don't panic on read when needle map is nil (#9342 ) * fix(volume): don't panic on read when needle map is nil A failed CommitCompact reload (and #9335's new error path for a remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil on a volume that's still in the store. readNeedle / readNeedleDataInto dereferenced v.nm with no guard, so the next GET segfaulted the http handler instead of returning an error the client could retry on another replica. Add the same v.nm == nil check the other Volume accessors already use, including the slow-read inner loop where the lock is released between iterations and a failed reload can race in. Fixes #9339. * match rust nm-nil read behavior; trim comments seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound on the Go side too instead of a generic 500-mapped error so both volume servers respond identically when v.nm is nil. * log once when reads hit nil needle map ErrorNotFound alone hides the real cause: a half-loaded volume just returns 404s and the operator has nothing to grep for. Add a once-per- volume Errorf on the nil path, reset on successful load. Mirror the same in seaweed-volume via nm_or_not_found(). * trim comments * drop once-flag, log inline on every nil-nm read	2026-05-06 18:23:06 -07:00
Chris LuandGitHub	e96190d128	fix(mount): skip pressure-eviction of gappy page chunks (#9330 ) (#9334 ) * fix(mount): skip pressure-eviction of gappy page chunks (#9330) A page chunk whose written-interval list has an internal hole was being sealed under buffer-pressure eviction, then SaveContent would emit one volume chunk per maximal adjacent run with no chunk covering the hole; reads then silently zero-fill the gap (filer/stream.go:177-186). On a sequential cp through FUSE, that bakes in-flight 4 KiB writes into split volume chunks and leaves chunk-sized blocks of zeros on the destination. Filter the pressure-driven sealers (SaveDataAt's over-limit path, EvictOneWritableChunk, ProactiveFlush) to only seal chunks whose written intervals form one unbroken run. The flush-on-close path (FlushAll) is unchanged: at close every gap is by definition a sparse-file write that the app legitimately never made. * fix(mount): also gate IsContiguouslyWritten on leading zero-offset Tighten IsContiguouslyWritten to also reject empty lists and lists whose first interval does not start at offset 0. The internal-gap and leading-gap cases are symmetric for pressure-driven sealing: both put in-flight FUSE writeback for the missing range at risk of being baked into split volume chunks. The flush-on-close path is still unfiltered (sparse writes are sealed legitimately at FlushAll). Also align EvictOneWritableChunk's bestBytes initialization with SaveDataAt (start at 0) so an empty chunk is never picked, matching the new semantic. Addresses gemini-code-assist review on PR #9334. * fix(mount): preserve cap-pressure liveness in EvictOneWritableChunk The previous version of this fix had EvictOneWritableChunk return false whenever every dirty chunk was gappy. That broke the accountant's Reserve loop: cond.Wait only wakes on Release, Release only fires on upload completion, and refusing to seal anything means no upload starts — the writer hangs at the -writeBufferSizeMB cap forever. Two-pass selection: prefer the fullest gap-free chunk (issue #9330: this is what protects sequential cp from racing FUSE writeback), fall back to the oldest non-empty writer when nothing is gap-free. Oldest-first maximizes the chance that FUSE writeback for the gap range has already settled. The actual sealing path is unchanged — SaveContent still emits one volume chunk per maximal adjacent run; pages that arrive after the seal land in a fresh MemChunk for the same logicChunkIndex and are sealed in turn, so coverage is reconstructed at read time by readResolvedChunks. Sequential cp at default settings always hits the strict pass (writes arrive contiguous-from-0 within their logicChunkIndex), so the bug-fix behavior is preserved; the fallback only runs under genuinely sparse workloads or under FUSE writeback so backed up that no chunk has settled, where forced progress is preferable to a hung mount. * test(mount): pin ProactiveFlush gap-skip behavior (#9330) Sibling regression test for the ProactiveFlush guard added in this series: same 3-chunk setup as TestEvictOneWritableChunk_SkipsGappyChunks (internal gap, leading gap, contiguous). Verifies ProactiveFlush picks the contiguous chunk when staleness criteria are otherwise satisfied, returns false when only gappy chunks remain (no liveness fallback like EvictOneWritableChunk has — failing here is just a missed optimization), and that filling the holes lets the chunks auto-seal via maybeMoveToSealed. * style(mount): trim verbose comments on #9330 fix	2026-05-06 15:26:56 -07:00
Chris LuandGitHub	7b0b64db65	fix(admin/view): wrap plugin history URL with basePath (#9341 ) Plugin tabs/sub-tabs use history.pushState/replaceState to keep the URL bar in sync with the active view, but updateURL fed it the raw output of buildPluginURL ("/plugin/lanes/<lane>/..."). Under a urlPrefix deployment that strips the prefix, so reloading the page hit /plugin/... directly and 404'd at the proxy. Wrap with basePath() so the rewritten URL keeps the deployment prefix. Reported at #9240.	2026-05-06 15:25:06 -07:00
Chris Lu GitHubgemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	1c0e24f06a	fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335 ) * fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. * fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-05-06 15:19:43 -07:00
Chris LuandGitHub	12f283357f	fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333 ) * fix(iam): scope IAM-managed OIDC provider lookup by role account Two account-scoped OIDC records sharing an issuer were collapsed into a single map slot keyed only by the URL. The last-write-wins entry then served every AssumeRoleWithWebIdentity, so a token destined for account B's role could be validated by account A's record (its clientIDs and thumbprints), defeating the per-account isolation the records exist for. The role-account check in enforceProviderAccountScope still rejected the cross-account assumption, but only after the wrong record's audience and TLS pin had already accepted the token. Refresh now keys IAM-managed records as (issuer, account), and validation parses the requested role's account up front and matches the record under that issuer in this order: exact account, global (account-less), static-config fallback. An unknown account hint deliberately skips account-scoped entries — picking one arbitrarily is the bug this commit fixes — and falls through to global or static. * fix(iam): route public AssumeRoleWithWebIdentity through IAMManager handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity directly, bypassing the IAMManager wrapper. The wrapper is where enforceProviderAccountScope rejects cross-account assumption attempts and capDurationByRole clamps to the role's MaxSessionDuration; both silently became no-ops for any AWS-SDK caller hitting the public endpoint. Dispatch through the IAMManager (via the existing IAMManagerProvider interface that other handlers in this file already use) when one is wired. Embedded test setups without an IAM integration fall back to the bare STS service unchanged. * fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config initOIDCProviderStore mirrored only URL and ClientIDs. Once RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on boot once the metadata-subscribe loop kicked in), buildOIDCProviderFromRecord rebuilt the runtime provider from this truncated record. Because IAM-managed entries take precedence over the static-config map, the rebuild silently shadowed the bootstrap with a weaker provider: - Thumbprints: dropped, so TLS-pinned issuers fell back to the system trust store. - AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped reaching the session. - PolicyClaim: dropped, so claim-based policy mode stopped triggering. Pull all three from the provider's static Config map at mirror time so the stored record round-trips to a runtime provider equivalent to the one the static config produced directly. * fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler Phase 3b advertises that RoleArn MAY be omitted in claim-based policy mode — the STS service then derives the assumed-role ARN from the configured policy claim. The HTTP handler still rejected empty RoleArn up front with MissingParameter, so SDK callers using the documented omitted-role flow never reached the STS layer. Drop the pre-check; STS still validates that claim-based mode is configured and that the IDP emits policies, returning a precise error when either is missing. The existing error mapping below this point surfaces those as InvalidParameterValue, matching what an AWS SDK expects. * test(iam): update missing-RoleArn STS integration test for the new contract The previous commit drops the HTTP-layer RoleArn pre-check so claim-based mode can derive the ARN from a JWT claim. The integration test still asserted MissingParameter for the missing-RoleArn case, which now reaches the STS layer and surfaces a JWT-parse error instead. Update the assertion to match: missing RoleArn alone must no longer surface as MissingParameter, but a bogus JWT must still be rejected.	2026-05-05 19:14:44 -07:00
Chris LuandGitHub	9af1b212d3	feat(iam): OIDC provider audit trail (Phase 3e) (#9325 ) * feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323. * feat(iam): opt-in session revocation via JTI blocklist Add SessionRevocationStore (memory + filer implementations) and wire it into the IAMManager.IsActionAllowed path so a revoked session is rejected on the next signed request. Session JWTs now embed the session id as the JTI claim, giving the blocklist a stable key without requiring a second secret. Operators who don't configure a store keep the existing fully-stateless behaviour: every session stays valid until natural expiry. Operators who do configure one accept one filer lookup per signed request in exchange for being able to invalidate compromised tokens before expiry. Revocation entries carry the original session expiry so the blocklist self-trims via PurgeRevokedSessions. * fix(iam): hash JTI filenames + paginate Purge with proper EOF handling Three reviewer-flagged issues on the filer-backed revocation store: 1. Path traversal (security-medium): RevokeSession is exported and takes an arbitrary string. Using the JTI verbatim as a filename meant a caller could pass "../../etc/passwd" to write outside the basePath. SHA-1 hash the JTI to a fixed-width hex name; lookups still find the entry because Revoke and IsRevoked share the same hash function. 2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed the outer err and the loop just broke on any failure, so a mid-stream gRPC error returned (count, nil) and the caller had no idea the purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream and propagate everything else. 3. Purge had a hardcoded 10000-entry cap. Stream-paginate via StartFromFileName so the operator-cron can clean a backlog larger than that without losing rows. * feat(iam): OIDC provider audit trail Emit one structured event per IAM-managed OIDC provider lifecycle mutation (Create, Delete, Add/Remove ClientID, UpdateThumbprints, Tag, Untag). Three sinks ship in-tree: - GlogAuditSink — default; events become structured log lines. - MemoryAuditSink — in-process buffer for tests / inspection. - FilerAuditSink — durable, one file per event under /etc/iam/audit/oidc-providers (operator-overridable basePath). Audit emission is best-effort: a failing sink never blocks an IAM mutation that has already succeeded. Use-events (per token validation) are intentionally not emitted yet — too hot for unconditional sinks. * fix(iam): collision-free audit filenames + correct file mtime Two issues gemini flagged on FilerAuditSink.Emit: 1. Filename was %d-%s.json (UnixNano + Type). Two mutations at the same nano against different ARNs (think batch script touching several providers) collided on the same path and the second CreateEntry failed silently. Append a short ARN hash to make the name unique per-event without leaking the ARN. 2. File Mtime/Crtime were set to time.Now() at write time. For audit integrity those should reflect the event's occurrence time so filer-level "ls -lt" output matches the contents. Addresses three medium-priority gemini reviews on PR #9325.	2026-05-05 13:37:24 -07:00
Chris LuandGitHub	9d6a699b94	feat(iam): opt-in session revocation via JTI blocklist (Phase 3d) (#9324 ) * feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323. * feat(iam): opt-in session revocation via JTI blocklist Add SessionRevocationStore (memory + filer implementations) and wire it into the IAMManager.IsActionAllowed path so a revoked session is rejected on the next signed request. Session JWTs now embed the session id as the JTI claim, giving the blocklist a stable key without requiring a second secret. Operators who don't configure a store keep the existing fully-stateless behaviour: every session stays valid until natural expiry. Operators who do configure one accept one filer lookup per signed request in exchange for being able to invalidate compromised tokens before expiry. Revocation entries carry the original session expiry so the blocklist self-trims via PurgeRevokedSessions. * fix(iam): hash JTI filenames + paginate Purge with proper EOF handling Three reviewer-flagged issues on the filer-backed revocation store: 1. Path traversal (security-medium): RevokeSession is exported and takes an arbitrary string. Using the JTI verbatim as a filename meant a caller could pass "../../etc/passwd" to write outside the basePath. SHA-1 hash the JTI to a fixed-width hex name; lookups still find the entry because Revoke and IsRevoked share the same hash function. 2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed the outer err and the loop just broke on any failure, so a mid-stream gRPC error returned (count, nil) and the caller had no idea the purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream and propagate everything else. 3. Purge had a hardcoded 10000-entry cap. Stream-paginate via StartFromFileName so the operator-cron can clean a backlog larger than that without losing rows.	2026-05-05 13:25:22 -07:00
Chris LuandGitHub	6483583491	feat(iam): account-scoped OIDC providers (Phase 3c) (#9323 ) * feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323.	2026-05-05 13:06:53 -07:00
Chris LuandGitHub	bc1d458fe6	fix(iam): reject empty issuer in ComputeParentUser (#9326 ) Without iss, the same `sub` from two different IDPs would collapse to the same parent_user hash. Short-circuit to empty when either input is missing so callers see "no identity" instead of a colliding hash. Addresses coderabbit review on PR #9318.	2026-05-05 13:01:33 -07:00
Chris LuandGitHub	1d3454ca5c	feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity (Phase 3b) (#9322 ) * feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322.	2026-05-05 12:21:55 -07:00
Chris LuandGitHub	6554ab7928	feat(iam): principal session tags from OIDC tokens (Phase 3a) (#9321 ) * feat(iam): principal session tags from OIDC tokens Extract the AWS principal-tags namespace claim (`https://aws.amazon.com/tags/principal_tags`) from validated OIDC tokens, filter through a per-provider AllowedPrincipalTagKeys allowlist, and surface as `aws:PrincipalTag/<key>` in the STS session request context. Empty allowlist means "no tags surfaced" — operators must opt keys in explicitly so a misconfigured IDP can't pollute policy evaluation. Policy engine now accepts `aws:PrincipalTag/...` and `aws:RequestTag/...` as substitutable variables so resource-level ABAC policies can reference them. * fix(iam): case-insensitive principal-tag allowlist + sharper comment filterPrincipalTags compared keys case-sensitively, but AWS IAM session tag keys are case-insensitive (the docs are explicit). An IDP whose claim casing drifts from the operator-configured allowlist string would silently filter the value out — surprising failure mode. Lowercase both sides during the lookup; the original key casing is preserved on the output so policy variables still match what the caller sees. Also reword the "anything the IDP signs is acceptable" comment in sts_service.go: it predates the per-provider allowlist that's already filtering before this point. The reality is now "everything reaching here is on the operator's opt-in list, dropped entirely if the allowlist is empty." Addresses two gemini medium reviews on PR #9321.	2026-05-05 11:43:05 -07:00
Chris LuandGitHub	f8973b3ed6	feat(iam): OIDC provider mutations + multi-client + TLS thumbprints (Phase 2b) (#9320 ) * feat(iam): OIDC provider mutations + multi-client + TLS thumbprints - Mutating IAM actions: CreateOpenIDConnectProvider, DeleteOpenIDConnectProvider, AddClientIDToOpenIDConnectProvider, RemoveClientIDFromOpenIDConnectProvider, UpdateOpenIDConnectProviderThumbprint, TagOpenIDConnectProvider, UntagOpenIDConnectProvider. Each enforces AWS-shape input bounds and the read-only mode rejects all mutations. - Multiple client_ids per provider in OIDCConfig (clientIds list, plural) with full backward compatibility — singular clientId still works and is merged into the audience allowlist. Provider factory accepts both. - AWS-compatible TLS thumbprint pinning: when OIDCConfig.Thumbprints is non-empty, JWKS fetches enforce that the negotiated TLS chain contains a certificate whose SHA-1 hex matches the allowlist. Empty list keeps the existing system-trust path. * fix(iam): factor Tags.member.N parser into a helper CreateOpenIDConnectProvider and TagOpenIDConnectProvider were both walking the AWS Tags.member.N.Key / Tags.member.N.Value query-string convention with copy-pasted loops. Factor into extractTags so the parsing rules and the "no tags present" semantics live in one place. Addresses gemini medium review on PR #9320. * fix(iam): sentinel errors for OIDC provider not-found / already-exists The s3api dispatcher was using strings.Contains(err.Error(), "not found") and "already exists" to map IAM-manager errors back to AWS error codes. Substring matching on a formatted message couples the API error code to the exact wording of the upstream message — touching the message silently changes the IAM API contract. Define ErrOIDCProviderNotFound and ErrOIDCProviderAlreadyExists in the integration package, fmt.Errorf("%w: ...") them at the four return sites in iam_manager.go and oidc_provider_store.go, and use errors.Is at the s3api call sites. Same control flow, no string-match fragility. Addresses gemini medium review on PR #9320. * fix(iam): surface non-NotFound errors from CreateOIDCProvider lookup Previously CreateOIDCProvider only treated GetProviderByARN's success path as "exists" and silently fell through on any error, including transient backend failures. That hid real problems and still attempted a write. Distinguish ErrOIDCProviderNotFound (the only "safe to create" case) from other errors so we don't mask filer outages or partition issues. * fix(iam): enforce 100-client-ID cap on AddClientIDToOIDCProvider CreateOIDCProvider and the implicit update path through validateOIDC- ProviderRecord both reject lists with more than 100 client IDs, but AddClientIDToOIDCProvider could grow the list past that bound one ID at a time. Refuse the add when the list is already at the cap so the invariant holds across every mutation entry point. * feat(iam): IAM-managed OIDC provider live view in STS service Add a separate, mutex-guarded map of admin-managed OIDC providers on the STS service. The map can be atomically replaced via SetIAMManagedOIDCProvidersByIssuer; AssumeRoleWithWebIdentity lookups consult it first and fall back to the existing static-config map, so records persisted through the IAM API can shadow bootstrap entries without a restart. This is the runtime hook the IAM API and the metadata-subscribe path will both call when the OIDCProviderStore changes (next two commits). * feat(iam): refresh STS service runtime view after OIDC mutations Add IAMManager.RefreshOIDCProvidersFromStore: lists every persisted OIDCProviderRecord, builds a runtime OIDCProvider for each, and atomically publishes the issuer-keyed map into the STS service. Each mutating IAM API call (Create / Delete / AddClientID / RemoveClientID / UpdateThumbprints) now triggers this refresh inline so the local instance picks up the change without waiting for a metadata-subscribe round trip. Tag mutations skip the refresh because tags do not affect token validation. Refresh failures only log; the persisted write has already succeeded by that point, so a transient list error must not surface to the API caller. The peer-instance update path (filer metadata subscription) is added in a follow-up commit. * feat(iam): subscribe to OIDC provider changes on the filer Watch /etc/iam/oidc-providers under the existing s3 metadata-subscribe loop and call RefreshOIDCProvidersFromStore on any create / update / delete / rename. This is the cross-instance update path: S3 server A writes via the IAM API, the filer fans out the metadata change, and S3 servers B..N pick up the new runtime view without a restart. Mirrors the existing onIamConfigChange / onCircuitBreakerConfigChange pattern. The handler short-circuits when the path is unrelated, and when no IAMManager is wired in (static-only configurations).	2026-05-05 11:26:08 -07:00
Chris LuandGitHub	6141222ab0	fix(test/s3/policy): allocate fresh admin port per subtest (#9332 ) * fix(test/s3/policy): allocate fresh admin port per subtest startMiniCluster ran weed mini in-process and explicitly assigned master/volume/filer/s3 ports allocated by MustAllocatePorts, but it left -admin.port and -admin.port.grpc unset, so each subtest reused the hardcoded defaults 23646 / 33646. The package's subtests run sequentially within the same go test process. The previous subtest's admin goroutine is still bound to 23646 by the time the next subtest spins up its own mini, so the new admin can never bind, mini.go's waitForAdminServerReady hits its 240-attempt cap, and glog.Fatalf kills the test binary. This has been the dominant cause of "admin server did not become ready" flakes across recent IAM PRs. Allocate two extra ports for admin and pass them through. The other subprocess-based tests (s3tables/) are not affected because each launches weed mini in a fresh OS process. fix(mini): make admin readiness wait context-aware waitForAdminServerReady polled for 240 attempts × 500ms regardless of whether the surrounding mini context was cancelled. When mini is run in-process from a test harness (test/s3/policy/...) and the test calls its cancel func, the leftover wait keeps spinning for the full two minutes and then glog.Fatalf's, terminating the entire test binary — including any sibling subtest that has since started its own mini. Thread the existing miniClientsCtx through the wait so a Stop / cancel returns context.Canceled immediately. The caller (startMiniAdminWithWorker) treats a context-cancelled outcome as a graceful shutdown signal and logs+returns instead of fataling.	2026-05-05 11:24:43 -07:00
Chris LuandGitHub	95560076e6	fix(mini): raise admin readiness timeout to 2 minutes (#9329 ) The 30-second ceiling on waitForAdminServerReady was too tight on busy CI runners. master + filer + volume + admin all start in parallel on a shared worker, and S3 Policy Shell Integration Tests has been flaking across multiple PRs with "admin server did not become ready... after 60 attempts" even though the server still comes up within a minute or two. Two minutes (240 attempts at 500ms) leaves headroom for runner contention without being absurd in a local-dev run.	2026-05-05 07:59:25 -07:00
Chris Lu	22ebe9feb0	ci(e2e): switch FUSE Mount build to Azure Ubuntu mirror, persist buildx cache archive.ubuntu.com from GitHub-hosted runners has been Ign:/retrying for ~60s per package, eating the Start SeaweedFS step's 10-min budget before apt-get install finishes. The host already uses azure.archive.ubuntu.com; do the same inside Dockerfile.e2e and drop the Retries=5 amplifier. Also rotate /tmp/.buildx-cache-new over /tmp/.buildx-cache so the apt layer actually survives across runs, and bump the step to 15 minutes as a safety margin.	2026-05-05 00:22:59 -07:00
Chris LuandGitHub	4ded97a321	feat(iam): OIDC provider store + read-only IAM API (Phase 2a) (#9319 ) * feat(iam): STS web-identity AWS-fidelity polish - OIDC discovery via .well-known/openid-configuration; falls back to /.well-known/jwks.json when discovery is absent. Reject discovery docs whose issuer claim does not match the configured issuer to defend against issuer-substitution. - ComputeParentUser derives a stable per-identity hash from (sub, iss). Surface as aws:userid in the request context and as a parent_user claim in the session JWT so per-user state survives token rotation. - Per-role MaxSessionDuration (3600..43200) clamps requested DurationSeconds before the STS service applies its own caps. - Tighten RoleSessionName to the AWS contract: 2..64 chars from [\w+=,.@-]. - Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity / AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte inline session policy budget. * fix(iam): leave omitted DurationSeconds nil so STS default applies capDurationByRole was substituting the role's MaxSessionDuration when the caller omitted DurationSeconds entirely. AWS returns the configured default (typically 1 hour) in that case, not the role's upper bound — a 12h MaxSessionDuration shouldn't silently make every no-duration assume-role mint a 12h session. Return nil when requested is nil; let the downstream calculateSessionDuration in the STS service apply its TokenDuration default. The role-max upper bound still clamps when the request arrives with a concrete value above the cap. Addresses gemini high-priority review on PR #9318. * fix(iam): synchronize OIDCProvider JWKS cache fields jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are mutated lazily on the first token-validate call and refreshed afterwards on TTL expiry. Multiple S3 requests can land here in parallel, so the writes were racing against subsequent reads on every other goroutine. resolvedJWKSUri/discoveryFailed inherited the same un-protected pattern when discovery shipped. Add sync.RWMutex; getPublicKey takes the read lock for the common cache-hit path and promotes to the write lock for misses + refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the write lock is held by the caller; fetchJWKS keeps the test-friendly entry point that acquires the lock itself. Addresses gemini high-priority review on PR #9318. * fix(iam): trim trailing slash + retry discovery after transient failure Two OIDC discovery edge cases reviewers flagged: 1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri trims them when building the discovery URL, but the doc.Issuer ↔ p.config.Issuer check did not, so an IDP whose issuer claim drops or adds the slash relative to the configured value would be falsely rejected. Trim a single trailing slash on each side before comparing. 2. discoveryFailed flipped to true on any error and stayed there for the process lifetime. A transient 5xx at startup permanently locked the provider into the /.well-known/jwks.json fallback. Reset the flag at the top of fetchJWKSLocked when no URI has been cached yet, so each JWKS refresh (typically once per TTL = 1h) reattempts discovery. Successful discovery remains cached via resolvedJWKSUri so we don't pay the discovery RTT on every refresh. Addresses gemini security-medium + medium reviews on PR #9318. * fix(iam): require non-empty issuer in OIDC discovery doc The previous "doc.Issuer != "" && ..." guard let a discovery document that omitted the issuer field bypass the issuer-mismatch check entirely, letting the doc steer fetchJWKS at any URL it provided. OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a hard failure same as mismatched. Trailing-slash equivalence still applies. Adds TestDiscoveryRejectsMissingIssuer alongside the existing TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer toggle on fakeIDP. * feat(iam): OIDC provider store + read-only IAM API Add OIDCProviderRecord — the persisted, IAM-managed view of an OIDC identity provider — and an OIDCProviderStore interface with memory and filer implementations mirroring the existing role-store pattern. The store is hydrated at boot from the static STS.Providers list so the new IAM API surfaces the same set the STS service already validates against. Two read-only actions land now: - ListOpenIDConnectProviders -> ARN-only list, AWS-shape XML. - GetOpenIDConnectProvider -> URL, ClientIDList, ThumbprintList, Tags, CreateDate. Mutations (Create/Delete/Add-Remove ClientID/Update Thumbprint), multiple client_ids per provider, and TLS thumbprint pinning come in Phase 2b. * fix(iam): preserve CreatedAt across boots + paginate ListProviders Two medium-priority issues gemini flagged on the read-only IAM API: 1. The static-config bootstrap was setting CreatedAt = time.Now() on every server start, so the IAM GetOpenIDConnectProvider response's CreateDate shifted on each restart even when backed by a persistent store. Look up the existing record via GetProviderByARN first and preserve its CreatedAt; only the UpdatedAt advances. 2. FilerOIDCProviderStore.ListProviders had a hardcoded Limit: 1000 that silently truncated above that. Stream-paginate via StartFromFileName, returning io.EOF naturally and surfacing all other errors instead of swallowing them. Addresses two gemini medium reviews on PR #9319.	2026-05-04 22:15:03 -07:00
Chris LuandGitHub	d951a8df5a	feat(iam): STS web-identity AWS-fidelity polish (Phase 1) (#9318 ) * feat(iam): STS web-identity AWS-fidelity polish - OIDC discovery via .well-known/openid-configuration; falls back to /.well-known/jwks.json when discovery is absent. Reject discovery docs whose issuer claim does not match the configured issuer to defend against issuer-substitution. - ComputeParentUser derives a stable per-identity hash from (sub, iss). Surface as aws:userid in the request context and as a parent_user claim in the session JWT so per-user state survives token rotation. - Per-role MaxSessionDuration (3600..43200) clamps requested DurationSeconds before the STS service applies its own caps. - Tighten RoleSessionName to the AWS contract: 2..64 chars from [\w+=,.@-]. - Populate PackedPolicySize in AssumeRole / AssumeRoleWithWebIdentity / AssumeRoleWithLDAPIdentity responses as a percentage of the 2048-byte inline session policy budget. * fix(iam): leave omitted DurationSeconds nil so STS default applies capDurationByRole was substituting the role's MaxSessionDuration when the caller omitted DurationSeconds entirely. AWS returns the configured default (typically 1 hour) in that case, not the role's upper bound — a 12h MaxSessionDuration shouldn't silently make every no-duration assume-role mint a 12h session. Return nil when requested is nil; let the downstream calculateSessionDuration in the STS service apply its TokenDuration default. The role-max upper bound still clamps when the request arrives with a concrete value above the cap. Addresses gemini high-priority review on PR #9318. * fix(iam): synchronize OIDCProvider JWKS cache fields jwksCache, jwksFetchedAt, resolvedJWKSUri, and discoveryFailed are mutated lazily on the first token-validate call and refreshed afterwards on TTL expiry. Multiple S3 requests can land here in parallel, so the writes were racing against subsequent reads on every other goroutine. resolvedJWKSUri/discoveryFailed inherited the same un-protected pattern when discovery shipped. Add sync.RWMutex; getPublicKey takes the read lock for the common cache-hit path and promotes to the write lock for misses + refreshes. fetchJWKSLocked / resolveJWKSUriLocked assume the write lock is held by the caller; fetchJWKS keeps the test-friendly entry point that acquires the lock itself. Addresses gemini high-priority review on PR #9318. * fix(iam): trim trailing slash + retry discovery after transient failure Two OIDC discovery edge cases reviewers flagged: 1. Issuer comparison was sensitive to trailing slashes. resolveJWKSUri trims them when building the discovery URL, but the doc.Issuer ↔ p.config.Issuer check did not, so an IDP whose issuer claim drops or adds the slash relative to the configured value would be falsely rejected. Trim a single trailing slash on each side before comparing. 2. discoveryFailed flipped to true on any error and stayed there for the process lifetime. A transient 5xx at startup permanently locked the provider into the /.well-known/jwks.json fallback. Reset the flag at the top of fetchJWKSLocked when no URI has been cached yet, so each JWKS refresh (typically once per TTL = 1h) reattempts discovery. Successful discovery remains cached via resolvedJWKSUri so we don't pay the discovery RTT on every refresh. Addresses gemini security-medium + medium reviews on PR #9318. * fix(iam): require non-empty issuer in OIDC discovery doc The previous "doc.Issuer != "" && ..." guard let a discovery document that omitted the issuer field bypass the issuer-mismatch check entirely, letting the doc steer fetchJWKS at any URL it provided. OIDC Discovery 1.0 §3 mandates the issuer field; treat missing as a hard failure same as mismatched. Trailing-slash equivalence still applies. Adds TestDiscoveryRejectsMissingIssuer alongside the existing TestDiscoveryRejectsIssuerMismatch via a new omitDiscoveryIssuer toggle on fakeIDP.	2026-05-04 22:10:49 -07:00
Chris LuandGitHub	2417ba0354	fix(volume): add authentication to destructive gRPC admin endpoints (#8876 ) * fix(volume): add authentication to destructive gRPC admin endpoints Three destructive VolumeServer gRPC endpoints (DeleteCollection, VolumeDelete, VolumeServerLeave) had no authentication checks, unlike their HTTP counterparts which are protected by the Guard whitelist. Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth helper on VolumeServer that extracts the peer IP from gRPC context and validates it against the guard whitelist. Gate all three endpoints behind this check. * fix(volume): tolerate unparseable gRPC peer address in admin auth check S3 Filer Group integration tests were failing with PermissionDenied "bad peer address: address @: missing port in address" when DeleteCollection ran across the in-process gRPC connection between filer and volume server — the peer addr surfaces as "@" there and net.SplitHostPort can't parse it. The check rejected before IsWhiteListed could exercise its allow-all path for empty-whitelist deployments. Hand the raw peer string to IsWhiteListed when SplitHostPort fails. With no whitelist configured (the test environment's mode) it accepts; with a whitelist configured the unparseable host won't match anything and the call still gets denied as it should. Adds three regression tests for IsWhiteListed pinning the empty-config allow-all, populated-list reject-unknown, and signing-key-only allow- all branches that the gRPC admin helper relies on. * refactor(security): dedup checkWhiteList through IsWhiteListed The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the same lookup logic in two places; future drift was just a matter of time. Have checkWhiteList delegate so the membership semantics live in exactly one function. Behaviour is unchanged: the new path still returns nil for isEmptyWhiteList (signing-key-only mode) and still rejects unknown hosts when a whitelist is configured. Addresses gemini medium review on PR #8876. * fix(volume): protect remaining state-altering gRPC admin endpoints DeleteCollection, VolumeDelete, and VolumeServerLeave were the truly-destructive endpoints, but AllocateVolume, VolumeMount, VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and VolumeMarkWritable also modify server state and should sit behind the same whitelist gate. Read-only endpoints (VolumeStatus, VolumeServerStatus, VolumeNeedleStatus, Ping) stay open. The check is a no-op when no whitelist is configured (the default), so existing deployments keep working; operators who lock down their volume servers via guard.white_list now get consistent coverage. Addresses gemini security-high review on PR #8876. * fix(volume): typed peer addr + audit log for gRPC admin auth Prefer a typed net.TCPAddr when extracting the peer IP — string parsing was already a fallback for the in-process case but using the typed form first is cleaner and skips an unnecessary parse on the common path. Log failed authorization attempts at V(0) so an operator running with a whitelist sees the host that was rejected (and the raw remote address in case the IP lookup itself was the failure mode), matching what the HTTP Guard already does. Addresses gemini medium review on PR #8876. fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints Five more master/admin-driven destructive operations live outside volume_grpc_admin.go and were missing the same whitelist gate: - VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup - ScrubVolume - VolumeEcShardsDelete VacuumVolumeCheck stays open (read-only). BatchDelete also stays open: it's the data-plane multi-object delete called from the S3 API and filer, not an admin operation; gating it would break ordinary S3 DeleteObjects calls. Addresses gemini security-high review on PR #8876. * fix(volume): simplify no-peer-info branch in gRPC admin auth The IsWhiteListed("") fallback was defending against a scenario that doesn't actually arise — real gRPC connections always populate peer info. Drop the branch and just deny when peer info is missing, which is the safer default and matches "if we don't know who the caller is, refuse". * fix(volume-rust): mirror gRPC admin auth on the rust volume server The rust volume server has the same set of destructive admin endpoints as the Go side and the same Guard infrastructure, but nothing was wired together — every endpoint accepted unauthenticated calls regardless of guard configuration. Same vulnerability class the Go fix on this PR closes; this commit closes it on the rust side too so the two stacks stay aligned. Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer SocketAddr off the tonic Request and runs Guard::check_whitelist on its IP, then applies the helper to the same set the Go side covers: DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount, VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable, VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume, VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete stays open as a data-plane multi-object delete.	2026-05-04 21:14:55 -07:00
Chris LuandGitHub	a769c938ec	test(s3tables): Unity Catalog OSS integration tests against SeaweedFS (#9308 ) * test(s3tables): add Unity Catalog OSS integration test against SeaweedFS Mirrors the configuration used by the upstream playground at data-engineering-helpers/mds-in-a-box/unitycatalog-playground. Three test variants under test/s3tables/unity_catalog: - TestUnityCatalogDeltaIntegration: aws.masterRoleArn empty / static keys; catalog/schema/EXTERNAL Delta CRUD + temporary-table-credentials S3 round-trip (the playground's working configuration). - TestUnityCatalogMasterRoleIntegration: aws.masterRoleArn set to a SeaweedFS-side role with a permissive trust policy; UC's StsClient is pinned at SeaweedFS via AWS_ENDPOINT_URL_STS, and the test asserts the vended creds carry a session_token and a non-static access key, proving the role-vended path the playground notes as not-yet-working actually does work today. - TestUnityCatalogDeltaRsRoundTrip: writes/reads a real Delta table at the registered storage_location using delta-rs in a slim Python container, with temporary credentials fetched from UC. All three self-skip without Docker or a weed binary, matching the sibling lakekeeper / polaris tests. * test(s3tables): tighten Unity Catalog tests against actual UC OSS behavior After running the suite locally, ground the assertions in what the upstream UC OSS Docker image actually does against SeaweedFS today. - Static-key playground configuration (TestUnityCatalogDeltaIntegration): catalog/schema/EXTERNAL Delta CRUD pass against the SeaweedFS-backed warehouse. The temporary-table- credentials subtest is renamed and inverted to assert the failure mode the playground reports -- UC's AwsCredentialVendor falls through to an internal StsClient.assumeRole when masterRoleArn and sessionToken are both empty, which has no real STS to talk to. Bucket path is also fixed to match UC's getStorageBase() lookup (s3://lakehouse vs the playground's s3://lakehouse/warehouse, which the upstream code never matches). - Master-role variant (TestUnityCatalogMasterRoleIntegration): split into two passing slices. Slice 1 proves SeaweedFS' STS endpoint vending UnityCatalogVendedRole works via the Go AWS SDK and the vended creds round-trip on S3. Slice 2 boots UC with aws.masterRoleArn set and verifies catalog/schema/Delta CRUD. The third hop -- UC's Java StsClient actually reaching SeaweedFS' STS handler during /temporary-table-credentials -- is logged but not asserted, since the AWS Java SDK's STS request currently lands on a SeaweedFS S3 path rather than the STS handler. - Delta-RS round-trip (TestUnityCatalogDeltaRsRoundTrip): gated on UC_DELTA_RS_RUN=1 since it depends on the master-role STS handoff above. The Dockerfile / writer script stay in tree so the test runs end-to-end the moment that hop is fixed. README rewritten to be explicit about what each test validates today and what is still pending. Result: `go test -run TestUnityCatalog ./test/s3tables/unity_catalog/...` passes cleanly with weed + Docker available, and self-skips otherwise. * test(s3tables): exercise unity catalog integrations * ci: run Unity Catalog integration tests on PRs Adds a unity-catalog-integration-tests job to s3-tables-tests.yml, modeled on the existing lakekeeper / dremio jobs. Pre-pulls the UC image and python:3.11-slim (used by the delta-rs writer container) and runs `go test ./test/s3tables/unity_catalog`. Format-check and go-vet jobs already recurse into ./test/s3tables/... so the new package is covered there too. * test/ci: address PR review Tighten the UC readiness probe to require 200, not <500, so a 401/403/404 during startup surfaces immediately instead of being treated as ready (CodeRabbit). Pin the UC image to v0.4.0 in both the workflow and the test default, matching the pinned-tag convention the rest of s3-tables-tests.yml uses (CodeRabbit). Use UC_IMAGE=unitycatalog/unitycatalog:main to re-test against current upstream. * docs: separate UC static-key vs master-role failure modes The README mixed the two together. Static-key empty-sessionToken short-circuits with "S3 bucket configuration not found." before UC even fires an STS call; the AccessDenied I described is what happens in the master-role variant where UC's Java StsClient actually reaches SeaweedFS. Cross-link the playground PR that fixes the static-key vending side. Also drop the "what most playground users actually run" hand-wave under MANAGED tables. * docs: trim README Drop the playground cross-reference and the "two layers fail independently" framing. * docs: pin down what's actually pending Investigated the master-role STS handoff with a sniffer in front of SeaweedFS' STS port. UC's StsClient is constructed without an endpointOverride and never reads aws.endpoint or AWS_ENDPOINT_URL_STS; verified by pointing AWS_ENDPOINT_URL_STS at port 1 and seeing the same real-AWS InvalidClientTokenId 403 with zero traffic to SeaweedFS. The fix is upstream in UC. Updated the README and the master-role test's t.Logf to say so precisely, and dropped the stale "Spark client" bullet (delta-rs covers that path). * test(s3tables): use BaseEndpoint instead of deprecated resolver EndpointResolverWithOptions is deprecated in aws-sdk-go-v2; the supported way to override a service endpoint is via the per-service Options.BaseEndpoint. Switch the assume-role helper to that pattern so the test stops compiling against deprecated API and the resolver boilerplate disappears. Addresses gemini review on PR #9308. * test(s3tables): drop unused splitS3URI helper Helper had no callers; gemini caught it on PR #9308. Easy to bring back from git history if needed. * test(s3tables): extract last token of docker run output as container ID docker run -d may prefix the container ID with image-pull progress when the image isn't cached locally. strings.TrimSpace on the whole output then gave a multi-line string, not the ID. Take the last whitespace-separated token so the ID survives a fresh CI runner. Addresses gemini review on PR #9308. * test(s3tables): cap Unity Catalog response body reads at 10 MiB io.ReadAll without a limit could OOM the test runner if the UC container hands back an unexpectedly large body. 10 MiB is well above any well-formed catalog response and turns a misbehaving server into a test failure instead of a runner crash. Addresses gemini review on PR #9308. * docs: link UC fix PR and call out UC's mocked-Sts test pattern UC's own credential-vending tests substitute StsClient with an in-process EchoAwsStsClient (BaseCRUDTestWithMockCredentials) or Mockito.mockStatic (CloudCredentialVendorTest), so the wire path between UC's Java SDK and a real STS server is untested -- which is why the missing endpointOverride slipped through upstream. Linked the upstream fix at unitycatalog/unitycatalog#1532.	2026-05-04 21:14:22 -07:00
Chris Lu	6d95a5592a	test(s3/policy): stop racing t.TempDir cleanup against mini shutdown The mini cluster's admin/plugin worker keeps creating files under admin/plugin/job_types/ for ~1s after subtests finish, while the previous Stop() only cancelled an unobserved context and slept 500ms. t.TempDir()'s registered RemoveAll then raced the worker and intermittently failed with "directory not empty" (CI run 25352039081). Manage the data dir manually so it is removed only after the mini goroutine has exited, and wire MiniClusterCtx so cancel actually drains master/volume/filer/admin/s3/webdav.	2026-05-04 20:16:53 -07:00
Chris LuandGitHub	0a91b57f16	fix(s3): encrypt SSE-S3 KEK at rest with AES-GCM wrapping (#8880 ) * fix(s3): encrypt SSE-S3 KEK at rest using passphrase-derived wrapping key * fix(s3): surface KEK migration failures instead of silently dropping them The legacy-plaintext -> encrypted-at-rest path used to swallow both wrapKEK and updateKEKContent errors. An operator who configured a passphrase but had a filer permission issue (or a wrap failure) saw nothing in the logs and the KEK stayed on disk in plaintext, with the migration retried on every restart and silently failing every time. Log each failure path explicitly so the unmigrated state is visible. The server still starts with the in-memory key loaded — refusing to boot here would be worse than the warning. Addresses gemini and coderabbit reviews on PR #8880. * fix(s3): use a per-installation random salt for KEK wrapping HKDF The original implementation hardcoded "seaweedfs-sse-s3-kek-wrapping-v1" as the HKDF salt for the KEK wrapping key. Two SeaweedFS installations using the same passphrase therefore produced byte-identical wrapping keys, opening a precomputation/rainbow-table angle against weaker passphrases. Generate a random 32-byte salt every time wrapKEK runs and embed it in the on-disk payload alongside the AES-GCM ciphertext. The new format is base64(magic("SWv2") \|\| salt \|\| nonce \|\| ciphertext+tag); unwrapKEK detects the magic and reads the salt out of the payload. KEKs wrapped under the legacy fixed-salt format still unwrap cleanly and are opportunistically re-wrapped into v2 on the next load so operators get the stronger format without manual migration. Addresses gemini review on PR #8880. * fix(s3): plumb KEK passphrase from env into the global key manager InitializeGlobalSSES3KeyManager used to ignore the kekPassphrase field entirely — the global manager was always constructed with the empty constructor, so the encrypted-at-rest code path never engaged in production. Read the passphrase from WEED_S3_SSE_KEK_PASSPHRASE and apply it before InitializeWithFiler so the load path picks up the encrypted format. Log a warning when the env var is unset to make the plaintext fallback visible to operators upgrading from earlier builds. Adds SetKEKPassphrase as the public seam used by the global init and by tests, plus regression tests for the wrap/unwrap round-trip, random-salt independence across managers sharing a passphrase, and the no-passphrase fallback that preserves the legacy hex-decode path. Addresses coderabbit review on PR #8880. * fix(s3): drop redundant base64 decode in KEK migration check unwrapKEK already does the base64 decode and the magic-prefix check; isV2WrappedKEK was repeating both passes purely so the migration branch in loadSuperKeyFromFiler could ask "was this v1 or v2". Have unwrapKEK return the version flag directly and delete the redundant helper. Single decode pass per load. Addresses gemini review on PR #8880. * fix(s3): updateKEKContent must overwrite, not create filer_pb.MkFile maps to CreateEntry, which is O_EXCL: it fails with ErrEntryAlreadyExists when the file already exists. Both KEK migration paths (legacy v1→v2 rewrap and plaintext→encrypted) call updateKEKContent against the entry they just read, so MkFile errored out every time and the migrations only ran in memory while the on-disk KEK stayed in its old format. The previous commit logged the failure loudly but the result was the same: a pre-existing deployment never got migrated. Switch updateKEKContent to LookupDirectoryEntry + UpdateEntry so the overwrite actually persists. Surface the lookup/update errors so the caller's existing "migration failed; KEK still on disk in old form" warnings fire on the right cases. Addresses coderabbit critical review on PR #8880. * chore(s3): drop unused generateAndSaveSuperKeyToFiler Master removed this as dead code in #8913 and I reintroduced it during the merge resolution thinking the migration paths still needed it. On second look it has no callers in this branch — every KEK-creation path on PR #8880 goes through the existing reader code that handles "file not present, generate" inline. Drop the duplicate. Addresses gemini medium review on PR #8880. * fix(s3): updateKEKContent honours km.kekPath instead of hardcoded path The migration write path was always pointing at /etc/s3/sse_kek even when the manager was configured with an operator-overridden kekPath. Split km.kekPath at the last "/" so the lookup + UpdateEntry land on the same file the read path used. Defaults match defaultKEKPath when kekPath is unset. Addresses gemini medium review on PR #8880. * fix(s3): KEK passphrase via Viper key, with env-var fallback The KEK passphrase was read straight from os.Getenv, but every other SSE-S3 secret (s3.sse.kek, s3.sse.key) goes through Viper so an operator can set them in security.toml or via WEED_ env vars interchangeably. Add s3.sse.kek.passphrase to the same path; the existing SSES3KEKPassphraseEnv lookup stays as a fallback so deployments wired before this commit keep working. Addresses gemini medium review on PR #8880.	2026-05-04 19:21:41 -07:00
Chris LuandGitHub	e1d5e3899f	fix(s3): add HMAC-SHA256 key commitment to SSE-S3 and SSE-KMS (#8879 ) * fix(s3): add HMAC-SHA256 key commitment to SSE encryption modes * fix(s3): validate SSE-S3 IV length before cipher.NewCTR CreateSSES3DecryptedReader took the IV directly from object metadata and passed it to cipher.NewCTR. The standard library panics when the IV length differs from the block size, so a tampered metadata field turned what should be a decryption error into a crash. Validate via the existing ValidateIV helper first. Addresses coderabbit review on PR #8879. * fix(s3): centralize SSE-KMS key commitment in createSSEKMSKey KeyCommitment used to be set explicitly in only one of the four encryption entry points; the multipart-aware variants and the bucket- bound CreateSSEKMSEncryptedReaderForBucket left it empty, so writes through those paths landed with metadata that VerifyKeyCommitment treats as legacy and accepts unconditionally — exactly the downgrade gap the gemini review flagged. Move the HMAC computation into createSSEKMSKey so every helper-driven path picks it up automatically, and add the same call to the bucket- scoped path that builds its own struct literal. Addresses gemini review on PR #8879. * fix(s3): store base IV (not derived IV) for offset-aware SSE-KMS chunks CreateSSEKMSEncryptedReaderWithBaseIVAndOffset used to store the already-offset-derived IV plus the chunk offset in metadata. The decrypt path then applied calculateIVWithOffset a second time to that stored IV, producing the wrong CTR keystream and a decryption mismatch. Centralizing the key commitment made the bug visible as a commitment failure, but the underlying issue predates that change. Pass the base IV through to createSSEKMSKey so sseKey.IV is the unmodified base on disk; the decrypt path's offset application then recovers the correct chunk-level IV. The HMAC commitment binds the base IV — same value the verify call at decrypt time hashes — so the new commitment path stays consistent. Addresses gemini security-high review on PR #8879. * fix(s3): opt-in strict-commitment mode to close the downgrade vector WEED_S3_REQUIRE_KEY_COMMITMENT=true flips VerifyKeyCommitment from accept-when-missing (the AWS-compatible default needed for objects written before commitments shipped) to reject-when-missing. With the env var set, an attacker who strips the commitment field from object metadata can no longer bypass integrity verification — every object must carry a commitment that hashes to the right value. Default stays false so existing legacy objects keep decrypting; the warning the gemini review raised about the silent-downgrade vector is closed for operators who explicitly opt in once their bucket is fully migrated. SetRequireKeyCommitment exposes a runtime seam for tests and future config-reload paths. Addresses the security-medium review on PR #8879. * test(s3): fix mislabelled IV literal in strict-commitment test The "strict-mode-iv-16" literal was actually 17 bytes — the trailing "16" was meant as a comment but counted as content. The IV is fed into HMAC, not AES, so the length didn't matter for behaviour, but the discrepancy was confusing. Tighten to a real 16-byte literal and explain the choice in a comment. Addresses coderabbit minor review on PR #8879. * fix(s3): store KMS-resolved KeyID in SSE-KMS metadata, not the request CreateSSEKMSEncryptedReaderForBucket built its SSEKMSKey with the caller-supplied keyID, but CreateSSEKMSDecryptedReader later compares decryptResp.KeyID against sseKey.KeyID. A request that used an alias would resolve to a different ARN in the response; storing the request form would then trip the mismatch check at decrypt time and surface as a "KMS key ID mismatch" error against the operator's own object. The helper-driven encryption paths already do the right thing via createSSEKMSKey; this is the bucket-bound path catching up. Addresses coderabbit review on PR #8879. * test(s3): cover key commitment rejection paths under tampering Adds the negative-path tests coderabbit flagged as missing: a tampered key, IV, algorithm, or commitment field must fail VerifyKeyCommitment, otherwise a regression in the rejection logic could land silently. The HMAC binds all three inputs plus the commitment itself, so any single mutation is enough. Addresses coderabbit nitpick on PR #8879.	2026-05-04 19:14:41 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>Chris Lu	3ee147dc4d	build(deps): bump cloud.google.com/go/kms from 1.26.0 to 1.30.0 (#9311 ) Bumps [cloud.google.com/go/kms](https://github.com/googleapis/google-cloud-go) from 1.26.0 to 1.30.0. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/kms/v1.26.0...dlp/v1.30.0) --- updated-dependencies: - dependency-name: cloud.google.com/go/kms dependency-version: 1.30.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>	2026-05-04 18:20:18 -07:00
Chris LuandGitHub	66d9b89cd2	fix(iam): deny IAM users with no policies instead of granting full access (#9317 ) * fix(iam): deny IAM users with zero policies instead of falling through to DefaultEffect=Allow A user created via the S3 IAM API with no policies attached was inheriting full S3 access. With `weed s3 -iam` and no explicit IAM config, the policy engine's DefaultEffect defaults to Allow for the in-memory zero-config path. The "no matching statement" guard in IsActionAllowed only triggered when the user already had at least one policy, so a fresh user with PolicyNames=[] slipped through and got allow-all. Track hasManagedSubject whenever the principal resolves to a registered user or role (or PolicyNames are supplied directly) and deny on no-match. The DefaultEffect=Allow fallback now only applies to truly unmanaged callers. * test(iam): cover non-matching attached-policy case for managed-subject deny * test(iam): cover role-with-empty-AttachedPolicies deny path Sibling case to TestIsActionAllowed_RegisteredUserWithoutPoliciesIsDenied: a managed role that resolves but has zero AttachedPolicies must also fall through to deny under DefaultEffect=Allow, not inherit full access. The fix already handles this branch via the hasManagedSubject flag; this test pins the regression class so we don't lose coverage on either side of the user/role split. Addresses coderabbit nitpick on PR #9317.	2026-05-04 17:56:10 -07:00
Chris LuandGitHub	57bb7b2f39	quiet noisy 'shard X not found' log when EC shard lives on another server (#9316 ) quiet "shard X not found" log when EC shard lives on another server When EC shards are spread across multiple volume servers, every read that targets a shard not present locally was logging at V(0) with the same wording as a real read failure. Under rclone-style traffic this floods the logs and makes a healthy EC cluster look broken (issue #9310). Distinguish the two cases with an errShardNotLocal sentinel: the expected fall-through-to-remote path now logs at V(4); genuine local read failures still log at V(0).	2026-05-04 15:30:22 -07:00
Chris Lu	6aa353716a	test(kafka): snapshot consumer-group state mid-attempt for resumption flake The onRetry hook in TestOffsetManagement/ConsumerGroupResumption only fires after defer reader.Close(), so every dump shows the post-LeaveGroup Empty state — useless for diagnosing why the second consumer hangs in the join cycle. Add an onTick callback fired every 1.5s while the reader is still joined so we can see PreparingRebalance / CompletingRebalance churn, leader, and member assignments during the 20s attempt window.	2026-05-04 15:29:33 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	3efd1e8974	build(deps): bump github.com/a-h/templ from 0.3.977 to 0.3.1001 (#9312 ) Bumps [github.com/a-h/templ](https://github.com/a-h/templ) from 0.3.977 to 0.3.1001. - [Release notes](https://github.com/a-h/templ/releases) - [Commits](https://github.com/a-h/templ/compare/v0.3.977...v0.3.1001) --- updated-dependencies: - dependency-name: github.com/a-h/templ dependency-version: 0.3.1001 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-04 15:02:29 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	39cf3cf719	build(deps): bump cloud.google.com/go/storage from 1.60.0 to 1.62.1 (#9313 ) Bumps [cloud.google.com/go/storage](https://github.com/googleapis/google-cloud-go) from 1.60.0 to 1.62.1. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/compute/v1.60.0...storage/v1.62.1) --- updated-dependencies: - dependency-name: cloud.google.com/go/storage dependency-version: 1.62.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-04 15:02:18 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	dee1e12bcd	build(deps): bump github.com/aws/smithy-go from 1.25.0 to 1.25.1 (#9314 ) Bumps [github.com/aws/smithy-go](https://github.com/aws/smithy-go) from 1.25.0 to 1.25.1. - [Release notes](https://github.com/aws/smithy-go/releases) - [Changelog](https://github.com/aws/smithy-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/aws/smithy-go/compare/v1.25.0...v1.25.1) --- updated-dependencies: - dependency-name: github.com/aws/smithy-go dependency-version: 1.25.1 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-04 15:02:06 -07:00
dependabot[bot]GitHubdependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	bfcbd5aa0f	build(deps): bump github.com/klauspost/reedsolomon from 1.13.3 to 1.14.0 (#9315 ) Bumps [github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon) from 1.13.3 to 1.14.0. - [Release notes](https://github.com/klauspost/reedsolomon/releases) - [Commits](https://github.com/klauspost/reedsolomon/compare/v1.13.3...v1.14.0) --- updated-dependencies: - dependency-name: github.com/klauspost/reedsolomon dependency-version: 1.14.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-04 15:01:57 -07:00
Chris LuandGitHub	1de741737d	test(s3tables): add Apache Doris Iceberg catalog integration test (#9307 ) * test(s3tables): add Apache Doris Iceberg catalog integration test Adds an end-to-end smoke test that boots the apache/doris all-in-one container, registers SeaweedFS as an external Iceberg REST catalog (OAuth2 client_credentials), and validates metadata visibility plus the parquet read path against tables seeded via the Iceberg REST API and a PyIceberg writer container, mirroring the existing Trino, Spark, and Dremio coverage. Wires the test into a new s3-tables-tests workflow job. * test(s3tables): document weed shell -master flag format and fill in helper docstrings Restores the explanatory comment on createTableBucket about the host:port.grpcPort ServerAddress format used by `weed shell -master` (produced by pb.NewServerAddress) so the dot separator isn't mistaken for a typo, and adds doc comments for createIcebergNamespace, createIcebergTable, doIcebergJSONRequest, requireDorisRuntime, and hasDocker.	2026-05-04 00:28:30 -07:00
Chris Lu	90b706984e	docs(readme): align Docker quick start with weed mini defaults Replace the old "server -s3" form with a parallel docker run that uses the same env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, S3_BUCKET) as the binary weed mini quick start. Drop the explicit "mini" subcommand since it is the default CMD in the image.	2026-05-04 00:09:13 -07:00
Chris Lu	7f1ac8cf1a	docs(readme): dedup Quick Start, drop "weed server" section Remove the "Quick Start with Single Binary" section which duplicated the binary-download instructions and pushed users toward "weed server". Fold the go install alternative and the volume-scaling tip into the weed mini section so there is one canonical Quick Start path.	2026-05-03 23:56:24 -07:00
Chris Lu	4e22d9533b	docs(readme): highlight built-in Iceberg catalog and lakehouse use case Surface the table-buckets / Iceberg REST Catalog story in the introduction and add a Data Lakehouse Features section so readers landing on the README see that SeaweedFS replaces both the object store and the metastore in an Iceberg stack.	2026-05-03 23:46:08 -07:00

1 2 3 4 5 ...