mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-28 12:41:15 +00:00
8cf42a5abb2a6766aa3d86cd07531c834151a74e
13753 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8cf42a5abb |
test(s3/lifecycle): assert per-goroutine errors in fake-server concurrent test (#9393)
test(s3/lifecycle): assert per-goroutine errors in concurrent fake test The previous TestFake_ConcurrentCallsSerializeWithoutDeadlock dropped the err return from each LifecycleDelete call, so a regression in the concurrent path could pass the length-only assertion. Capture each err on a buffered channel and require.NoError after wg.Wait(). |
||
|
|
ddfb219ec3 |
test(s3/lifecycle): fake LifecycleDelete server (Phase 12 slice) (#9391)
* test(s3/lifecycle): fake LifecycleDelete server for component tests A reusable double for SeaweedS3LifecycleInternalServer with per-key FIFO outcome queues, a fallback Default, and recorded request capture. Tests of the worker pipeline that need to hit the proto boundary can queue up DONE/NOOP/RETRY/FATAL/SKIPPED_OBJECT_LOCK responses per (bucket, objectPath, versionId) and assert dispatch order against Recorded(). SetError flips the server into transport-failure mode without polluting the request log. * test(s3/lifecycle): use struct map key for FakeLifecycleServer queues Bucket / object path / version-id are user-supplied strings that can contain "/" or "@", which would collide if the queue map were keyed by "<bucket>/<object>@<version>". Switch to a struct key so the components stay separate. * test(s3/lifecycle): deep-copy recorded LifecycleDelete requests Tests that mutate a Recorded() entry — or a request pointer they already passed in — were able to corrupt the fake's bookkeeping because the slice carried shared pointers. Clone with proto.Clone at both record and read time so the fake holds an independent snapshot of every arriving request and hands callers an independent snapshot back. Tightened TestFake_VersionIDPartOfKey error checks while there. |
||
|
|
bb0c7c779f |
feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0 (Phase 2b) (#9390)
* refactor(s3): thread metadataOnly into delete helpers Add a metadataOnly bool to deleteUnversionedObjectWithClient and deleteSpecificObjectVersion. When true the helper sends IsDeleteData= false to the filer's DeleteEntry RPC so per-chunk DeleteFile RPCs are skipped — the volume server reclaims chunks on its own at TTL drop. Non-lifecycle callers (DELETE handlers, batch delete) pass false to preserve today's eager-chunk-delete behavior; only the lifecycle handler in the next commit will pass true. * feat(s3/lifecycle): metadata-only delete when entry TtlSec > 0 Per-write TTL stamping (PR 9377) sets Attributes.TtlSec on every lifecycle-fitting entry. When the live entry the LifecycleDelete handler fetched carries TtlSec > 0 the volume server is guaranteed to reclaim chunks at TTL drop, so the filer can skip per-chunk DeleteFile RPCs and just remove the entry record. lifecycleDispatch now computes metadataOnly from the live entry and threads it through the unversioned, suspended-null, and noncurrent/expired-marker delete paths. createDeleteMarker is unaffected — it creates a marker, never deletes chunks. |
||
|
|
255e9cd0f7 |
test(s3/lifecycle): cover reader cursor + Run validation contracts (#9389)
* test(s3/lifecycle): cover reader cursor + Run validation contracts Layer 2 tests pinning four reader-package contracts the dispatcher pipeline depends on: MinTsNs anchors at frozen positions, Snapshot returns a deep copy in both directions, Restore replaces (not merges), and Run validates ShardID/Events/BucketsPath before subscribing. * test(s3/lifecycle): tighten cursor composition assertions Snapshot deep-copy: also assert cursor doesn't see keys added to the returned map. Restore replace: freeze before second Restore and assert IsFrozen returns false after, pinning the contract that Restore wipes frozen state alongside the value map. Run validation: bound the call with a 5s context timeout so a regression that lets Run reach the nil client surfaces as a failure instead of a hang. |
||
|
|
aa280443e7 |
test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher (#9387)
* test(s3/lifecycle): Layer 2 multi-shard composition for the dispatcher The existing dispatcher unit tests cover individual outcomes (DONE / RETRY_LATER / BLOCKED / etc.) on a single shard, and pipeline_test.go has only one end-to-end happy-path assertion. Multi-shard composition — the contract Pipeline.Run wires up at runtime — was untested at the component level. Add four Layer 2 tests in dispatcher/multi_shard_test.go: Two events for two shards land in different schedules, dispatch independently, and each cursor advances only for its own event (no cross-contamination on the action-key map). A poison event on shard 0 returns BLOCKED and freezes shard 0's cursor; shard 1's normal event continues to dispatch and its cursor advances. Per-shard isolation contract. Save/Load round-trips a per-shard cursor snapshot through the Persister: a fresh dispatcher restores the same TsNs map. Pins the contract Pipeline.Run drives on the checkpoint ticker. RETRY_LATER respects RetryBackoff against the wall clock — a Tick within the backoff window doesn't re-dispatch; a Tick past the new DueTime does. Guards against premature retries from refresh ticks landing inside the backoff. Pipeline.Run itself can't run here (it builds a real reader.Reader), so the tests share the same fakeClient pattern dispatcher_test.go uses and drive Tick directly. * test(s3/lifecycle): drop unused snapshot helper and addAndTick parameter |
||
|
|
1854101125 |
feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386)
* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) scan_only actions only fire from the bootstrap walk: the engine classifies a rule as scan_only when its retention horizon exceeds the meta-log retention, so event-driven routing can't be trusted. Today each bucket walks once per process, so a long-running worker never revisits — scan_only retention only catches up when the worker restarts. Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk (name -> completion time). KickOffNew now re-walks a bucket whose last walk completed more than BootstrapInterval ago. Zero interval preserves the legacy walk-once-per-process behavior so existing deployments don't change cadence by default. walkBucket re-stamps on success and clears the stamp on failure (via MarkDirty), so the next KickOffNew picks failed walks back up. Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and a Now func() for testable time travel. weed shell run-shard grows --bootstrap-interval (cadence knob) and --force-bootstrap (drop in-memory state at startup so every bucket walks again immediately, useful when a config change should take effect without a restart). Tests: cadence respected (skip inside interval, re-walk past it); zero interval keeps once-per-process; MarkDirty forces re-walk under a 24h interval; MarkAllDirty resets every record. The fakeClock helper guards the test clock with a mutex so race-detector runs are clean. * fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag Three issues with the Phase 8 cadence work as it landed: 1. lastWalk did double duty as both completed-walk timestamp and in-flight debounce. A walk that took longer than BootstrapInterval would have a fresh KickOffNew start a duplicate goroutine on the next refresh tick because the stamp from KickOffNew looked stale against the interval. Split into lastCompleted (set on success) and inFlight (set on dispatch, cleared after the walk goroutine returns success or failure). KickOffNew skips inFlight buckets regardless of cadence. 2. The cadence knob existed on `weed shell` but not on the production path: scheduler.Scheduler constructed BucketBootstrapper without BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had no field for it. Add Scheduler.BootstrapInterval, parse `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk- once-per-process; negative clamps to zero), and forward it from the handler. Tests cover default, override, clamp, and explicit-zero. 3. --force-bootstrap was a no-op: BucketBootstrapper is freshly allocated at command start, so MarkAllDirty on empty state does nothing, and the flag couldn't influence an already-running process anyway. Remove it; a real runtime trigger (SIGHUP, control RPC) is a separate change. In-flight regression: a blockingInjector pins the first walk in progress while the test advances the clock past the interval. The second KickOffNew is a no-op (inFlight check). After release, the post-completion KickOffNew within the interval is also a no-op. * test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock The cadence test polled listedN to know "the walk happened" — but that fires once both list passes are issued, while the success-stamp lands later, after walkBucketDir returns. A clock.Advance(30m) between those two events would record the stamp at clock+30m instead of T0; the next assertion would then see now.Sub(last) < 1h and skip the expected re-walk. Tight in practice but exposed under -race / load. Add a waitForCompleted helper that polls b.lastCompleted directly, and use it before each clock advance in both the cadence and zero- interval tests. * fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks Two follow-ups on Phase 8. The worker config descriptor had no bootstrap_interval_minutes field, so the production operator UI couldn't enable the cadence — only the internal ParseConfig + Scheduler wiring knew about it. Add the field to the cadence section (MinValue=0 since 0 is the legacy default) and include the default in DefaultValues so existing deployments see the knob with the right preset. MarkDirty / MarkAllDirty silently lost their effect when a walk was in flight: the methods cleared lastCompleted, but the walk's success path then wrote a fresh timestamp, hiding the operator's invalidation. Track a pendingDirty set; the walk goroutine consumes the flag on exit and skips the success stamp, so the next KickOffNew picks the bucket up immediately. Regression: pin a walk in progress with a blockingInjector, MarkDirty the bucket, release the walk, and assert lastCompleted stayed empty plus the next KickOffNew triggers a new walk inside the BootstrapInterval window. * refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty These methods were the operator-hook half of Phase 8, but the only caller (--force-bootstrap on the shell command) was removed when it turned out to be a no-op against a freshly-allocated bootstrapper. Nothing in production calls them anymore. Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty set, the dirty-suppression branch in walkBucket, and the three tests that only exercised those methods. BootstrapInterval-driven re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP, control RPC) is a separate change with a real call site. |
||
|
|
edfa1ce210 |
feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4) (#9385)
* feat(s3/lifecycle): pointer-transition routing for live PUTs (Phase 5b/4) Bootstrap covers existing versions, but a live PUT that creates a new .versions/<v-new> file and updates the parent's ExtLatestVersionIdKey didn't fire NoncurrentDays / NewerNoncurrent on the displaced prior version until the next bootstrap. Close that runtime gap. The meta-log already emits an Update event for the .versions/ directory itself when the latest pointer changes; the router was dropping it because buildObjectInfo returns nil for directories. New branch in Route detects that shape (versioned bucket, NewEntry + OldEntry both directories with the .versions/ suffix, ExtLatestVersionIdKey changed, ID different from the new ID) and emits a Match against the LOGICAL key with VersionID=oldID. Match.Identity comes from a single LookupVersion RPC for the displaced version file; SuccessorModTime is the directory update's mtime, which is the moment the displaced version became noncurrent. SiblingLister grows LookupVersion(bucket, key, versionID) for that single-RPC fetch. filerSiblingLister implements it; routing path treats NotFound as "displaced version was hard-deleted in the meantime, suppress" rather than an error. The router gates the lookup on at least one active event-driven NoncurrentDays / NewerNoncurrent rule for the bucket, so most buckets pay nothing per directory update. Tests: pointer-flip fires NoncurrentDays with displaced version_id; unchanged pointer skips; empty old pointer skips (first-PUT scenario); displaced-version NotFound suppresses; no-rule skips lookup; NewerNoncurrentVersions retains rank-0; unversioned bucket skips. * fix(s3/lifecycle): SuccessorModTime cache + NewerNoncurrent expansion Two correctness gaps in pointer-transition routing. The .versions/ directory's own Attributes.Mtime is preserved across pointer updates by updateLatestVersionInDirectory: it's a stale clock relative to the freshly-written latest version. Using it as the displaced version's SuccessorModTime made NoncurrentDays compute due = staleMtime + days, which fires immediately on a fresh PUT into an old .versions/ container. Read ExtLatestVersionMtimeKey written by setCachedListMetadata; suppress (return no matches) when the cache is missing rather than fall back to dir mtime. Single-oldID lookup is only enough for pure NoncurrentVersionExpirationDays. Any rule with NewerNoncurrentVersions > 0 cares about the noncurrent ranks, and a pointer flip shifts every prior noncurrent's index by one — the version that just crossed the keep-count threshold needs to be evaluated too. When any matching rule needs ranks, list the full .versions/ container, sort newest- first with mtime + version-id tiebreak, and route every noncurrent with its real index. Identity-CAS dedups against earlier schedules. SiblingLister grows ListVersions(bucket, key); filerSiblingLister's implementation paginates the container fully. Two regression tests: stale dir mtime + correct cached mtime schedules ~30 days out (not immediate); NewerNoncurrentVersions=2 with 4 versions fires on the rank-2 entry that just crossed the threshold while rank-0/1 are retained. * fix(s3/lifecycle): bound pointer-transition expansion to threshold crossings routePointerTransitionExpand emitted a Match for every eligible noncurrent on every PUT. Schedule.Add doesn't dedup, so identity-CAS at dispatch only saved the wasted RPC, not the heap slot. A hot key with many already-eligible versions and a count rule would push O(versions) entries per flip, repeatedly, until dispatch caught up. Bound the emission to versions that newly entered eligibility on this specific flip: rank 0 (the displaced version, for the NoncurrentDays clock) plus rank == rule.NewerNoncurrentVersions for each active count-gated rule (the version that just crossed from kept to expired). Bootstrap still owns full backfill for versions that were already over-threshold. Adds a regression with 6 versions and NewerNoncurrentVersions=2: asserts only the rank-2 entry that just crossed fires, not the already-over-threshold rank-3/rank-4 entries. * fix(s3/lifecycle): suppress pointer-transition expansion when newID missing routePointerTransitionExpand defaulted latestPos to 0 if newID wasn't found in the listing. That made the actual newest sibling latest against the pointer's intent, then misranked every other version. A race between the pointer write and the version write could land us there. Default latestPos to -1, set it only on a real match, and suppress the expansion when the search misses. Bootstrap repairs state on the next walk. The NewerNoncurrentVersions retention test was setting only lookupEntry, so Route never reached the expansion path it claimed to exercise. Repoint to listVersions and assert ListVersions was consulted while LookupVersion was not. Adds a regression covering the missing-newID suppression directly. * fix(s3/lifecycle): include bare null version in pointer-transition routing Bootstrap models the bare-key object as a "null" sibling alongside .versions/ children, but the live pointer-transition path didn't. Two cases lost: 1. oldID == "" was treated as "nothing displaced". A pre-versioning bare object becomes noncurrent when the first versioned PUT lands and the pointer flips to a real id, but live routing skipped it and waited for the next bootstrap. 2. The expansion path's ListVersions returned only .versions/ children. With a bare null in the picture, the noncurrent ranks were wrong, so NewerNoncurrentVersions could keep the wrong versions and delete the right ones (or vice versa). SiblingLister grows LookupNullVersion(bucket, key) returning the bare entry plus an explicit-null flag (matches the bootstrap shape). filerSiblingLister implements it via util.NewFullPath + filer_pb.LookupEntry. routePointerTransitionDisplaced: oldID == "" now consults LookupNullVersion. When the bare entry exists, route it as VersionID="null" against the LOGICAL key. routePointerTransitionExpand: collect .versions/ children plus the null entry into one sibling slice before sorting and ranking. The threshold-crossing logic now sees the same N-version set that bootstrap would compute. Three new tests: oldID == "" with no null is a no-op (one null lookup, no version lookup); oldID == "" with a bare null schedules NoncurrentDays as VersionID="null"; expansion with a bare null between .versions/ siblings places null at its mtime-correct rank and only that rank-N entry fires. * fix(s3/lifecycle): atomic listPageSize so test cleanup doesn't race KickOffNew dispatches walks via `go b.walkBucket(...)`. A test that finishes before its goroutines drain leaves them running into the next test's t.Cleanup, which mutates listPageSize. -race spots the read/write collision intermittently. Convert listPageSize to atomic.Uint32; tests use Load/Store. No production semantics change. * fix(s3/lifecycle): null becomes latest when suspended PUT clears pointer The router treated newID == "" as if the cached ExtLatestVersionMtimeKey were still authoritative — but that cache holds the displaced version's mtime, written by setCachedListMetadata when the prior version became latest. Using it as SuccessorModTime made NoncurrentDays=30 immediately fire on a 100-day-old displaced version even though it became noncurrent today. When newID == "" the bare null is the new latest. Look it up, substitute its mtime as the successor clock, and substitute "null" as the latestPos target for the expansion path's id match. Both displaced and expand paths now derive the right clock. updateIsLatestFlagsForSuspendedVersioning was the upstream cause of the staleness — it cleared ExtLatestVersionIdKey and FileNameKey but left the cached size/mtime/etag/owner/delete-marker behind. Call clearCachedVersionMetadata so the .versions/ container is consistent with "null is latest". The router-side guard is still needed for older deployments that ran the buggy code, but new writes won't exercise the workaround. Two regressions: 100-day-old displaced under NoncurrentDays=30 with a today-null PUT schedules ~30d out (not immediate); same shape with NewerNoncurrentVersions=2 ranks the null at latest and only the rank-2 entry fires. |
||
|
|
2f7ac1d664 |
feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3) (#9383)
* feat(s3/lifecycle): NoncurrentVersionExpiration via bootstrap (Phase 5b/3)
Bootstrap now expands every <key>.versions/ directory into one event
per version with sibling state pre-computed. The router fires
NoncurrentDays / NewerNoncurrent off these events using
SuccessorModTime as the noncurrent clock; previously these rules
never ran on a versioned bucket because buildObjectInfo couldn't
classify version-folder events without the latest pointer.
Mechanics
walkBucketDir treats a directory ending in .versions and carrying
ExtLatestVersionIdKey as a SeaweedFS .versions container — emit it
once and skip the recursion. Coincidentally-named directories without
the latest pointer recurse normally.
BucketBootstrapper.expandVersionsDir lists the children, sorts
newest-first by mtime, resolves the latest position from the pointer,
and injects a synthesized reader.Event per version with
BootstrapVersion populated. NoncurrentIndex is 0-based among
noncurrents in newest-first order; SuccessorModTime is the immediate
newer sibling's mtime (zero for the latest). Pointer naming a missing
or absent version falls back to the newest-by-mtime sibling so a
race window can't flag every entry as noncurrent.
routeBootstrapVersion uses BootstrapVersion to build ObjectInfo
directly (bypassing the version-folder skip in buildObjectInfo) and
runs the standard match loop. ABORT_MPU is excluded by kind-shape
gate. The schedule clock uses SuccessorModTime for noncurrents and
ModTime for the latest, so the dispatcher fires when the rule's days
threshold is met. Match.ObjectKey is the LOGICAL key,
Match.VersionID is the marker's stored version_id — the dispatcher
reaches deleteSpecificObjectVersion or createDeleteMarker correctly.
Layer 2 tests cover both sides. Router: latest fires ExpirationDays;
noncurrent fires NoncurrentDays; NewerNoncurrentVersions retains the
N newest noncurrents; ABORT_MPU never matches. Bootstrap: .versions
dir emitted once and not recursed; missing latest pointer falls back
to newest; backdated PUT (latest pointer is older by mtime) keeps
the right noncurrent index; delete-marker flag propagates.
* fix(s3/lifecycle): no VersionID for latest expirations, child-based dir disambig
Two correctness gaps in Phase 5b/3.
Bootstrap was pinning the version_id on every Match. For
EXPIRATION_DAYS / EXPIRATION_DATE on the latest version this is
unsafe: between schedule and dispatch a fresh PUT can land, the
dispatcher would still identity-match against the original version's
bytes (it still exists at that path) and the resulting delete marker
would hide the new latest. Drop VersionID for those kinds; an empty
VersionID makes the dispatcher fetch the current latest, where
identity-CAS resolves to STALE_IDENTITY and bootstrap re-schedules
with the new latest's identity. NoncurrentDays / NewerNoncurrent /
EXPIRED_DELETE_MARKER still pin the version_id since those are
version-targeted.
isVersionsDir gating on ExtLatestVersionIdKey lost a race window:
createDeleteMarker writes the version file before updating the
parent's Extended pointer, so a walk between those two steps would
see a .versions/ dir without the pointer, recurse into it, and emit
raw version files that the router drops. Match the suffix only and
let expandVersionsDir disambiguate by child inspection: if any child
carries ExtVersionIdKey it's a real .versions container and we expand;
otherwise it's a coincidentally-named user folder and we recurse via
the bucket-walk's own callback so nested entries still flow through.
Tests: latest-expiration assertion flipped to expect empty VersionID;
new tests cover the coincidentally-named-folder recursion and the
race-window expansion (children present, pointer absent).
* fix(s3/lifecycle): filter directory + missing-version-id children at listing
expandVersionsDir's listing callback collected every child with
attributes; subdirectories or entries without ExtVersionIdKey would
make it past the empty-id skip in the inner loop but still inflate
NumVersions and skew NoncurrentIndex (the rank derives from the
filtered slice's position, which was wrong when the unfiltered slice
was sorted). Drop directories at listing time and partition the
file children into a versions slice that's the actual rank source.
Test cleanups: out-of-order-mtime test now sets v1 older than v2 so
latestPos > 0 actually exercises the rank-skip branch in
expandVersionsDir; bootstrapVersionEntry preserves nanosecond
precision via MtimeNs to match markerLoneEntry's pattern; drop a
leftover unused idx variable.
* fix(s3/lifecycle): null version + canonical version-id tiebreak
Two correctness gaps in Phase 5b/3 bootstrap.
Null versions live at the bare logical path, not under .versions/.
Bootstrap previously expanded only .versions/<key>/ children, so:
- pre-versioning objects with newer .versions/ history never had
their null version expired by NoncurrentDays
- suspended-bucket writes (which clear the .versions/ latest pointer
so null becomes current) had every .versions/ child wrongly
classified as latest by the buildObjectInfo fallback
expandVersionsDir now looks up the bare key via NewFullPath +
LookupEntry, accepts a regular file or an explicit S3 directory-key
marker (Mime set), and folds it into the sibling set with
VersionID="null". Latest resolution: pointer present + names a real
id wins; pointer absent + null exists makes null latest; otherwise
falls back to newest sibling. The walker's regular emission for the
bare entry would otherwise duplicate, so walkBucketDir now does a
two-pass walk per directory level — .versions/ first, then everything
else with a per-walk skipBare set keyed by bucket-relative path that
expandVersionsDir populates when it claims a bare null sibling.
Sort tiebreak: PUTs only set second-level Mtime, so two versions
written in the same second tied. The unstable secondary order let
old-format version filenames sort oldest-first and corrupt
NoncurrentIndex under NewerNoncurrentVersions retention. Add
CompareVersionIds to s3lifecycle/version_time.go (mirrors the
canonical comparator in s3api/s3api_version_id.go to avoid the
import cycle) and use it as a secondary key after mtime equality.
Tests: pre-versioning null-as-noncurrent, suspended null-as-current,
directory-key marker as null version, end-to-end claim through
walkBucketDir's two-pass ordering, and same-second tiebreak via
canonical version-id ordering. fakeFilerClient grows a
LookupDirectoryEntry implementation backed by the same in-memory tree.
* fix(s3/lifecycle): only treat explicit-null bare entries as current
The pointer-missing branch in expandVersionsDir made null latest as
soon as a bare object was found. That's correct for suspended-bucket
writes (s3api_object_handlers_put.go writes the bare entry with
ExtVersionIdKey="null") but wrong for the pre-versioning race window:
a brand-new version under .versions/<file> exists before the parent's
ExtLatestVersionIdKey update lands, and a pre-versioning bare object
has no version-id marker. Marking that older bare object latest hides
the real new version and skips noncurrent expiration of the null
until the next process restart/bootstrap.
Distinguish the two: lookupNullVersion now returns whether the bare
entry's Extended map carries ExtVersionIdKey="null" (the suspended
write marker). expandVersionsDir's pointer-missing branch only
promotes null to latest when explicit; otherwise it falls back to
newest-sibling, which is safe for the race window since the new
version's mtime is fresher than the bare object's.
The existing suspended-null test now uses a new helper that adds the
explicit marker. New regression test covers the race window: bare
entry without the marker + a fresh .versions/<v1> file + missing
parent pointer must keep v1 as latest and the null as noncurrent.
* fix(s3/lifecycle): only the newest item can be the explicit-null latest
The pointer-missing branch in expandVersionsDir scanned every item for
an explicit null and promoted it to latest. After a suspended->enabled
transition that's the wrong call: createVersion writes the version
file before updating ExtLatestVersionIdKey, so a bootstrap that lands
in the race window sees an older bare null with ExtVersionIdKey="null"
plus a newer .versions/<v-new> child and no parent pointer. Promoting
the null misclassifies v-new as noncurrent and skips both the new
version's current-version expiration and the null's noncurrent
scheduling until the next bootstrap.
Constrain the explicit-null branch to items[0]: if the suspended-null
write is genuinely current it'll be the newest by mtime AND tagged.
Anything else falls through to the newest-sibling default.
Adds a regression test for the suspended->re-enabled race.
* fix(s3/lifecycle): paginate bootstrap directory listings
SeaweedList(..., limit=0) is a single-page request: the filer caps
limit=0 at DirListingLimit (1000 by default) and returns whatever fits
in one round trip. expandVersionsDir and walkBucketDir both relied on
that, so any directory bigger than the cap silently truncated. For
noncurrent retention this is correctness, not just scale — a hot key
with more versions than the cap had its rank/sort math computed off
the first page only, NumVersions, NoncurrentIndex, SuccessorModTime,
and the latest-fallback all wrong, with the older versions never
scheduled until a future bootstrap.
Add a listAll helper that drives pagination via StartFromFileName +
inclusive=false, looping until a page returns fewer entries than the
configured page size. Use it in both call sites. Page size is a var
(listPageSize, default 1024) so tests can shrink it without
generating thousands of entries.
The fake filer client now mirrors the real semantics: sort children
by name, honor StartFromFileName/InclusiveStartFrom, cap at Limit.
New regression tests force a small page size and assert the full
result set is processed and the call count matches what pagination
should drive.
* perf(s3/lifecycle): stream bucket walk in two passes instead of buffering
walkBucketDir was paginating into a children slice and then iterating
twice (pass 1: .versions/, pass 2: everything else). For flat buckets
with millions of entries the buffer is a real memory spike. Drop the
materialization: each pass now drives its own listAll over the same
directory and acts on entries as they stream in. The skipBare ordering
contract is preserved — pass 2 still runs after pass 1 finishes — and
the per-pass paging keeps memory bounded by listPageSize.
Tradeoff: each directory level is listed twice. For workloads where
that matters more than the memory headroom, we can revisit; the
correctness/scale dial here is what the noncurrent rules need.
Updated three tests for the new call count: each walk now records 2
listings per directory (pass 1 + pass 2). The KickOffNew dedup tests
expect 2 calls per bucket; the pagination test expects 6 instead of 3.
|
||
|
|
1c917ffacb |
fix(volume): sticky EIO quarantine; track streamed reads (#9384)
Two follow-ups on PR #9382: 1. Quarantine wasn't sticky. Once CollectHeartbeat crossed the streak threshold and hid the replica, a subsequent successful read called checkReadWriteError(nil), wiping the streak; the next heartbeat then re-announced the suspect replica as read-only and master could send reads back to a disk that already failed IoErrorTolerance. Added an ioErrorQuarantined sticky flag set on the first heartbeat that observes the threshold and cleared only by MarkVolumeWritable (resetIoErrorState). clearIoError continues to reset just the streak so successful ops don't accumulate phantom errors. 2. Streamed reads bypassed the EIO counter. readNeedleDataInto and ReadNeedleBlob — the hot paths for large/range GETs — returned ReadNeedleData / needle.ReadNeedleBlob errors without threading them through checkReadWriteError, so a disk failing only on those paths would never trip IoErrorTolerance. Both now route the backend error through the tracker, and a fully clean readNeedleDataInto call clears the streak. Tests cover the sticky flag (TestQuarantineIsSticky) and the streamed read path (TestReadNeedleBlobTracksEIO via a fake EIO backend). |
||
|
|
7c60407897 |
fix(volume): don't nuke local data on transient IO error (#9378) (#9382)
* fix(volume): don't nuke local data on transient IO error (#9378) A single syscall.EIO from any read/write/delete set v.lastIoError, and the next CollectHeartbeat then called Volume.Destroy on the replica — removing the .dat/.idx/.vif/.sdx/.ldb/.rdb files. A brief NFS / fabric / controller blip hitting several replicas at once could cascade into removal of the last healthy copy, with no recovery for non-tiered volumes. Now require IoErrorTolerance (3) consecutive EIOs before acting, and on that threshold mark the volume read-only and stop announcing it to the master so re-replication kicks in from healthy peers — never delete the data files. The on-disk copy stays for operator inspection / recovery. * review: fix race, accounting, recovery, non-EIO streak break Addressing PR #9382 review: - Data race on lastIoError: guard lastIoError + lastIoErrorCount with a RWMutex and expose them through note/clear/get helpers so the heartbeat reader sees a consistent snapshot. Verified with -race. - Collection-size accounting: when a volume is quarantined for sustained EIO, skip the entire per-volume bookkeeping (`continue`) instead of flipping shouldDeleteVolume — the old branch subtracted a size that was never added, dragging the collection gauge to zero / negative. - Recoverability: MarkVolumeWritable now also calls clearIoError so an operator can rejoin a quarantined replica. The next failed op re-arms the streak if the disk is still bad. - Non-EIO streak break: a non-EIO error (e.g. ENOSPC) now resets the consecutive-EIO counter, so a sequence EIO,EIO,ENOSPC,EIO is treated as a streak of one — the counter only tracks consecutive EIOs. Reads already call checkReadWriteError (volume_read.go), so successful reads also clear the streak — no change needed there. |
||
|
|
c6ad6dcf74 |
feat(s3/lifecycle): sole-survivor delete-marker routing (Phase 5b/2) (#9381)
* feat(s3/lifecycle): sole-survivor delete-marker routing (Phase 5b/2)
Production stores delete markers under <object>.versions/<version-file>;
buildObjectInfo skips every version-folder event because it can't tell
current from noncurrent without sibling state. EXP_DM is the one rule
that can be routed from the file path alone: if the marker is the only
entry under .versions/<key>/ then it's necessarily the latest.
Add SiblingLister; gate the listing on (versioned, version-folder path,
delete-marker entry, EXP_DM rule active for the bucket, lister supplied)
so non-marker version writes pay nothing. The Match carries the LOGICAL
key in ObjectKey and the marker's version_id in VersionID so the
dispatcher can reach deleteSpecificObjectVersion(bucket, logical, vid);
without a version_id the dispatcher would BLOCK and freeze the cursor.
Server-side dispatch re-checks sole-survivor before deleting: another
PUT can land between schedule and dispatch, and identity-CAS only covers
the marker entry, not the directory shape. Lists .versions/<key>/ with
limit 2 and returns NOOP_RESOLVED if count != 1 or the surviving entry
is not the same version.
Fix isDeleteMarkerEntry to compare string(v) == "true" — production
writes []byte("true"), not {1}; every other reader of ExtDeleteMarkerKey
in this repo uses the string predicate.
filerSiblingLister.Count caps at limit 2 — callers only distinguish
"sole survivor" (1) from "more than one" (>=2).
Follows #9373.
* fix(s3/lifecycle): gate sibling listing on active event-driven EXP_DM
hasActionKind tested key presence only; an EXP_DM rule whose action was
inactive (BootstrapComplete=false) or scan-only would still trigger the
sibling-listing RPC even though no match could fire. Replace with a
helper that mirrors the per-key filter Route already applies before
emitting matches: kind matches, snap.Action returns non-nil, IsActive,
and Mode == ModeEventDriven.
Add a regression test that builds a versioned snapshot with the rule
but no PriorStates (action stays inactive) and asserts the lister is
never consulted.
* chore(s3/lifecycle): trim verbose comments
* fix(s3/lifecycle): null version, hard-delete trigger, latest pointer
Three correctness gaps in EXP_DM routing.
1. Null version. SiblingLister.Count saw only .versions/<key>/, but a
pre-versioning bare-key object survives outside that folder when
versioning is enabled later. Marker + null would look like count==1
and EXP_DM would re-expose the old object. Replace Count with
Survivors which also reports HasNullVersion; route- and dispatch-time
suppress when set.
2. Hard-delete trigger. The router skipped events with NewEntry==nil so
a noncurrent hard-delete that left the marker as sole survivor never
fired EXP_DM. Allow version-folder hard-deletes through; the lister's
LoneEntry carries the surviving marker's version_id and identity.
3. Latest pointer. checkSoleSurvivorMarker only checked count and
filename. The worker can race createDeleteMarker between the file
write and the .versions/ directory metadata update. Require
ExtLatestVersionIdKey == versionId; missing pointer returns
retry-later instead of deleting.
Adds a null-version exists check on the dispatch path too.
* fix(s3/lifecycle): normalize null-version lookup errors and detect dir markers
Two correctness bugs in the null-version check.
Route-side: filerSiblingLister called client.LookupDirectoryEntry
directly. SeaweedList wraps not-found via filer_pb.LookupEntry, which
normalizes the gRPC string-mapped not-found into ErrNotFound. The raw
client returns it as a generic error instead, so every absent
null-version (the common case) bubbled up as an error and the router
suppressed every otherwise-valid match. Use filer_pb.LookupEntry.
Both sides: explicit S3 directory-key markers (object names ending in /)
are stored as directory entries with Attributes.Mime set;
processExplicitDirectory in the listing path treats them as null
versions. The previous check was !IsDirectory only, which let the
marker delete proceed and re-expose the directory key. Add
IsDirectoryKeyObject() to both predicates. Also use
util.NewFullPath(...).DirAndName() for the parent/name split so a
trailing-slash key resolves to the same underlying entry path as the
listing code.
* fix(s3/lifecycle): EXP_DM ctx propagation, nil-entry guard, fast-path skip
Three small follow-ups on the EXP_DM dispatch path.
checkSoleSurvivorMarker now takes ctx instead of context.Background()
so worker shutdown / deadlines cancel the SeaweedList RPC instead of
stalling.
If SeaweedList fires the lone callback with entry==nil, firstName
stays empty and the marker-replaced check would short-circuit; that's
the one shape that bypasses the dispatch guard, so retry-later instead.
routeSoleSurvivorMarker now skips the Survivors RPC on regular
non-marker version creates — those always have Count >= 2, so the
listing was wasted load on every versioned write under an EXP_DM rule.
Hard-delete events (NewEntry==nil) and marker creates still flow
through. Added a regression test asserting the regular-create case
doesn't consult the lister.
Documented that logicalKeyFromVersionPath rejects bucket-root markers
intentionally.
|
||
|
|
196c41d21a |
test(s3/lifecycle): cover scheduler/bootstrap walker + MPU detection (#9380)
Locks down isMPUInitDir, walkBucketDir (regular/nested/MPU/error/cancel paths), and BucketBootstrapper.KickOffNew (per-bucket fanout and in-process dedup) against a fake SeaweedFilerClient. |
||
|
|
ee1d8f9e8c |
refactor(s3api): drop filer.conf TTL routing from PUT lifecycle (#9379)
PutBucketLifecycleConfiguration used to install /buckets/<bucket>/<prefix> day-TTL entries in filer.conf so the volume server's RocksDB compaction filter would expire matching writes. With 9377 the s3api server now stamps volume TTL per-write via LifecycleTTLResolver off the stored XML, which covers the same prefix-only Expiration.Days subset and additionally handles size filters and AWS overlapping-rule precedence. Maintaining both paths means a rule change has to mutate two stores in lockstep, and the filer.conf path can't represent everything the resolver does. Drop the add path. Keep a one-way cleanup loop so an upgrade still wipes day-TTL entries written by older builds — otherwise a stale entry would silently double-stamp writes (volume server expires under the old rule) or contradict the new XML after a rule change. Also removes resolveLifecycleDefaultsFromFilerConf (no longer needed) and the versioning-fast-path guard (the resolver itself returns nil for versioned/object-lock buckets, covered by TestNewLifecycleTTLResolver_NilOnVersionedBucket). Tests covering the deleted helpers are deleted with them; the GET fallback that synthesizes lifecycle rules from existing filer.conf TTLs is unchanged so users who historically configured TTL via filer.conf directly still see a rule. |
||
|
|
2458f6c81c |
feat(s3api): apply lifecycle TTL at write time (#9377)
* feat(s3api): apply lifecycle TTL at write time The S3 server already has the bucket's lifecycle XML at PUT time (via the cached BucketConfig), so volume-TTL routing is just a per-write decision instead of something that needs a separate filer.conf projection kept in sync via operator commands. - BucketConfig caches the canonical Rules parsed from the lifecycle XML once on load (BucketConfigCache invalidates on Put/Delete Lifecycle, so the rules stay current automatically). - resolveLifecycleTTLForWrite walks the cached rules: longest-prefix match, applies tag and size filters against the request, returns Days * 86400. Versioned buckets, non-Expiration.Days rules, and unevaluable size filters (no Content-Length) yield 0 — the lifecycle worker handles those at scan time. - putToFiler resolves TTL once and passes it through both the AssignVolumeRequest (so chunks land on a TTL volume) and the new entry's Attributes.TtlSec (so the filer's RocksDB compaction also expires the metadata). Lifecycle XML PUT/DELETE now influences write routing immediately — no operator command, no filer.conf bookkeeping. The lifecycle worker remains authoritative for the cases the fast path can't cover (existing objects via bootstrap, versioned buckets, noncurrent retention, abort-MPU, tag/size filters that didn't hold at PUT time). CompleteMultipartUpload and CopyObject still need wiring; left for follow-ups so this PR stays scoped. * perf(s3api): pre-filter and sort lifecycle rules for the per-PUT TTL walk resolveLifecycleTTLForWrite walked every lifecycle rule on every PutObject, including disabled / non-Expiration.Days rules that could never fire on the fast path, and computed "longest prefix wins" via a running max instead of an early exit. Cache a pre-filtered + pre-sorted slice in BucketConfig: - buildTTLFastPathRules drops everything except Status=Enabled + ExpirationDays>0; - sorts by descending prefix length (stable, so equal-length rules keep their XML order). The resolver returns on first prefix+filter match. A bucket whose lifecycle XML has no Expiration.Days rules is now O(1); a typical bucket with one Expiration.Days rule walks one HasPrefix per PUT. The cache is built once per bucket-config load. PutBucketLifecycle / DeleteBucketLifecycle already invalidate the cache, so the fast-path slice stays current automatically. * refactor(s3api): LifecycleTTLResolver object + four review fixes Pulls the per-PUT TTL resolution into a dedicated type so the bucket config holds one object instead of a slice + magic-walk function: - LifecycleTTLResolver wraps the pre-filtered, pre-sorted rules. nil-safe Resolve so the call site doesn't have to special-case buckets with no eligible rules. Four review findings: 1. (high) drop tag-filtered rules from the fast path. Tags are mutable post-PUT via PutObjectTagging but volume TTL is irreversible — an object that matched at write time would still expire after the tag was removed. Worker re-evaluates current tags at scan time. Fast path now keeps only stable predicates: prefix and size. 2. (high) move TTL resolution out of putToFiler. MPU parts, copy-part destinations, and other transient writes called putToFiler with object="" — bucket-wide rules (empty Prefix) matched and bound a TTL clock starting at part-upload time, before CompleteMultipartUpload existed. putToFiler now takes an explicit ttlSec parameter; only the user-visible PutObject paths (PutObjectHandler, postpolicy) feed it from the resolver. MPU and copy-part pass 0. 3. (medium) AWS overlapping-rule precedence is "shorter expiration wins", not "longest prefix wins". Sort by ExpirationDays ascending so the first prefix match is also the shortest applicable rule. 4. (medium) overflow no longer caps at math.MaxInt32 seconds (~68y). A longer policy would have expired early. Return 0 instead so the worker enforces the actual policy on its own schedule. Versioning gate moves into the resolver constructor — versioned buckets get a nil resolver. The five putToFiler callers all updated: PutObjectHandler + postpolicy resolve via lifecycleTTLForObjectWrite, suspended/versioned wrappers pass 0 by construction, MPU part and copy-part SSE pass 0 with a one-line comment about why. * refactor(s3api): drop unused BucketConfig.LifecycleRules field The full canonical rule set was set on every bucket-config load but never read — resolveLifecycleTTLForWrite worked off the resolver's filtered slice, and the lifecycle worker reads bucket entries straight off the meta-log instead of this cache. Remove the field and its s3lifecycle import. * perf(s3api): pre-compute LifecycleTTLResolver hot-path fields Resolve was doing per-call work that's actually constant per bucket- config load: int64 multiplication, max-int32 overflow check, field indirections through *s3lifecycle.Rule. Move it to the constructor and pack the rule into a compact ttlRule (prefix + ttlSec int32 + sizeGT/sizeLT) so the inner loop is HasPrefix → optional size check → return. Drop overflowing rules at construction rather than handling per- resolve: capping would expire long policies early, and returning 0 in the inner loop would prevent any shorter overlapping rule from firing. Drop-at-construction composes correctly with the ascending sort. Benchmarks (Apple M4): NilReceiver 0.99 ns/op 0 B/op OneRuleMatching 2.75 ns/op 0 B/op FiveRulesNoMatch 13.5 ns/op 0 B/op * fix(s3api): refresh LifecycleTTL resolver on bucket-config update storeBucketLifecycleConfiguration writes to Entry.Extended via updateBucketConfig, which clones the cached BucketConfig and calls the user fn, then caches the result. The clone inherits the prior LifecycleTTL pointer and nothing rebuilt it from the new XML, so add/replace/delete of a lifecycle policy left the wrong resolver in cache until eviction. Same gap on the meta-log side: peer-driven updates flowed through updateBucketConfigCacheFromEntry without re-deriving the resolver. Centralize the Entry -> derived-field mapping in one helper that resets every Extended-backed field then repopulates from the entry, and call it from getBucketConfig (initial load), updateBucketConfig (after updateEntry succeeds, before caching), and updateBucketConfigCacheFromEntry (meta-log path). Reset is the load-bearing part: deleting the lifecycle XML must yield a nil resolver, since stamping a stale TTL onto subsequent writes is irreversible. * fix(s3api): PostPolicy passes object size, not multipart wire size lifecycleTTLForObjectWrite was reading r.ContentLength, which on the PostPolicy path is the multipart envelope (form fields + boundaries), not the uploaded object body. A size-filtered rule would evaluate against that inflated total and stamp (or skip) a TTL the policy didn't intend. Take the object size as an explicit parameter. PutObject still passes r.ContentLength (correct there); PostPolicy passes the fileSize already extracted from the form part. Negative size means unknown and continues to skip any size-filtered rule. * fix(s3api): treat Object Lock as versioned for lifecycle TTL fast path Object Lock requires versioning at the API level, but it can be enabled at create time without S3 ever writing the explicit Versioning header. The lifecycle resolver construction site only checked Versioning, so an Object-Lock bucket with no Versioning byte would still get a fast-path resolver and stamp volume TTL onto writes — destroying noncurrent versions when the volume expires. Mirror the OR already used in BucketIsVersioned: ObjectLockConfig non-nil counts as versioned for resolver construction. Existing explicit-Versioning paths are unchanged. |
||
|
|
e55db58ca9 |
feat(s3/lifecycle): expose Prometheus metrics (Phase 7) (#9375)
* feat(s3/lifecycle): expose Prometheus metrics (Phase 7)
Five new gauges/counters under the s3_lifecycle subsystem so operators
can see what the worker is doing without grepping logs:
- dispatch_total{bucket,kind,outcome} — every LifecycleDelete RPC
bumps this. Outcome is the proto enum name (DONE, NOOP_RESOLVED,
RETRY_LATER, BLOCKED, …) plus a synthetic "RPC_ERROR" for transport
failures classified as RETRY_LATER.
- schedule_depth{shard} — pending matches in each shard's schedule,
sampled on the dispatcher tick.
- cursor_min_ts_ns{shard} — per-shard min cursor timestamp; lag is
derived as (now - min) by the scrape side.
- events_total{shard} — meta-log events the reader fed to the router.
- bootstrap_dispatch_total{bucket,kind} — bootstrap-walk dispatches.
Test asserts the dispatch counter increments for both DONE and
RPC_ERROR paths.
* fix(stats): purge lifecycle bucket label series in DeleteBucketMetrics
The two new bucket-labeled lifecycle counters
(S3LifecycleDispatchCounter, S3LifecycleBootstrapDispatchCounter)
weren't included in DeleteBucketMetrics, so explicit bucket teardown
left their label series behind — same cardinality leak the existing
counters above already avoid. Tack them onto the same DeletePartialMatch
chain.
|
||
|
|
05d31a04b6 |
fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite
The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.
Three changes that work together:
- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
-runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
shell command runs the full pipeline (reader + scheduler + dispatcher) for the
duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
out for now — versioned-bucket expiration via the worker still needs its own
pass.
Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (
|
||
|
|
159cfc97ce |
feat(s3/lifecycle): classify versioned events by storage path (Phase 5b/1) (#9373)
* feat(s3/lifecycle/router): classify versioned events by storage path Phase 5b first slice. Pass the bucket's Versioned flag from the engine snapshot into buildObjectInfo and: - Recognize <key>.versions/<vid> events as noncurrent versions. IsLatest=false, info.Key strips the .versions/<vid> suffix so a rule's Filter.Prefix matches the user's logical key, and the AWS-visible version_id rides on Match.VersionID for the dispatcher to target a single version on the server. - Read IsDeleteMarker from Extended unconditionally — the engine rejects ExpiredObjectDeleteMarker when NumVersions != 1, so without sibling listing the marker case stays correctly suppressed (a separate PR will add the listing). - Non-versioned buckets keep the existing behavior even when an object literally named "*.versions/v1" exists; Versioned=false short-circuits the path classification. Time-based NoncurrentDays now fires on noncurrent events. NewerNoncurrent and ExpiredObjectDeleteMarker still need sibling listing — left for a follow-up. * fix(s3/lifecycle/router): require ExtVersionIdKey to confirm noncurrent Path classification alone misclassifies a literal-key collision: a versioned bucket holding an object with key "logs/backup.versions/2023" would be flagged noncurrent and have its key stripped to "logs/backup", losing the user's actual rule-prefix-matching path. SeaweedFS doesn't reserve the .versions/ segment, so the path shape is necessary but not sufficient. Add an authoritative confirmation: the entry must declare the same version_id via ExtVersionIdKey (the field SeaweedFS sets when storing a tracked version). Also reject idx==0 paths so ".versions/<vid>" can't yield an empty logical key. Tests: - collision: versioned bucket + .versions/ in literal key + no metadata (and the mismatched-vid variant) → still classified as a current-version object; - root-versions: .versions/v1 (idx==0) → treated as a regular key; - existing noncurrent test now sets ExtVersionIdKey to mirror the storage shape. * fix(s3/lifecycle/router): skip versioned-bucket version-folder events The previous attempt tried to classify <key>.versions/<vid> events as noncurrent versions by storage path. That's broken on three counts: - SeaweedFS stores version files as v_<id> (getVersionFileName), so comparing the path suffix to the raw ExtVersionIdKey never matches. - The "current latest" version on a versioned bucket lives at the same .versions/v_<id> path shape as noncurrent versions; the latest pointer is on the parent .versions/ directory's Extended[ExtLatestVersionIdKey], which the router doesn't see. - Even with a correct vid match, IsLatest=false plus the storage path as ObjectKey would have the dispatcher recompose <storagepath>.versions/v_<id> and no-op (or worse, target the wrong file). Until we route from .versions/ directory pointer-transition events (or supply IsLatest/SuccessorModTime/index from sibling listing), skip every event under a *.versions/ folder. Bare-key events (null versions) still route normally; bootstrap walking covers the versioned-storage cases. Tests assert the skip across tracked, literal-collision, and bucket-root .versions paths. * feat(s3api): refuse noncurrent-kind delete on the current latest version Defense-in-depth for the noncurrent kinds: even when bootstrap (or a future event-driven path) thinks a version is noncurrent, the server must verify against the .versions/ directory's Extended[ExtLatestVersionIdKey] before deleting. If the target version matches the latest pointer the action is silently dropped as NOOP_RESOLVED:VERSION_IS_LATEST instead of deleting the live data. * refactor(s3/lifecycle): tidy versioning gates per review - router: skip directory entries (other than MPU init) in buildObjectInfo so .versions/ folder events never become ObjectInfo. Subtest "versions dir itself" added. - s3api: switch isCurrentLatestVersion's path split from filepath.Split (OS-dependent) to path.Split so filer paths always use '/'. |
||
|
|
935fb42e1d |
chore(weed/util/chunk_cache): remove unused functions (#9372)
* chore(weed/util/chunk_cache): remove unused functions * fix(chunk_cache): bound ReadAt buffer in readNeedleSliceAt When the caller-provided buffer is larger than the remaining needle bytes, ReadAt would spill into the next needle and trigger the n != wanted error. Slice to data[:wanted] so the read stops at the needle boundary. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
fd463155e4 |
fix(ec): planner treats each (server, disk_id) as a distinct target (#9369) (#9371)
* fix(ec): planner treats each (server, disk_id) as a distinct target (#9369) master_pb.DataNodeInfo.DiskInfos is keyed by disk type, so a volume server with multiple physical disks of the same type collapses into a single DiskInfo. Per-disk attribution survives only inside the VolumeInfos[].DiskId / EcShardInfos[].DiskId records, and the active topology never put it back together. The EC planner saw N candidates instead of N×disks, returned a short plan, and createECTargets round-robined extra shards onto the same (server, disk_id) — colliding with the #9185 disk_id-aware ReceiveFile. Reconstruct per-physical-disk view in UpdateTopology by splitting each DiskInfo into one entry per observed disk_id, and index volumes / EC shards by their own DiskId so lookups stay aligned. Refuse to plan an EC task when fewer than totalShards distinct disks are available rather than packing shards onto the same disk. Threads dataShards/parityShards through planECDestinations, createECTargets and createECTaskParams so the helpers don't depend on the OSS 10+4 constants — keeps enterprise merges clean. * trim verbose comments * align EC param signatures with enterprise - dataShards/parityShards: uint32 → int (matches enterprise's ratio API) - drop unused multiPlan from createECTaskParams - minTotalDisks: total/parity+1 → ceil(total/parity), correct for non-default ratios Reduces merge surface when this PR lands in seaweed-enterprise. |
||
|
|
194dce27bf |
fix(mount): preserve user-set mtime through async/periodic flush (#9363) (#9370)
* fix(mount): preserve user-set mtime through async/periodic flush (#9363) flushMetadataToFiler and flushFileMetadata both stamped time.Now() onto the entry before sending it to the filer, clobbering any mtime SetAttr had stored from utimes()/touch -m -d. The reproducer hit this ~1s after touch because the writebackCache deferred close from the prior write ran flushMetadataToFiler after the user's utimes call. Flush has no business inventing timestamps. Move the write-time stamp into Write (where it always belonged for POSIX correctness) and let flush persist whatever Write or SetAttr already put on the entry. * test(mount): tighten mtime regression test, drop tautological one - userMtime now has non-zero nanoseconds, so the *Ns assertions catch a regression that would zero the field. - Add CtimeNs assertion (was missing). - Drop TestWriteStampsEntryMtime: it duplicated the implementation it was supposed to test, so a regression in Write would not have failed it. Driving the real Write path needs a full PageWriter, which is out of scope for this fix; TestFlushFileMetadataPreservesUserMtime is the meaningful regression for #9363. |
||
|
|
89aab30821 |
feat(s3/lifecycle): wire AbortIncompleteMultipartUpload (Phase 5a) (#9368)
* feat(s3/lifecycle/router): emit ABORT_MPU events for .uploads/<id> init dirs Detect a meta-log event at exactly .uploads/<upload_id> (a directory) and build the ObjectInfo from its destination key (entry.Extended[key]) so a rule with Filter.Prefix=foo/ matches an MPU uploading to foo/bar. Sub-events under .uploads/<id>/<part> ride a different mtime and would over-fire the ABORT_MPU schedule, so they're rejected explicitly. m.ObjectKey stays as ev.Key (.uploads/<upload_id>) — the dispatcher needs the upload directory path, not the destination key, to actually remove the in-flight upload. * feat(s3api): wire LifecycleDelete ABORT_MPU to remove the upload dir Replaces the retryLater stub. Validates the .uploads/<upload_id> shape of req.ObjectPath (so a malformed event can't escalate to a wider rm), then deletes the upload directory under <bucket>/.uploads/<id>. Maps NotFound to NOOP_RESOLVED, transport errors to RETRY_LATER, success to DONE. * refactor(s3api): drop redundant exists check before lifecycle ABORT_MPU rm s3a.rm already does a NotFound-returning lookup, so the pre-check just adds a round-trip. Map filer_pb.ErrNotFound to NOOP_RESOLVED on rm, keep transport errors as RETRY_LATER. * refactor(s3/lifecycle/router): use s3_constants for MPU paths + Extended key Drop the hardcoded ".uploads/" and "key" string literals; the symbols already exist as s3_constants.MultipartUploadsFolder and ExtMultipartObjectKey, and the server side reaches them through the same constants. Keeping the test helpers tied to those names also makes the negative-result tests meaningful — they'd otherwise still pass if the lookup constant drifted. * fix(s3api): close lifecycle ABORT_MPU traversal + NOT_FOUND gaps Two issues with the recent ABORT_MPU plumbing: - "." and ".." passed the no-slash check but resolve to the bucket root via util.JoinPath, so .uploads/.. could rm the wrong directory. - filer.DeleteEntry suppresses ErrNotFound and returns success, so the rm path can't distinguish missing from deleted; the previous version reported DONE for an already-aborted upload instead of NOOP_RESOLVED. Reject the two reserved names explicitly and restore the existence pre-check so the outcome map stays correct. Add a table-test covering the rejected paths. * fix(s3/lifecycle/bootstrap): walk MPU init dirs by destination key A real MPU init record is a directory under .uploads/<id> created by mkdir; the bootstrap walker was skipping every directory entry, so an MPU that existed before the meta-log subscription was never aborted. Even with the skip relaxed, MatchPath used the .uploads/<id> path, so a rule with Filter.Prefix=logs/ would never fire on an MPU uploading to logs/foo.txt. Add Entry.DestKey, let IsMPUInit directories through, and use DestKey for both MatchPath and ObjectInfo.Key. A bare init directory with no DestKey means metadata hasn't landed yet — skip rather than guess. * fix(s3/lifecycle): gate (kind, info) shape so MPU init only fires ABORT_MPU An MPU init record carries IsMPUInit=true and IsLatest=false. Without gating, the router and bootstrap walker matched it against every active ActionKey for the bucket, so NONCURRENT_DAYS / NEWER_NONCURRENT fired (IsLatest=false reads as a noncurrent version). The dispatcher would then BLOCK on empty version_id and freeze the cursor. Add a shape gate at both call sites: - IsMPUInit + non-ABORT_MPU kind → continue - regular object + ABORT_MPU kind → continue Plus a defense-in-depth check at the top of EvaluateAction so future callers can't reintroduce the bug. Tests cover all three layers. * test(s3/lifecycle): tighten dual-action coverage at the call sites - Walk multi-action: replace the kinds-as-set check with an exact-shape DeepEqual on (path, kind) tuples. The set check would have missed an MPU init wrongly firing NONCURRENT_DAYS — exactly the regression the (kind, info) gate fixes. - Router: add a converse case for the dual ExpirationDays + AbortIncompleteMultipartUpload rule. A regular current-version object must fire only EXPIRATION_DAYS; without the gate the dispatcher would also receive ABORT_MPU and rm the object via the MPU code path. |
||
|
|
8b87ceb0d1 |
refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration (Phase 4) (#9367)
* refactor(s3api): strip back-stamp from PutBucketLifecycleConfiguration The handler used to walk every existing entry under the rule's prefix and stamp entry.Attributes.TtlSec + the SeaweedFSExpiresS3 flag so that the filer's compaction filter would expire them. With the event-driven lifecycle worker live, that retroactive walk is redundant — the worker drives expiration off the meta-log and a one-time bootstrap scan, so a PUT lifecycle stays O(rules) instead of O(objects). New writes still inherit TTL from the filer.conf location entry above; that volume-routing path is unchanged here and will move to an explicit operator command later (Phase 11). Drops updateEntriesTTL + processDirectoryTTL + processTTLBatch + updateEntryTTL from filer_util.go. * fix(s3api): clear stale lifecycle TTL entries on PUT PutBucketLifecycleConfiguration only ever appended/updated filer.conf entries — it never cleared ones the operator removed, renamed-prefix on, disabled, retagged with a tag filter, or bucket-versioned out of the fast path. The stale day-TTL kept routing new writes (and would expire old ones if any landed under the prefix) after the policy was updated. Treat PUT as a full replacement: walk this bucket's existing day-TTL entries, clear them, then add fresh entries from the new rule set. * test(command): bump mini default plugin job-type count to 7 The s3_lifecycle plugin handler registered in #9362 is the seventh default; the test still asserted six. * fix(s3api): delete stale lifecycle PathConf instead of blanking Ttl Just clearing pathConf.Ttl leaves the rule's Collection, Replication, and VolumeGrowthCount in place, so new writes still match the stale prefix and inherit outdated routing/placement. Use fc.DeleteLocationConf so the lifecycle-owned PathConf goes away entirely. Same fix in DeleteBucketLifecycleHandler, which had the same bug. |
||
|
|
5d43f84df7 |
refactor(plugin): rename detection_interval_seconds → detection_interval_minutes (#9366)
Minutes is the natural granularity for detection cadence — every production handler already set the seconds field to a 60-multiple (17*60, 30*60, 3600, 24*60*60). Switching to minutes drops the *60 arithmetic and matches the unit conventions used elsewhere in the plugin worker forms. - Proto: AdminRuntimeDefaults + AdminRuntimeConfig.detection_interval_* field renamed. - Helpers: durationFromMinutes / minutesFromDuration alongside the existing seconds variants in plugin_scheduler.go. - Handlers: vacuum, ec_balance, balance, erasure_coding, iceberg, admin_script, s3_lifecycle now declare DetectionIntervalMinutes. - Admin: scheduler_status + types + UI templ + plugin_api.go pass through the new field; UI label and table cells switch to "min". |
||
|
|
7f254e158e |
feat(worker/s3_lifecycle): plugin handler with admin UI config (#9362)
* feat(s3/lifecycle): scheduler — N pipelines over an even shard split
Scheduler.Run spawns Workers Pipeline goroutines plus one engine-refresh
ticker. Each worker owns a contiguous AssignShards(idx, total) slice of
[0, ShardCount) and runs Pipeline.Run with EventBudget bounding each
iteration; brief RetryBackoff between iterations avoids hot-loop on
errors. The refresh ticker rebuilds the engine snapshot from the filer's
bucket configs every RefreshInterval.
LoadCompileInputs / IsBucketVersioned / AllActivePriorStates are
exported from a configload.go sibling so the shell command can move to
this shared implementation in a follow-up.
* refactor(shell): reuse scheduler.LoadCompileInputs in run-shard
Drop the local copies of loadLifecycleCompileInputs / isBucketVersioned
/ allActivePriorStates / lifecycleParseError that the new
scheduler package now exports. Same behavior, one source of truth.
* feat(worker/s3_lifecycle): plugin handler with admin UI config
Registers a JobHandler for s3_lifecycle via pluginworker.RegisterHandler.
Admin pulls the descriptor over the worker plugin gRPC and renders the
AdminConfigForm + WorkerConfigForm in the existing UI:
Admin form (cluster shape):
- workers (1..16, default 1)
- s3_grpc_endpoints (comma list)
Worker form (operational tuning):
- dispatch_tick_ms (default 5000)
- checkpoint_tick_ms (default 30000)
- refresh_interval_ms (default 300000)
- event_budget (default 0 = unbounded)
Detect emits a single proposal whenever S3 endpoints + filer addresses
are configured. MaxExecutionConcurrency=1 so admin only ever runs one
lifecycle daemon per worker; a fresh proposal next cycle restarts it
if the prior Execute exits.
Execute dials the configured S3 endpoint + filer, builds a
scheduler.Scheduler with the parsed config, and runs it until
ctx cancellation. Reuses the existing scheduler / dispatcher /
reader / engine packages — the handler is the thin glue that
parses descriptor values and wires the long-running daemon.
* proto(plugin): add s3_grpc_addresses to ClusterContext
So workers can dial s3 servers discovered by the master rather than a
hand-typed list in the admin form.
* feat(admin): populate ClusterContext.s3_grpc_addresses from master
ListClusterNodes(S3Type) returns the live S3 servers; the plugin
scheduler now hands these to job handlers alongside filer/volume
addresses.
* feat(worker/s3_lifecycle): discover s3 endpoints from cluster context
Drop the s3_grpc_endpoints admin form field and read the master-supplied
ClusterContext.S3GrpcAddresses instead. Operators no longer maintain a
hand-typed list, and a stale entry self-heals when the master's view
updates.
* feat(worker/s3_lifecycle): time-based runtime cap, friendlier cadence units
- dispatch_tick_minutes (was *_ms): minutes is the natural granularity
for a daily batch; default 1 minute.
- checkpoint_tick_seconds: seconds for the durable cursor write; default
30 seconds.
- refresh_interval_minutes: minutes for the engine snapshot rebuild.
- max_runtime_minutes replaces event_budget. Each daily run is bounded
by wall clock — typical run wraps in well under an hour because the
cursor persists and the meta-log streams fast. Default 60 minutes.
- AdminRuntimeDefaults.DetectionIntervalSeconds = 86400 so the admin
schedules one job per day.
|
||
|
|
85abf3ca88 |
feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume. Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence. Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket. * test(s3/lifecycle): integration test driving the run-shard shell command Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains. The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end. Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function. * fix(shell/s3.lifecycle.run-shard): address review findings - Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run. - Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell. - Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short. - Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong. * feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline. - reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing. - dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out. - shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount. Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop. * fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error). * fix(s3/lifecycle): schedule from ModTime not eventTime A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch. * fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline. - weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves. - weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals). * test(s3/lifecycle): pass -runtime 10s to run-shard Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards. * refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it. * fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted. * fix(s3/lifecycle): identity HeadFid via GetFileIdString Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY. * fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly. * fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done. * chore(test/s3/lifecycle): drop emoji from Makefile output * docs(test/s3/lifecycle): correct '-shards 0-15' wording * fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99. * fix(s3/lifecycle): bound drain + final-save with their own timeouts Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning. * fix(test/s3/lifecycle): drop self-killing pkill in stop-server The pkill pattern \"weed mini -dir=...\" is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own. * docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var |
||
|
|
c918660901 | build(deps): bump io.netty:netty-transport-native-epoll from 4.1.132.Final to 4.2.13.Final in /test/java/spark (#9365) | ||
|
|
9cb103cd35 | build(deps): bump github.com/apache/thrift from 0.22.0 to 0.23.0 (#9364) | ||
|
|
b7928637a0 |
refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml (#9360)
* refactor(s3api): move Lifecycle XML structs to leaf package lifecycle_xml The structs S3 PutBucketLifecycleConfiguration parses and the canonical conversion to s3lifecycle.Rule lived in package s3api, which transitively imports weed/server (which imports weed/shell). Any caller outside weed/s3api — the shell, the future lifecycle worker — that wanted to parse a bucket's lifecycle XML hit an import cycle. Moves: weed/s3api/s3api_policy.go -> lifecycle_xml/types.go weed/s3api/s3api_lifecycle_canonical.go -> lifecycle_xml/canonical.go s3api_lifecycle_canonical_test.go -> lifecycle_xml/canonical_test.go s3api_policy_test.go -> lifecycle_xml/round_trip_test.go Renames the public RuleStatus type (was unexported ruleStatus) and adds small accessor methods (Set/Val/AndSet/TagSet) for fields the s3api handler needs to read across the package boundary. Adds NewPrefix and NewExpirationDays constructors so the GET handler can build response rules without poking at unexported fields. Adds a Tag struct local to the package so it has zero internal seaweed deps. Adds a one-shot ParseCanonical(xml []byte) helper for non-server callers. s3api_policy.go was misnamed — its content is lifecycle XML, not S3 bucket policy. The new package name reflects the actual scope. * test(s3api/lifecycle_xml): exercise public API in tests - canonical_test.go's parseLifecycle helper went through xml.Unmarshal directly; route it through the package's exported Parse so tests validate the public entrypoint. - round_trip_test.go asserted internal flags (rule.Filter.tagSet, rule.Filter.andSet, Transition.set, NoncurrentVersionTransition.set); switch to TagSet(), AndSet(), Set() — exercises the public contract that downstream callers (s3api handler, future shell command) rely on. |
||
|
|
c567da7164 |
feat(s3): register SeaweedS3LifecycleInternal gRPC service (#9359)
Phase 2 added the LifecycleDelete handler on S3ApiServer but never registered it on a running gRPC server, so workers had no endpoint to dial. Embed UnimplementedSeaweedS3LifecycleInternalServer on S3ApiServer and register it on the s3 command's grpc server alongside SeaweedS3IamCacheServer. |
||
|
|
35e3fe89bc |
feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358)
* feat(s3/lifecycle): filer-backed cursor Persister FilerPersister persists per-shard cursor maps as JSON to /etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler. One file per shard keeps Save atomic — the filer writes the entry in a single mutation, so a crash mid-write doesn't leak partial state. Pipeline.Run loads on start; the periodic checkpoint and graceful-shutdown save go through this implementation. A small FilerStore interface wraps the SeaweedFilerClient surface the persister needs, so tests inject an in-memory fake instead of mocking the full gRPC client. * refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block A frozen cursor doesn't advance, so the durable cursor (FilerPersister) encodes the blocked state on its own. On worker restart the reader re-encounters the poison event at MinTsNs, the dispatcher walks the same retry budget to BLOCKED, and the cursor freezes at the same EventTs. Other in-flight events between freeze tsNs and prior cursor positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the underlying objects were already deleted on the prior pass. Removed: - BlockerStore interface + InMemoryBlockerStore + BlockerRecord - Dispatcher.Blockers + Dispatcher.ReplayBlockers - the BlockerStore.Put call in handleBlocked - Pipeline.Blockers field + the ReplayBlockers call on startup Added a TestDispatchRestartReFreezesNaturally that pins the self-recovery property: a fresh Dispatcher with a fresh Cursor, fed the same poison event, reaches the same frozen state at the same EventTs without any durable blocker store. Operator visibility: a cursor whose MinTsNs hasn't advanced is the signal — surfaced via the durable cursor file. * refactor(filer): SaveInsideFiler accepts ctx ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background() internally and silently dropped the caller's ctx. Symmetric API now; cancellation/deadlines propagate through LookupEntry / CreateEntry / UpdateEntry. Mechanical update of all callers — most pass context.Background() since the existing call sites have no ctx in scope. * fix(s3/lifecycle): deterministic order in cursor save Iterating Go maps yields random order, so json.Encode produced a different byte sequence on each save even when the state hadn't changed. Sort entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk file diffs cleanly. New test pins byte-identical output across two saves of the same map. * fix(s3/lifecycle): log reason when freezing cursor in handleBlocked handleBlocked dropped the reason via _ = reason with a comment claiming the caller logged it; none of the three callers do. A frozen cursor is the only surface where the operator finds out something stuck, so the reason has to land somewhere. glog.Warningf with shard, key, eventTs, and the original reason — same shape the rest of the package uses. |
||
|
|
ec83a87d68 |
perf(s3/lifecycle): defer pool Put on ShardID hasher
defer guarantees the hasher returns to the pool even if h.Write or h.Sum panic, preventing pool leak under unexpected failure modes. |
||
|
|
3a192c6c57 |
fix(s3/lifecycle): address Phase 3 post-merge review (#9354 #9355 #9356) (#9357)
* fix(s3/lifecycle): reader handles bare /buckets parent and pre-normalizes prefix extractBucketKey accepted /buckets/ but rejected /buckets (no trailing slash); some delete events emit the bare form, so bucket-root events were silently dropped. Pre-normalize BucketsPath once on Run instead of recomputing per event. * perf(s3/lifecycle): pool sha256 hashers in ShardID ShardID runs on every meta-log event before the shard filter; a fresh sha256.New per call produces measurable allocator pressure under load. sync.Pool reuses hashers across calls. * fix(s3/lifecycle): router skips hard deletes and missing-attribute events A hard delete carries no schedule-relevant state — Expiration would hit NOOP_RESOLVED at dispatch and ExpiredObjectDeleteMarker fires from a Create on the latest version. Skip rather than burn a schedule slot. Missing Attributes leaves ModTime at year 0001, which makes ExpirationDays fire immediately at dispatch. Skip the event instead. Drop the unused 'versioned' parameter from buildObjectInfo; the dispatcher's identity-CAS handles version drift in Phase 5. * fix(s3/lifecycle): EntryIdentity.MtimeNs holds true nanoseconds Both computeEntryIdentity (server) and buildIdentity (router) wrote entry.Attributes.Mtime (seconds) into a field named MtimeNs. The CAS worked because both sides agreed, but the encoding contradicted the field name and would break if either side later started using true nanoseconds. Combine Mtime*1e9 + the FuseAttributes.MtimeNs nanosecond component on both sides; the test was updated to match. * fix(s3/lifecycle): dispatcher distinguishes ctx cancel from transport errors A canceled or deadline-exceeded RPC is shutdown, not a transport failure: re-queue the Match at its original DueTime with no retry-budget burn so a quick restart can't escalate it to BLOCKED. * fix(s3/lifecycle): reader fallback prefix normalization mirrors Run The fallback path that builds prefix from r.BucketsPath when bucketsPathSlash is empty (test-only entry into extractBucketKey) was appending an unconditional '/', producing '//' if BucketsPath already ended with one. Use the same normalization Run does. * fix(s3/lifecycle): ObjectInfo.ModTime carries the nanosecond component ModTime dropped FuseAttributes.MtimeNs, leaving ExpirationDays one nanosecond off relative to EntryIdentity.MtimeNs. Pass both to time.Unix so the precision matches the CAS witness. |
||
|
|
5c991f38f5 |
feat(s3/lifecycle): dispatcher + per-shard pipeline (Phase 3 PR-D) (#9356)
feat(s3/lifecycle): dispatcher + blocker store + per-shard pipeline Dispatcher consumes due Matches from the schedule, calls LifecycleDelete, and routes outcomes: DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK -> Cursor.Advance RETRY_LATER (within budget) -> re-schedule with backoff RETRY_LATER (budget exhausted) / BLOCKED -> BlockerStore.Put + Freeze BlockerStore is a small interface with InMemoryBlockerStore for tests; the filer-backed impl follows when the worker task registration lands. Pipeline composes Reader + Router + Dispatcher into a single Run loop keyed by shard. Cursor is restored on start, blockers are replayed as freezes, checkpoints write at a configurable cadence, and a final save fires on shutdown. The meta-log itself is the durable buffer for in-flight schedule entries — restart re-derives them from the cursor's MinTsNs. |
||
|
|
8425c42858 |
feat(s3/lifecycle): event router + schedule (Phase 3 PR-C) (#9355)
feat(s3/lifecycle): event router + DueTime schedule Router consumes per-shard reader events, looks up matching ActionKeys via the engine's BucketActionKeys index, and emits Matches with DueTime = event_time + action.Delay. Evaluation runs at DueTime so the age gate passes for fresh events; the dispatcher's identity-CAS catches drift. Schedule is a min-heap by DueTime; duplicates allowed (RPC CAS handles the redundant dispatch as NOOP_RESOLVED). BucketActionKeys accessor added to engine.Snapshot. |
||
|
|
0f6c6b0524 |
feat(s3/lifecycle): shard-aware meta-log reader (Phase 3 PR-B) (#9354)
feat(s3/lifecycle): shard-aware meta-log reader - ShardCount=16; ShardID(bucket,key)=top-4-bits of sha256(bucket||/||key) - Reader subscribes via SubscribeMetadata starting at Cursor.MinTsNs(), filters events by shard, emits to caller-owned Events channel - Cursor: per-(shard, ActionKey) position with monotonic Advance, Freeze for blocked actions, MinTsNs for subscription resume - Persister interface with InMemoryPersister for tests; filer-backed impl lands with the worker integration |
||
|
|
3a76c0b027 |
feat(worker/proto): per-shard READ — add S3LifecycleParams.shard_id (#9353)
READ moves from a cluster-singleton to one task per (bucket, key-prefix-hash) shard. Cluster has 16 shards; workers receive one READ task per owned shard. Required for READ; ignored for BOOTSTRAP and DRAIN. |
||
|
|
4f79d8e358 |
feat(s3/lifecycle): bucket-level bootstrap walker (#9350)
* feat(worker): add TaskTypeS3Lifecycle constant Single job type for the lifecycle worker; the S3LifecycleParams.Subtype field (READ / BOOTSTRAP / DRAIN) dispatches inside the handler. The "s3_lifecycle" string is already wired to LaneLifecycle in admin/plugin/scheduler_lane.go so adding the constant doesn't change runtime behavior — it lets future commits reference the type name without sprinkling string literals. * feat(s3/lifecycle): bucket-level bootstrap walker Iterates entries in a bucket, evaluates every active ActionKey in the engine snapshot against each entry, and dispatches inline-delete for currently-due actions. Date-kind actions and pending_bootstrap actions are skipped — the former are handled by their own SCAN_AT_DATE bootstrap, the latter aren't IsActive() yet. Walker is callback-driven so callers supply the listing source (real filer_pb.SeaweedList or test fake) and the dispatcher (real LifecycleDelete client or test fake). This keeps the walker free of filer_pb dependencies and makes the per-action evaluation flow unit-testable in isolation. Checkpoint state (LastScannedPath, Completed) is returned to the caller, who is responsible for persisting it under /etc/s3/lifecycle/<bucket>/_bootstrap. Walk() honours opts.Resume so a kill-resumed task picks up where the previous walker stopped. Tests cover: prefix-mismatched skip, not-yet-due skip (reader's job), date-kind skip, pending_bootstrap skip, multi-action rule (one rule with three actions dispatches three times — the regression that per-action keying fixes), dispatch error halts at last-successful checkpoint, Resume skips entries up to and including the resume path. * test(s3/lifecycle): walker test uses bucket-scoped ActionKey Mechanical follow-up to the bucket-scoped ActionKey on lifecycle-engine: the bootstrap walker tests construct ActionKeys to seed PriorStates and need the Bucket field to match what engine.Compile keys against. * fix(s3/lifecycle): walker quick wins Two minor cleanups noted on review: - Drop the redundant Resume re-filter inside the Walk callback. ListFunc's contract already promises "skip entries with Path <= start"; trusting that contract avoids divergence if the filter logic ever changes on one side and not the other. - Hoist the ObjectInfo allocation out of the per-action loop in walkEntry. Multi-action rules previously allocated one ObjectInfo per (entry, kind) pair; now it's one per entry, reused across all matching kinds. * fix(s3/lifecycle): walker Entry.NoncurrentIndex tracks ObjectInfo's *int ObjectInfo.NoncurrentIndex is now *int so unset is unambiguous; mirror that on bootstrap.Entry so the per-entry construction stays type-clean. Phase 5 (versioned-bucket walks) is the first caller that will populate the field. * refactor(s3/lifecycle): trim narration from bootstrap walker Drop the inline step-by-step on Walk and the multi-paragraph package preamble; the function names already say it. Keep one-liner WHYs at the SCAN_AT_DATE skip and the once-per-entry ObjectInfo build. * fix(s3/lifecycle): walker skips directories and ModeDisabled actions Two safety findings from review: 1. SeaweedFS directory entries can appear in the listing alongside objects; without an IsDirectory check the walker would treat a dir like any other entry and could dispatch a delete against it. Add IsDirectory to bootstrap.Entry and short-circuit it before walkEntry. 2. ModeDisabled is set by the operator (e.g. shell pause) independent of the XML rule's Status field. EvaluateAction gates on Status and would still fire for an operator-disabled action whose XML status is "Enabled". Skip ModeDisabled explicitly in walkEntry alongside the existing SCAN_AT_DATE skip. Two regression tests pin both cases. * perf(s3/lifecycle): reuse ObjectInfo across walker entries Walker allocated one ObjectInfo struct per entry. For buckets with millions of objects that's measurable GC pressure. Hoist the allocation out of the per-entry callback (one per Walk) and reuse via field assignment in walkEntry. EvaluateAction reads ObjectInfo synchronously and doesn't retain a reference, so the reuse is safe — the next iteration's overwrite can't corrupt an in-flight evaluation. * refactor(s3/lifecycle): trim narration on walker Drop the multi-line Entry / ObjectInfo-reuse / SCAN_AT_DATE+DISABLED explanations. The walker's structure is small enough that the condition itself reads as the documentation. |
||
|
|
5ab3860005 |
feat(s3/lifecycle): LifecycleDelete RPC server (#9349)
* feat(s3/lifecycle): SeaweedS3LifecycleInternal.LifecycleDelete RPC
Adds the worker-to-S3 RPC the lifecycle worker calls to execute one
(rule, action) verdict against one entry. The S3 server is the only
component allowed to mutate filer state, so it gets the final word:
re-fetch, identity CAS, object-lock check, dispatch.
LifecycleDeleteRequest carries the routing tuple (bucket, object_path,
version_id, rule_hash, action_kind), the per-stream context echoed
into BlockerRecord on FATAL outcomes, and an EntryIdentity CAS witness
that lets the server detect mid-flight drift (mtime, size, head fid,
sorted-Extended hash).
LifecycleDeleteOutcome covers all five worker-actionable verdicts:
DONE / NOOP_RESOLVED / SKIPPED_OBJECT_LOCK / RETRY_LATER / BLOCKED.
Cursor-advance / pending-mutation rules per outcome are documented
inline.
Makefile updated to emit the gRPC stubs alongside the message types.
* feat(s3/lifecycle): LifecycleDelete server handler
Server-side handler for the worker-to-S3 RPC. Five-step flow:
1. Re-fetch live entry (NOT_FOUND -> NOOP_RESOLVED).
2. CAS on EntryIdentity (mtime / size / head fid / sorted-Extended
sha256). Mismatch -> NOOP_RESOLVED with reason STALE_IDENTITY.
3. enforceObjectLockProtections with governanceBypassAllowed=false.
Lifecycle never bypasses legal-hold or compliance retention; the
safety scan re-attempts after the hold lifts. Lock refusal ->
SKIPPED_OBJECT_LOCK (logged + counted, cursor advances).
4. Dispatch by action_kind:
- EXPIRATION_DAYS / EXPIRATION_DATE: createDeleteMarker on
versioned buckets, deleteUnversionedObjectWithClient otherwise.
- NONCURRENT_DAYS / NEWER_NONCURRENT / EXPIRED_DELETE_MARKER:
deleteSpecificObjectVersion (the marker is just a version).
- ABORT_MPU: stub returning RETRY_LATER pending Phase 5 wiring.
5. Helper failures the server can't classify as transient -> BLOCKED
with reason "FATAL_EVENT_ERROR: <detail>"; the worker writes a
durable BlockerRecord and pauses the failing stream. Filer fetch
transport errors -> RETRY_LATER (sustained transients eventually
promote to BLOCKED via the worker's retry budget).
hashExtended uses length-prefixed encoding so a forged Extended
payload (e.g. one tag value containing the byte sequence of two real
tags) can't collide with a real two-tag map. Regression test pins
this.
Tests cover identity-comparison fields, hashExtended order/delimiter
stability, empty-request rejection. Full filer-integration tests come
in Layer 2.
* fix(s3/lifecycle): use stdlib bytes.Equal for ExtendedHash compare
Replaces a hand-rolled byte slice comparator with bytes.Equal —
idiomatic, slightly faster (the stdlib version is intrinsified on
amd64/arm64), and one fewer test surface to maintain.
* refactor(s3api): trim narration from lifecycle RPC + canonicalizer
Drop step-by-step doc-block narrations on LifecycleDelete and the
helpers that reproduced what the code already says. Keep WHY one-
liners at non-obvious spots: the http.Request=nil safety claim,
why hashExtended is length-prefixed, the safety-scan revisit
contract for SKIPPED_OBJECT_LOCK, the TODO marker for ABORT_MPU.
* refactor(s3/lifecycle): retryLater helper, drop inline literals
Adds retryLater() alongside done/noopResolved/blocked so the four
LifecycleDeleteOutcome constructions look the same. Replaces the two
inline RETRY_LATER literal returns (live-fetch transport error and
the ABORT_MPU stub) with the helper. No behavior change.
* fix(s3/lifecycle): default filer-error classification to RETRY_LATER
Versioning lookup, createDeleteMarker, deleteUnversionedObjectWithClient,
and deleteSpecificObjectVersion all do filer round-trips. Most failures
are transient (filer unavailable, slow, network blip), not deterministic
per-event errors. Returning BLOCKED for them was too aggressive: BLOCKED
halts the stream until an operator intervenes, which would surface
on every transient filer hiccup.
Default these paths to RETRY_LATER. The worker's retry budget already
promotes sustained transients to BLOCKED after a configured threshold,
which is the right place for that escalation. BLOCKED stays for the
truly deterministic error: the request-shape check (missing version_id
on noncurrent delete) and the unknown-action-kind dispatch fallback.
* fix(s3/lifecycle): honor versioning Suspended on current-version expiration
Treating Suspended-versioning buckets like never-versioned ones was
wrong: per AWS S3, current-version expiration on Suspended must
remove the existing null version and insert a new delete marker (the
same behavior a user DELETE produces). The previous code branched
only on isVersioningEnabled() — which returns false for Suspended —
and dropped through to the actual-delete path, losing the version
history that Suspended buckets are still expected to keep.
Switch to getVersioningState() and dispatch on three branches:
Enabled -> createDeleteMarker (current becomes non-current)
Suspended -> deleteSpecificObjectVersion("null"), createDeleteMarker
Off / never configured -> deleteUnversionedObjectWithClient
The Suspended path's null-version deletion tolerates NotFound (the
null version may not exist when this is the first delete after a
versioning toggle); other errors classify as RETRY_LATER like the
rest of the dispatch.
* refactor(s3/lifecycle): trim narration on LifecycleDelete
Drop the 6-line versioning-state docblock and the inline
"transient -> RETRY_LATER" rationales; keep one-liners.
* fix(s3/lifecycle): bucket-not-found is NOOP; include ErrObjectNotFound
Three follow-ups from review (the rest were already addressed by
prior commits):
- Versioning lookup now distinguishes filer_pb.ErrNotFound
(BUCKET_NOT_FOUND -> NOOP_RESOLVED) from genuinely transient
failures (RETRY_LATER). A bucket deleted between live-fetch and
versioning-lookup is benign, not transient.
- deleteUnversionedObjectWithClient and deleteSpecificObjectVersion
now also recognise ErrObjectNotFound as a resolved-NOOP, matching
the live-fetch step's classification.
|
||
|
|
7f2b20d577 |
feat(s3/lifecycle): policy engine — XML conversion, Compile, decideMode, Match (#9348)
* feat(s3/lifecycle): XML lifecycle config to canonical Rule
LifecycleToCanonical takes a parsed *Lifecycle and returns
[]*s3lifecycle.Rule, the flat shape the engine compiles against.
Filter resolution mirrors AWS: <And> sub-elements (Prefix + Tags +
size filters) flatten into the canonical Rule's individual fields;
single <Tag> filter populates FilterTags with one entry; <Prefix>
filter takes precedence over the rule's top-level <Prefix>.
Multi-action rules (Expiration + NoncurrentVersion + AbortMPU on
the same XML <Rule>) populate every action field they declare.
RuleActionKinds expands the canonical rule into its compiled actions
downstream.
* feat(s3/lifecycle): engine snapshot skeleton + ActionKey type
Defines s3lifecycle.ActionKey{rule_hash, action_kind} as the engine's
primary identity, and adds the engine package's Snapshot type.
Snapshot is immutable after Compile (atomic-swapped on rebuild) and
holds the ActionKey-keyed routing indexes:
- originalDelayGroups: map[time.Duration][]ActionKey
- predicateActions: []ActionKey
- dateActions: map[ActionKey]time.Time
- actions: map[ActionKey]*CompiledAction
CompiledAction.engineState is an atomic.Uint32 so MarkActive (called
after the durable bootstrap_complete + mode write commits) is visible
to in-flight reader passes without a recompile. The reader filters on
IsActive() before dispatching, so stale-snapshot dispatches are
prevented.
No callers yet; downstream commits add Compile, decideMode, and the
Match functions.
* feat(s3/lifecycle): decideMode + retention gate
decideMode picks the scheduling mode for one (rule, kind) compiled
action. Disabled rule -> DISABLED; EXPIRATION_DATE -> SCAN_AT_DATE;
reader-driven kind whose eventLogHorizon + bootstrapLookbackMin
exceeds metaLogRetention -> SCAN_ONLY; otherwise EVENT_DRIVEN. The
gate runs per (rule, kind), so a 90d ExpirationDays sibling can
degrade to scan_only while its 7d AbortMPU sibling stays active.
MetaLogRetention=0 is treated as "unbounded" — matches the SeaweedFS
default (Phase 0 verified that meta-log files are written without
TtlSec by default), so the gate doesn't trip until an operator opts
in to volume-TTL pruning of /topics/.system/log/.
RuleMode is a Go-level enum here, separate from the wire-form
LifecycleState.RuleMode in the proto package; the worker maps between
them when reading/writing the durable state file.
* feat(s3/lifecycle): Compile builds the engine snapshot per-action
Compile produces a fresh Snapshot from per-bucket canonical rules.
Each input rule expands into N CompiledActions via RuleActionKinds;
mode comes from decideMode; activation requires both
bootstrap_complete (from PriorStates) and mode==EVENT_DRIVEN.
Routing indexes are populated by mode:
- SCAN_AT_DATE: always indexed in dateActions (detector schedules at
rule.date regardless of bootstrap status; the action runs once on
the date and is then done).
- EVENT_DRIVEN + active: indexed in originalDelayGroups (and in
predicateActions when the rule has tag/size filters).
- SCAN_ONLY / DISABLED / pending_bootstrap: not indexed; safety-scan
tick or operator action handle these.
snapshot_id is monotonic per process; pending writes stamp it. The
new snapshot replaces the engine's atomic pointer; in-flight reader
passes continue against their loaded snapshot.
Tests cover: single-action rule, multi-action expansion (one rule ->
three CompiledActions with three distinct delay groups), pending
bootstrap exclusion from indexes, retention gate, sibling actions
degrading independently under partial retention, ExpirationDate path,
disabled rule, MarkActive flipping IsActive(), Compile producing
monotonic snapshot ids.
* feat(s3/lifecycle): MatchOriginalWrite / MatchPredicateChange / MatchPath
The reader feeds events through the engine's match functions to find
the active ActionKeys whose filter applies. The minimal Event shape
the engine takes (bucket, path, tags, size, IsLatest, IsDeleteMarker,
IsMPUInit) keeps engine free of filer_pb dependencies; the reader
extracts these fields from the persisted *filer_pb.LogEntry payload
in Phase 3.
- MatchOriginalWrite: per-delay-group sweep entry. Filters on shape =
EventShapeOriginalWrite, prefix, tag, size, then per-kind shape
gating (ABORT_MPU only on IsMPUInit; EXPIRED_DELETE_MARKER only on
IsLatest+IsDeleteMarker).
- MatchPredicateChange: single near-now sweep. Returns only the
predicate-sensitive subset of active ActionKeys.
- MatchPath: bucket-level walker entry. Returns every active action
whose filter matches; bootstrap iterates these per object and calls
EvaluateAction per kind.
All filter on a.IsActive() at routing time so MarkActive flips become
visible without recompile.
* fix(s3/lifecycle): scope ActionKey by bucket; defensive copies; tidy compile
Three findings on the engine PR addressed:
1. Critical (cross-bucket collision): ActionKey was {RuleHash, ActionKind}
only. Two buckets with rules whose XML is identical produce the same
RuleHash; the second bucket's Compile would overwrite the first
bucket's CompiledAction in snap.actions. Add Bucket to ActionKey
so the engine's identity matches the on-disk path layout
/etc/s3/lifecycle/<bucket>/<rule_hash>/<action_kind>/. Regression
test pins it.
2. Major (immutability leak): OriginalDelayGroups, PredicateActions,
DateActions returned the snapshot's internal maps/slices by
reference, letting an external caller mutate routing state and
break the documented immutability contract. Return defensive
copies.
3. Minor (redundant condition): mode==EVENT_DRIVEN already implies
kind != EXPIRATION_DATE because decideMode routes the date kind
to SCAN_AT_DATE. Drop the redundant check.
Tests updated to construct ActionKey with the new Bucket field.
* fix(s3/lifecycle): drop size filters from rulePredicateSensitive
An object's size is immutable once written: any content change is a
fresh write that flows through the original-write stream, not the
predicate-change one. Tagging rules really can flip post-PUT
(operator adds/removes a tag without rewriting), so they belong; size
filters do not.
Including size filters here was adding rules to predicateActions for
no purpose — every predicate-change sweep would waste cycles
re-evaluating size predicates that physically can't have changed.
* perf(s3/lifecycle): pre-sort AllActions at Compile time
Snapshot is immutable after Compile (engineState bit-flips don't
change membership), so the (bucket, rule_hash, action_kind) ordering
is stable for the snapshot's lifetime. Build the sorted slice once
and serve every AllActions() call from it; drop the per-call
sort.Slice. The bootstrap walker is the primary caller and may
iterate this on every task entry.
* docs(s3/lifecycle): note the FilterSizeGreaterThan=0 ambiguity
Per AWS S3 spec, <ObjectSizeGreaterThan>0</ObjectSizeGreaterThan>
explicitly excludes 0-byte objects, but with the int64 zero value as
the unset sentinel we can't distinguish that from omitted-and-default.
Document the limitation inline so a future deployment that needs the
distinction can switch to *int64 (or a paired set-bool) and update
the matchers / RuleHash accordingly. Not fixing now: the explicit-zero
configuration is unusual, the canonical Rule shape mirrors the same
zero-as-unset convention as s3api.Filter, and a structural fix
touches every filter-using site (evaluator, due_at, match, RuleHash).
* fix(s3/lifecycle): make ObjectInfo.NoncurrentIndex *int
The previous int field had a zero-value collision: 0 is both "newest
non-current version" (a valid index) and "uninitialised by ObjectInfo{}
literal." A caller who built &ObjectInfo{IsLatest: false} without
explicitly setting NoncurrentIndex would have it implicitly read as
"newest non-current," and the count-based NewerNoncurrent retention
would use that bogus 0 to decide eligibility.
Switch to *int so nil is explicitly "not a non-current version /
index not yet computed." The evaluator's NoncurrentDays and
NewerNoncurrent paths conservatively return ActionNone when the
index is nil — the safety scan will revisit once the index is
supplied. This removes a class of latent footguns in test setup and
in any future code path that constructs ObjectInfo without a
versioning-aware builder.
idx() helper added in tests to keep the call sites a one-liner.
* refactor(s3/lifecycle): trim narration from engine + helpers
Drop "what" comments where well-named identifiers already say it
(IsActive, MarkActive, AllActions, etc.); collapse multi-paragraph
"why" docs to one-liners where the design rationale is already in
the design doc. Keep WHY comments only at non-obvious load-bearing
spots: the routing-index activation predicate, the *int rationale on
NoncurrentIndex, the field-tag namespace in RuleHash, the SmallDelay
horizon rule.
Files: action_kind.go, rule.go, rule_hash.go, evaluate.go, due_at.go,
min_trigger_age.go, event_log_horizon.go, engine/engine.go,
engine/compile.go, engine/match.go, engine/mode.go.
No behavior change; tests untouched and pass.
* fix(s3/lifecycle): durable PriorState.Mode wins over decideMode
PriorState.Mode was declared but never read; Compile recomputed mode
via decideMode and stored that on every CompiledAction. Effect: an
action durably persisted as SCAN_ONLY (lag fallback or operator
pause) or DISABLED would silently re-promote to EVENT_DRIVEN on the
next engine rebuild as soon as decideMode's XML+retention predicate
said so. Defeats the durability of mode state.
Use prior.Mode when set; fall through to decideMode only for new
actions (no prior at all) and for legacy entries persisted before
Mode existed (zero value). Regression test pins both branches.
* fix(s3/lifecycle): MarkActive routability — index every EVENT_DRIVEN key
MarkActive's documented contract was "flip visible without a
recompile," but the routing indexes (originalDelayGroups,
predicateActions) were only populated when active && mode ==
EVENT_DRIVEN at compile time. So a key compiled with
BootstrapComplete=false would never enter the indexes; a later
MarkActive flipped engineState but MatchOriginalWrite /
MatchPredicateChange iterated the indexes and never saw the key.
Only MatchPath (which walks bi.actionKeys) and DateActions worked.
Index every EVENT_DRIVEN key regardless of `active`. The runtime
IsActive() filter inside filterMatching already gates dispatch, so
inactive entries are matched-but-not-fired; flipping MarkActive
makes them routable without recompile, matching the documented
contract.
Tests updated: TestCompile_BootstrapPendingIndexedButInactive
asserts the indexed-but-inactive shape; TestMatchOriginalWrite_MarkActiveBecomesRoutable
asserts a MarkActive flip routes the next match.
* test(s3/lifecycle): pin nil NoncurrentIndex no-op behavior
Two regression tests for the *int pointer migration: nil index
combined with NewerNoncurrent (either paired with NoncurrentDays or
standalone) must short-circuit to ActionNone rather than guess at
the version's position in the keep-N window.
* refactor(s3/lifecycle): trim follow-up narration on engine + helpers
Comments accumulated since the last sweep — the durable-Mode rationale,
the MarkActive routability note, the routing-index doc, the
NoncurrentIndex pointer rationale, and the EvaluateAction docblock.
Trimmed each to one or two terse lines; the underlying contracts live
in the design doc.
* docs(s3/lifecycle): note CompileInput one-per-bucket invariant
|
||
|
|
b9bf45cb2e |
fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347)
* fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection Closes #9345. -volumeId only filtered which volume .idx files were pulled; the filer-side BFS still walked from "/", printing every directory under -v and making it look like the flag was ignored. When all requested volumes share a single non-empty collection that maps to an existing <bucketsPath>/<collection> directory, restrict the BFS root to that bucket. Empty-collection volumes or multi-collection selections fall back to the full walk, since chunks for those can live anywhere. * trim comments * address review: collapse getCollectFilerFilePath; unshadow receiver in loop |
||
|
|
4e10669221 |
docs(s3/lifecycle): event-driven redesign (#9346)
* docs(s3/lifecycle): event-driven redesign
Replaces the synchronous PUT-handler walk with an event-driven worker
model: meta-log reader subscribed to one filer, client-side heap merge
with per-filer-shard MessagePosition cursors, bucket-level bootstrap
with inline delete, blocked-cursor handling for fatal events, durable
retry budget for sustained-transient promotion, retention mode gate
that downgrades reader-driven rules to scan_only when log retention
falls below the rule's event-log horizon.
* docs(s3/lifecycle): record Phase 0 verified assumptions
ReadPersistedLogBuffer payload carries Extended (event marshaled via
SubscribeMetadataResponse → ToProtoEntry). Meta-log files at
topics/.system/log/<date>/<HH-MM>.<filerId> are written without TtlSec
in filer_notify_append.go; retention is unbounded by default and only
shrinks if an operator sets a filer.conf rule with a TTL on the
SystemLogDir prefix. .versions/ filenames are v_<16h-ts><16h-rand>
with old/new (inverted) format distinguished by threshold
0x4000000000000000; getVersionTimestamp / compareVersionIds give
format-agnostic ordering for successor-version discovery.
* feat(s3/lifecycle): rule evaluator and dueAt helper
Evaluate(rule, info, now) returns the EvalResult by object shape:
IsMPUInit -> AbortMultipartUpload, IsDeleteMarker -> ExpireDeleteMarker
when sole survivor, IsLatest -> DeleteObject (Days or Date), non-current
-> DeleteVersion (Days fallback to ModTime when SuccessorModTime is
zero; NewerNoncurrent retention enforced when both are set).
ComputeDueAt mirrors Evaluate's shape and returns the earliest eligible
wall-clock time for the same (rule, info), used by the reader/bootstrap
to decide pending vs inline-delete.
Adds StatusEnabled/Disabled and SmallDelay consts and an IsMPUInit
flag on ObjectInfo so .uploads/<id>/ entries route to AbortMPU without
overloading IsLatest.
No callers; package compiles standalone.
* feat(s3/lifecycle): MinTriggerAge for safety-scan cadence
Returns the smallest non-zero day threshold across the rule's actions.
Used as max(MinTriggerAge, kindFloor) for the per-kind cadence; date,
count-only, and delete-marker-only rules return 0 so callers fall
through to their kind-specific floor.
* feat(s3/lifecycle): EventLogHorizon for retention mode gate
Returns the maximum event age the reader needs for a rule. Days-based
kinds return their day threshold; pure NewerNoncurrent (count) and
ExpiredObjectDeleteMarker return SmallDelay. Date rules return 0 (the
gate skips them). Multi-action rules take the max — strictest horizon
wins.
Drives the Phase 2 mode gate: metaLogRetention < EventLogHorizon(rule)
+ bootstrapLookbackMin -> scan_only with RETENTION_BELOW_HORIZON.
* feat(s3/lifecycle): RuleHash for per-rule state CAS
sha256 over canonicalized form, first 8 bytes. Stable across tag-key
reorder, prefix trailing-slash variation, ID renames, and Status flips
(state continuity is preserved when an operator toggles
Enabled/Disabled). Different action shapes — different days, filter,
or action type — hash differently.
Used by the per-rule state directory layout
/etc/s3/lifecycle/<bucket>/<rule_hash_hex>/ and by the bootstrap
detector's reconcile-on-PUT.
* feat(s3/lifecycle): add s3_lifecycle.proto storage schema
Defines the durable types backing the lifecycle worker:
LifecycleState (per-rule mode + bootstrap_complete + degraded_reason
incl. RETENTION_BELOW_HORIZON / LOST_LOG), PendingItem, EntryIdentity,
BootstrapState, ReaderState (per-filer-shard cursors plus
tail_drained_streams marker), BlockerRecord (rule_hash optional for
pre-evaluation failures), RetryBudgetEntry with the four-shape
StreamKey oneof, and RetryTarget (no action / no expected_identity —
retry replays handler against current state).
No callers; schema only. Wired into the pb Makefile.
* feat(s3/lifecycle): add S3LifecycleParams to TaskParams
Wires lifecycle subroutines into the existing worker dispatch.
Subtype is READ (cluster-singleton meta-log reader), BOOTSTRAP
(per-bucket walker), or DRAIN (per-rule pending). bucket / rule_hash
populated for the latter two; ContinuationHint is an advisory resume
point for kill-resumable BOOTSTRAP / READ.
oneof tag = 14, after the existing ec_balance_params at 13.
* fix(s3/lifecycle): noncurrent delete markers honor NoncurrentDays
A non-current delete marker is just another version per AWS S3 spec
and is eligible under NoncurrentVersionExpirationDays. The
IsDeleteMarker special case is meant only for the *current* delete
marker (sole-survivor ExpiredObjectDeleteMarker action), so guard
that switch arm with IsLatest. Without IsLatest, the bootstrap walker
silently skipped pre-existing non-current delete markers under a
NoncurrentDays rule because ComputeDueAt returned zero.
Mirrors the same fix in evaluate.go so the runtime decision matches.
* fix(s3/lifecycle): preserve prefix trailing slash in RuleHash
"logs" and "logs/" match different object sets under literal
strings.HasPrefix semantics: "logs" matches "logsmore/x", "logs/" does
not. Collapsing them in the hash would let an XML edit silently bind
the new rule to the previous rule's durable state directory, causing
stale bootstrap_complete and stale pending entries against a rule
that now matches a different set.
* fix(s3/lifecycle): add UNSPECIFIED sentinels to enum zero values
Proto3 best practice: enum 0 should be an _UNSPECIFIED sentinel, not
an active value. Persisted state schemas care most: a partially
populated payload (or one written by an older binary that didn't set
the field) would otherwise silently default to a semantically active
value.
- LifecycleState.RuleKind: shifts EXPIRATION_DAYS..EXPIRED_DELETE_MARKER
from 0..5 to 1..6, with RULE_KIND_UNSPECIFIED at 0.
- LifecycleState.RuleMode: shifts EVENT_DRIVEN..PENDING_BOOTSTRAP from
0..4 to 1..5.
- LifecycleState.DegradedReason: replaces NONE=0 with
DEGRADED_REASON_UNSPECIFIED=0 (operators treat both as healthy).
- StreamKind: shifts ORIGINAL..PENDING from 0..3 to 1..4.
- S3LifecycleParams.Subtype: shifts READ..DRAIN from 0..2 to 1..3, so
an unset subtype no longer routes into the cluster-singleton READ task.
No on-disk state has been written yet; renumbering is safe.
* docs(s3/lifecycle): per-action state, not per-rule
A single AWS lifecycle XML <Rule> can declare multiple actions in
parallel (e.g. ExpirationDays=90 + AbortMPU=7 + NoncurrentDays=30).
Each must drive its own delay/horizon/mode/pending stream
independently. Modeling the rule as one compiled entry with one kind
collapses these — picking the smallest delay (7d MPU) means the 90d
expiration cursor advances past objects that aren't yet due, and the
90d action never re-fires.
Restructure storage to per-action: every XML rule expands into N
compiled actions; state lives at <bucket>/<rule_hash>/<action_kind>/.
The intermediate rule_hash directory keeps a rule's actions grouped
for operator listing. Each action has its own state file with its own
mode + bootstrap_complete + degraded_reason; sibling actions of the
same rule can degrade independently.
* fix(s3/lifecycle): per-action proto schema + safety-scan counters
Realigns the durable schema with the per-action storage model and the
safety-scan contract in the design doc.
- Promote LifecycleState.RuleKind to a top-level ActionKind enum and
rename rule_kind -> action_kind. The same enum now also keys
BootstrapKey, PendingKey, and BlockerRecord so per-action streams
under one rule never collapse.
- LifecycleState now keyed by (rule_hash, action_kind). Added
last_safety_scan_ts_ns / next_safety_scan_ts_ns and the four
observability counters the design specifies (evaluated_total,
expired_total, metadata_only_total, error_total). Dropped
last_evaluated_ns, deleted_total, skipped_object_lock_total, and
pending_size — subsumed or out-of-band.
- BlockerRecord.action_kind is OPTIONAL (UNSPECIFIED for
pre-evaluation failures, just like rule_hash).
No on-disk state has been written yet; the renumbering / rename is
free.
* fix(s3/lifecycle): per-action MinTriggerAge / EventLogHorizon helpers
The retention mode gate and safety-scan cadence run on each compiled
action independently. Taking a "min/max across all actions in the
rule" — as the previous helpers did — was wrong: a 90d ExpirationDays
sibling alongside a 7d AbortMPU action would cause the 90d cursor to
advance at 7d (because MinTriggerAge picked the smallest), and the
ExpirationDays action would never re-fire on objects that aged past
the 7d sweep window before reaching 90d.
Both helpers now take an ActionKind and return the threshold for that
specific action only. Returns 0 if the rule does not declare the
requested kind, which makes the gate a no-op and the cadence fall
through to the kind floor.
Also adds RuleActionKinds(rule), the canonical expansion that the
engine uses at compile time to turn one XML rule into N compiled
actions. NewerNoncurrentVersions paired with NoncurrentDays is
subsumed into a single NONCURRENT_DAYS action (AWS-paired
conditions); only when NewerNoncurrent stands alone does it become
NEWER_NONCURRENT.
* fix(s3/lifecycle): length-prefix RuleHash to remove delimiter ambiguity
The previous encoding used "tag=K=V\n" lines, which is ambiguous: a
tag (a=b, c) serializes identically to (a, b=c). A prefix containing
"\nexp_days=99" could likewise forge an action field. Either could
silently bind two semantically different rules to the same per-rule
state directory.
Switch to a length-prefixed canonical form: each scalar is written as
<field-tag-byte> <uvarint-length> <bytes>. Field-tag bytes namespace
each scalar so ("a=b" tag-key) and ("a" tag-key with "=b" leakage)
can't collide. Three regression tests pin the resistance.
* docs(s3/lifecycle): ActionKey is the engine identity, not rule_hash
Finishes the per-action restructure across the engine pseudocode,
bootstrap completion, drain locks, detector paths, policy CAS, and
Phase 2 plan. Every per-action data structure — engine indexes,
target modes, newly-completed sets, bootstrap completion bits, drain
keys, locks, metrics, status — is now keyed by
ActionKey{rule_hash, action_kind}, not by rule_hash alone.
Without this, sibling actions under one XML rule still shared
scheduling state in the engine: originalDelayGroups holding
[]ruleHash means a rule's 7d AbortMPU and 90d ExpirationDays
collapse into one entry, the smaller delay wins for cursor advance,
and the larger sibling never re-fires.
Helper API tightened: EvaluateAction(rule, kind, info, now) and
ComputeDueAt(rule, kind, info) replace the aggregate-rule signatures
so a caller asks one specific action's eligibility against one
entry, never "any action of this rule." Drain task lock includes
action_kind so siblings have independent re-arm timers. Policy CAS
moves from rule_hash to ActionKey membership.
* fix(s3/lifecycle): kind-aware EvaluateAction / ComputeDueAt
Replaces the aggregate-rule signatures with per-action ones:
EvaluateAction(rule, kind, info, now) and ComputeDueAt(rule, kind,
info). The old Evaluate(rule, info, now) could return the verdict of
ANY action declared on the rule, which sat wrong with the per-action
engine indexing (each ActionKey has its own delay group, mode, and
pending stream).
Each helper now decides exactly one (rule, kind) compiled action's
fate against one entry. Asking for a kind the rule doesn't declare,
or asking against the wrong object shape for that kind, returns
ActionNone / zero — never silently routes to a sibling.
Multi-action regression test: a rule with ExpirationDays=90 and
AbortMPU=7 evaluates each kind independently for the same entry; the
7d window has no influence on the 90d eligibility decision.
|
||
|
|
a1e5eb9dad |
Fix UI prefix url encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes * Fix filer UI navigation for URL-sensitive object prefixes * Clarify filer UI path escaping test name Rename the legacy filer UI path test to describe the actual behavior being checked. The printpath helper preserves timestamp characters that are valid in URL path components, while the PR fix is focused on query-string escaping for path and cursor parameters. |
||
|
|
487b93eb49 |
fix(volume): don't panic on read when needle map is nil (#9342)
* fix(volume): don't panic on read when needle map is nil A failed CommitCompact reload (and #9335's new error path for a remote-tiered volume with a stray .vif but no .idx) leaves v.nm == nil on a volume that's still in the store. readNeedle / readNeedleDataInto dereferenced v.nm with no guard, so the next GET segfaulted the http handler instead of returning an error the client could retry on another replica. Add the same v.nm == nil check the other Volume accessors already use, including the slow-read inner loop where the lock is released between iterations and a failed reload can race in. Fixes #9339. * match rust nm-nil read behavior; trim comments seaweed-volume's read_needle_with_option / re_lookup_needle_data_offset already lift Option<NeedleMap> through ok_or(NotFound). Use ErrorNotFound on the Go side too instead of a generic 500-mapped error so both volume servers respond identically when v.nm is nil. * log once when reads hit nil needle map ErrorNotFound alone hides the real cause: a half-loaded volume just returns 404s and the operator has nothing to grep for. Add a once-per- volume Errorf on the nil path, reset on successful load. Mirror the same in seaweed-volume via nm_or_not_found(). * trim comments * drop once-flag, log inline on every nil-nm read |
||
|
|
e96190d128 |
fix(mount): skip pressure-eviction of gappy page chunks (#9330) (#9334)
* fix(mount): skip pressure-eviction of gappy page chunks (#9330) A page chunk whose written-interval list has an internal hole was being sealed under buffer-pressure eviction, then SaveContent would emit one volume chunk per maximal adjacent run with no chunk covering the hole; reads then silently zero-fill the gap (filer/stream.go:177-186). On a sequential cp through FUSE, that bakes in-flight 4 KiB writes into split volume chunks and leaves chunk-sized blocks of zeros on the destination. Filter the pressure-driven sealers (SaveDataAt's over-limit path, EvictOneWritableChunk, ProactiveFlush) to only seal chunks whose written intervals form one unbroken run. The flush-on-close path (FlushAll) is unchanged: at close every gap is by definition a sparse-file write that the app legitimately never made. * fix(mount): also gate IsContiguouslyWritten on leading zero-offset Tighten IsContiguouslyWritten to also reject empty lists and lists whose first interval does not start at offset 0. The internal-gap and leading-gap cases are symmetric for pressure-driven sealing: both put in-flight FUSE writeback for the missing range at risk of being baked into split volume chunks. The flush-on-close path is still unfiltered (sparse writes are sealed legitimately at FlushAll). Also align EvictOneWritableChunk's bestBytes initialization with SaveDataAt (start at 0) so an empty chunk is never picked, matching the new semantic. Addresses gemini-code-assist review on PR #9334. * fix(mount): preserve cap-pressure liveness in EvictOneWritableChunk The previous version of this fix had EvictOneWritableChunk return false whenever every dirty chunk was gappy. That broke the accountant's Reserve loop: cond.Wait only wakes on Release, Release only fires on upload completion, and refusing to seal anything means no upload starts — the writer hangs at the -writeBufferSizeMB cap forever. Two-pass selection: prefer the fullest gap-free chunk (issue #9330: this is what protects sequential cp from racing FUSE writeback), fall back to the oldest non-empty writer when nothing is gap-free. Oldest-first maximizes the chance that FUSE writeback for the gap range has already settled. The actual sealing path is unchanged — SaveContent still emits one volume chunk per maximal adjacent run; pages that arrive after the seal land in a fresh MemChunk for the same logicChunkIndex and are sealed in turn, so coverage is reconstructed at read time by readResolvedChunks. Sequential cp at default settings always hits the strict pass (writes arrive contiguous-from-0 within their logicChunkIndex), so the bug-fix behavior is preserved; the fallback only runs under genuinely sparse workloads or under FUSE writeback so backed up that no chunk has settled, where forced progress is preferable to a hung mount. * test(mount): pin ProactiveFlush gap-skip behavior (#9330) Sibling regression test for the ProactiveFlush guard added in this series: same 3-chunk setup as TestEvictOneWritableChunk_SkipsGappyChunks (internal gap, leading gap, contiguous). Verifies ProactiveFlush picks the contiguous chunk when staleness criteria are otherwise satisfied, returns false when only gappy chunks remain (no liveness fallback like EvictOneWritableChunk has — failing here is just a missed optimization), and that filling the holes lets the chunks auto-seal via maybeMoveToSealed. * style(mount): trim verbose comments on #9330 fix |
||
|
|
7b0b64db65 |
fix(admin/view): wrap plugin history URL with basePath (#9341)
Plugin tabs/sub-tabs use history.pushState/replaceState to keep the
URL bar in sync with the active view, but updateURL fed it the raw
output of buildPluginURL ("/plugin/lanes/<lane>/..."). Under a
urlPrefix deployment that strips the prefix, so reloading the page
hit /plugin/... directly and 404'd at the proxy.
Wrap with basePath() so the rewritten URL keeps the deployment
prefix.
Reported at #9240.
|
||
|
|
1c0e24f06a |
fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331. * fix(balance): skip remote-tiered volumes in admin balance detection The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fix(volume): keep remote data on volume-move-driven delete The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331. * test(storage): in-process remote-tier integration tests Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in: - Destroy(keepRemoteData=true) preserves the remote object (move case) - Destroy(keepRemoteData=false) deletes it (real-delete case) - Vacuum/compact on a remote-tier volume never deletes the remote object - EC encode requires the local .dat (callers must download first) - EC encode + rebuild round-trips after a tier-down Tests run in-process and finish in under a second total — no cluster, binary, or external storage required. * fix(rust-volume): keep remote data on volume-move-driven delete Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331. * fix(volume): preserve remote tier on IO-error eviction; fix EC test target Two review nits: - Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false. - TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> |
||
|
|
12f283357f |
fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333)
* fix(iam): scope IAM-managed OIDC provider lookup by role account Two account-scoped OIDC records sharing an issuer were collapsed into a single map slot keyed only by the URL. The last-write-wins entry then served every AssumeRoleWithWebIdentity, so a token destined for account B's role could be validated by account A's record (its clientIDs and thumbprints), defeating the per-account isolation the records exist for. The role-account check in enforceProviderAccountScope still rejected the cross-account assumption, but only after the wrong record's audience and TLS pin had already accepted the token. Refresh now keys IAM-managed records as (issuer, account), and validation parses the requested role's account up front and matches the record under that issuer in this order: exact account, global (account-less), static-config fallback. An unknown account hint deliberately skips account-scoped entries — picking one arbitrarily is the bug this commit fixes — and falls through to global or static. * fix(iam): route public AssumeRoleWithWebIdentity through IAMManager handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity directly, bypassing the IAMManager wrapper. The wrapper is where enforceProviderAccountScope rejects cross-account assumption attempts and capDurationByRole clamps to the role's MaxSessionDuration; both silently became no-ops for any AWS-SDK caller hitting the public endpoint. Dispatch through the IAMManager (via the existing IAMManagerProvider interface that other handlers in this file already use) when one is wired. Embedded test setups without an IAM integration fall back to the bare STS service unchanged. * fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config initOIDCProviderStore mirrored only URL and ClientIDs. Once RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on boot once the metadata-subscribe loop kicked in), buildOIDCProviderFromRecord rebuilt the runtime provider from this truncated record. Because IAM-managed entries take precedence over the static-config map, the rebuild silently shadowed the bootstrap with a weaker provider: - Thumbprints: dropped, so TLS-pinned issuers fell back to the system trust store. - AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped reaching the session. - PolicyClaim: dropped, so claim-based policy mode stopped triggering. Pull all three from the provider's static Config map at mirror time so the stored record round-trips to a runtime provider equivalent to the one the static config produced directly. * fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler Phase 3b advertises that RoleArn MAY be omitted in claim-based policy mode — the STS service then derives the assumed-role ARN from the configured policy claim. The HTTP handler still rejected empty RoleArn up front with MissingParameter, so SDK callers using the documented omitted-role flow never reached the STS layer. Drop the pre-check; STS still validates that claim-based mode is configured and that the IDP emits policies, returning a precise error when either is missing. The existing error mapping below this point surfaces those as InvalidParameterValue, matching what an AWS SDK expects. * test(iam): update missing-RoleArn STS integration test for the new contract The previous commit drops the HTTP-layer RoleArn pre-check so claim-based mode can derive the ARN from a JWT claim. The integration test still asserted MissingParameter for the missing-RoleArn case, which now reaches the STS layer and surfaces a JWT-parse error instead. Update the assertion to match: missing RoleArn alone must no longer surface as MissingParameter, but a bogus JWT must still be rejected. |
||
|
|
9af1b212d3 |
feat(iam): OIDC provider audit trail (Phase 3e) (#9325)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323. * feat(iam): opt-in session revocation via JTI blocklist Add SessionRevocationStore (memory + filer implementations) and wire it into the IAMManager.IsActionAllowed path so a revoked session is rejected on the next signed request. Session JWTs now embed the session id as the JTI claim, giving the blocklist a stable key without requiring a second secret. Operators who don't configure a store keep the existing fully-stateless behaviour: every session stays valid until natural expiry. Operators who do configure one accept one filer lookup per signed request in exchange for being able to invalidate compromised tokens before expiry. Revocation entries carry the original session expiry so the blocklist self-trims via PurgeRevokedSessions. * fix(iam): hash JTI filenames + paginate Purge with proper EOF handling Three reviewer-flagged issues on the filer-backed revocation store: 1. Path traversal (security-medium): RevokeSession is exported and takes an arbitrary string. Using the JTI verbatim as a filename meant a caller could pass "../../etc/passwd" to write outside the basePath. SHA-1 hash the JTI to a fixed-width hex name; lookups still find the entry because Revoke and IsRevoked share the same hash function. 2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed the outer err and the loop just broke on any failure, so a mid-stream gRPC error returned (count, nil) and the caller had no idea the purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream and propagate everything else. 3. Purge had a hardcoded 10000-entry cap. Stream-paginate via StartFromFileName so the operator-cron can clean a backlog larger than that without losing rows. * feat(iam): OIDC provider audit trail Emit one structured event per IAM-managed OIDC provider lifecycle mutation (Create, Delete, Add/Remove ClientID, UpdateThumbprints, Tag, Untag). Three sinks ship in-tree: - GlogAuditSink — default; events become structured log lines. - MemoryAuditSink — in-process buffer for tests / inspection. - FilerAuditSink — durable, one file per event under /etc/iam/audit/oidc-providers (operator-overridable basePath). Audit emission is best-effort: a failing sink never blocks an IAM mutation that has already succeeded. Use-events (per token validation) are intentionally not emitted yet — too hot for unconditional sinks. * fix(iam): collision-free audit filenames + correct file mtime Two issues gemini flagged on FilerAuditSink.Emit: 1. Filename was %d-%s.json (UnixNano + Type). Two mutations at the same nano against different ARNs (think batch script touching several providers) collided on the same path and the second CreateEntry failed silently. Append a short ARN hash to make the name unique per-event without leaking the ARN. 2. File Mtime/Crtime were set to time.Now() at write time. For audit integrity those should reflect the event's occurrence time so filer-level "ls -lt" output matches the contents. Addresses three medium-priority gemini reviews on PR #9325. |
||
|
|
9d6a699b94 |
feat(iam): opt-in session revocation via JTI blocklist (Phase 3d) (#9324)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323. * feat(iam): opt-in session revocation via JTI blocklist Add SessionRevocationStore (memory + filer implementations) and wire it into the IAMManager.IsActionAllowed path so a revoked session is rejected on the next signed request. Session JWTs now embed the session id as the JTI claim, giving the blocklist a stable key without requiring a second secret. Operators who don't configure a store keep the existing fully-stateless behaviour: every session stays valid until natural expiry. Operators who do configure one accept one filer lookup per signed request in exchange for being able to invalidate compromised tokens before expiry. Revocation entries carry the original session expiry so the blocklist self-trims via PurgeRevokedSessions. * fix(iam): hash JTI filenames + paginate Purge with proper EOF handling Three reviewer-flagged issues on the filer-backed revocation store: 1. Path traversal (security-medium): RevokeSession is exported and takes an arbitrary string. Using the JTI verbatim as a filename meant a caller could pass "../../etc/passwd" to write outside the basePath. SHA-1 hash the JTI to a fixed-width hex name; lookups still find the entry because Revoke and IsRevoked share the same hash function. 2. Purge swallowed errors. The inner `err` from stream.Recv() shadowed the outer err and the loop just broke on any failure, so a mid-stream gRPC error returned (count, nil) and the caller had no idea the purge was incomplete. Switch to errors.Is(io.EOF) for end-of-stream and propagate everything else. 3. Purge had a hardcoded 10000-entry cap. Stream-paginate via StartFromFileName so the operator-cron can clean a backlog larger than that without losing rows. |
||
|
|
6483583491 |
feat(iam): account-scoped OIDC providers (Phase 3c) (#9323)
* feat(iam): claim-based policy mode for AssumeRoleWithWebIdentity When the caller passes the sentinel RoleArn arn:aws:iam:::role/sts-claim-based (or omits it entirely) and the matched OIDC provider has policyClaim set, mint a session whose effective policies come from that JWT claim instead of from a server-side role mapping. Accepts string, comma-separated string, or array shapes — MinIO-compatible behaviour for IDPs that already attach policies to the user. Trust-policy validation is skipped in claim-mode: the IDP is the sole authority for both authentication and authorization, mirroring the contract MinIO documents for its DummyRoleARN flow. Concrete-role mode is unchanged and still requires the role definition + trust policy. * fix(iam): trim policy-claim array elements + clean up stale comments Three medium-priority cleanups gemini flagged on the claim-based path: - extractClaimPolicies's array branch was leaving whitespace on each element while the string/comma-separated branch trimmed via splitPolicyClaimString. An IDP that emits ["readonly", " billing "] would create a "billing" policy lookup that didn't match the stored name. Trim every array element, drop empties. - The "synthetic ARN keyed on the session name" comment was wrong — effectiveRoleArn here is the literal sentinel; it's the assumed-role ARN generated downstream that's session-keyed. Reword. - The empty if/else block at the start of validateAssumeRoleWithWebIdentityRequest existed only to host a comment about deferred validation; the comment now lives in the function godoc and the empty branch is gone. Addresses three gemini medium reviews on PR #9322. * feat(iam): account-scoped OIDC providers Add OIDCProviderRecord.AccountID enforcement: when a role lives in account A, the OIDC provider validating the assume-role token must be either global (AccountID="") or also live in account A. Cross-account use is rejected at the IAM-manager layer before reaching the trust policy validator. OIDCProviderStore gains GetProviderByIssuerAndAccount; both the in- memory and filer-backed stores implement it. Static-config-only deployments are unaffected since they don't populate the store. * fix(iam): account-scoped lookup for cross-account check enforceProviderAccountScope was calling GetProviderByIssuer, which returns the first match arbitrarily when multiple providers share an issuer (one global + one per tenant is the canonical setup). On a two-record collision the wrong record could come back first and falsely reject a valid same-account or global-provider request. Use GetProviderByIssuerAndAccount as the primary lookup so the filter happens in the store. On miss, fall back to GetProviderByIssuer purely to distinguish "issuer entirely unknown" (let the STS layer reject) from "issuer registered in a different account" (surface a precise cross-account error). Addresses gemini high-priority review on PR #9323. |