8960 Commits

Author SHA1 Message Date
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves; sketched below).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."
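
A minimal sketch of that helper: only the gauge `S3LifecycleCursorMinTsNs`
and the save-then-publish ordering are from this change; the signature,
cursor fields, and the walker gauge's Go name are assumptions.

```go
// Hedged sketch: gauges advance only on successful saves, so metrics can
// never run ahead of persisted state. Names and signature assumed.
func saveCursorAndPublish(ctx context.Context, p CursorPersister, shard int, c Cursor) error {
	if err := p.Save(ctx, shard, c); err != nil {
		return err // save failed: leave both gauges untouched
	}
	label := strconv.Itoa(shard)
	stats.S3LifecycleCursorMinTsNs.WithLabelValues(label).Set(float64(c.MinTsNs))
	stats.S3LifecycleDailyRunLastWalkedNs.WithLabelValues(label).Set(float64(c.LastWalkedNs))
	return nil
}
```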

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
bbc075b353 feat(s3/lifecycle): plumb WalkerInterval through worker admin config (#9485)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.
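
A sketch of the throttle predicate under those semantics (the real
signature may differ):

```go
// walkerDue, sketched: interval <= 0 preserves fire-every-pass; a zero
// LastWalkedNs (cold start, or a cursor file written before this change)
// counts as never walked and fires to establish the anchor.
func walkerDue(lastWalkedNs int64, runNow time.Time, interval time.Duration) bool {
	if interval <= 0 {
		return true // prior behavior: tests and the s3tests fast driver
	}
	if lastWalkedNs == 0 {
		return true // never walked steady-state
	}
	return runNow.UnixNano()-lastWalkedNs >= int64(interval)
}
```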

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).
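
A sketch of the split in (2); everything beyond walkerDue, cfg.Walker, and
cfg.WalkerInterval is a stand-in name:

```go
// runShard owns within-pass dedup; walkerDue stays a pure persisted-state
// throttle. coldStart / recoveryNeeded stand in for the real branch tests.
walkedThisPass := false
if coldStart || recoveryNeeded { // bounded, unconditional fires
	if err := cfg.Walker(ctx, shard); err != nil {
		return err
	}
	walkedThisPass = true
	lastWalkedNs = runNow.UnixNano()
}
// ... replay drain ...
if !walkedThisPass && walkerDue(lastWalkedNs, runNow, cfg.WalkerInterval) {
	if err := cfg.Walker(ctx, shard); err != nil {
		return err
	}
}
```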

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now); TestRunShard_ColdStartDoesNotDoubleWalk
still pins the integration behavior end-to-end.

* feat(s3/lifecycle): plumb WalkerInterval through worker admin config

#9484 added cfg.WalkerInterval to dailyrun.Config but left the worker
side wired to zero — operators couldn't actually use the throttle
without recompiling. Add the admin-schema knob:

- New constant WalkerIntervalMinutesAdminKey = "walker_interval_minutes"
  follows the MetaLogRetentionDaysAdminKey pattern (Int64, minutes
  unit, 0 = unbounded / fire every pass).
- New Config.WalkerInterval populated in ParseConfig from
  adminValues; negative / zero stay at zero so the prior "fire every
  pass" semantics keep the in-repo integration tests and the s3tests
  sub-minute driver working unchanged (see the sketch after this list).
- handler.go: admin form field with operator-facing label and
  description, default in DefaultValues, value forwarded to
  dailyrun.Run via cfg.WalkerInterval.
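
A sketch of the parsing contract; the key constant is the one above, the
adminValues accessor is an assumption:

```go
// ParseConfig wiring, sketched: minutes come off the admin schema, negative
// input clamps to zero, and zero means fire every pass.
minutes := adminValues.GetInt64(WalkerIntervalMinutesAdminKey)
if minutes < 0 {
	minutes = 0
}
cfg.WalkerInterval = time.Duration(minutes) * time.Minute
```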

Tests cover the default-zero, positive, and negative cases — same
shape as the MetaLogRetention tests so the parsing contract stays
consistent.

Stacked on #9484; rebase after that lands.
2026-05-13 14:09:31 -07:00
Chris Lu
c6582228b8 feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now); TestRunShard_ColdStartDoesNotDoubleWalk
still pins the integration behavior end-to-end.
2026-05-13 14:09:13 -07:00
Lars Lehtonen
75c807b586 chore(weed/mq/kafka/protocol): remove unused functions and variables (#9488) 2026-05-13 13:59:24 -07:00
Chris Lu
d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.
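
The shape of the fix, sketched against the usual master_pb.Location fields
(the surrounding handler code is assumed):

```go
// Carry GrpcPort into the EC lookup response the way LookupVolume already
// does, instead of leaving clients to guess httpPort + 10000.
locations = append(locations, &master_pb.Location{
	Url:        loc.Url,
	PublicUrl:  loc.PublicUrl,
	DataCenter: loc.DataCenter,
	GrpcPort:   uint32(loc.GrpcPort), // previously omitted
})
```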

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.
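
Roughly the walk both call sites now delegate to; unexported names and
lock granularity here are guesses:

```go
// findEcShard-style lookup, sketched: every DiskLocation is probed with the
// EC-volume lock held, so a concurrent mount/unmount cannot race the read.
func (s *Store) findEcShard(vid needle.VolumeId, shardId erasure_coding.ShardId) (*erasure_coding.EcVolumeShard, bool) {
	for _, location := range s.Locations {
		location.ecVolumesLock.RLock()
		if ecVolume, found := location.ecVolumes[vid]; found {
			if shard, ok := ecVolume.FindEcVolumeShard(shardId); ok {
				location.ecVolumesLock.RUnlock()
				return shard, true
			}
		}
		location.ecVolumesLock.RUnlock()
	}
	return nil, false
}
```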

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu
79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. clean and orphan-only categories double-counted when a directory
   had no pointer and at least one orphan: incremented both. Make them
   mutually exclusive so report totals sum to versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.
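
   The backoff shape replacing the bare sleep in (3), sketched (the retry
   loop around it is assumed):

```go
// Interruptible backoff: a cancelled ctx ends the retry budget immediately
// instead of sleeping it out.
timer := time.NewTimer(backoff)
select {
case <-ctx.Done():
	timer.Stop()
	return ctx.Err()
case <-timer.C:
}
```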

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.
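
   A sketch of the sanitizer's logic; the helper name is assumed, the
   character classes are the ones listed above:

```go
// Safe values pass through untouched; anything that could split or spoof a
// token is emitted as a single strconv.Quote'd field.
func sanitizeHealLogField(s string) string {
	for _, r := range s {
		if unicode.IsSpace(r) || unicode.IsControl(r) || r == '"' || r == '\\' {
			return strconv.Quote(s)
		}
	}
	return s
}
```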

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
2026-05-13 10:48:58 -07:00
Chris Lu
e025ec2334 fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends (#9483)
* fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends

NewSortedFileNeedleMap never initialised the inherited
baseNeedleMapper.indexFileOffset, so it stayed at zero. The compact and
leveldb constructors both seed it from stat().Size(); this one did not.

When a read-only or noWriteCanDelete volume processed a Delete, the
inherited appendToIndexFile wrote the tombstone via WriteAt at
indexFileOffset=0 and advanced 16 bytes at a time. Every delete since
the SortedFileNeedleMap path was first exercised for writes (#5633,
which switched .sdx to O_RDWR) overwrote the front of .idx with
tombstones for unrelated keys.

Net effect on disk: the first N entries of .idx become sequentially
written tombstones with .dat offsets at the tail, while the original
Put records that lived in slots [0..N) are gone. fsck sees ~N phantom
orphans whose keys' runtime needle-map entries are already tombstoned
in .sdx, so the volume server replies 304 to every purge and the
orphan count is stable across retries.

Stat the .idx at open and seed indexFileOffset = size. The .idx now
grows on delete instead of being clobbered. Add a regression test that
populates .idx via the compact map, opens it as a SortedFileNeedleMap,
deletes one needle, and verifies the file grew by one entry, the
original Put records survived, and the tombstone landed at the tail.
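
The fix itself reduces to a few lines at open time; a sketch, with handle
and receiver names assumed:

```go
// Seed the inherited offset from the on-disk size, like the compact and
// leveldb constructors do, so appendToIndexFile starts at the tail.
stat, err := indexFile.Stat()
if err != nil {
	return nil, err
}
m.indexFileOffset = stat.Size()
```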

The fix only stops further corruption. .idx files already damaged by
this bug have to be rebuilt from .dat before the next restart, or
isSortedFileFresh will regenerate .sdx from the bad .idx and propagate
the damage.

Refs #9479

* test: harden SortedFileNeedleMap regression assertions

Address PR review (gemini-code-assist, coderabbit):

- Close the writer before t.Fatalf in the Put loop so a failing seed
  doesn't leak the .idx file descriptor.
- Verify offset and size of preserved Put records, not just the key.
  Front-overwrite damage would clobber all three fields, but a key-only
  check would miss a different regression that corrupted offset/size
  while leaving the key intact.
2026-05-13 10:22:01 -07:00
Chris Lu
f5a4bfb514 fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete

deleteSpecificObjectVersion did two non-atomic filer ops: rm the version
blob, then update the .versions/ pointer. Step 2 failures were silently
logged and the client got 204 OK, so any transient blip (filer timeout,
process restart between RPCs, lock contention) left the .versions/
directory naming a missing file. Subsequent GETs paid the 10-retry
self-heal cost and returned NoSuchKey — surfacing as "Storage not found"
to Veeam, which is what triggered this investigation.

Three changes:

1. Pre-roll the pointer for the singleton / multi-version-deleting-latest
   cases. The pointer is repointed (multi) or cleared (singleton) before
   the blob rm. A failure between leaves a recoverable orphan blob —
   pointer is consistent, GETs succeed or correctly miss without
   entering the stale-pointer self-heal path.

2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion
   with bounded retries (~6.3s worst case). When retries are exhausted
   the function now returns a non-nil error instead of swallowing it.
   The caller logs at Error level and queues the path for the
   reconciler.

3. Background reconciler drains stranded .versions/ pointer-to-missing
   states off the hot path. Bounded in-memory queue with capped retries;
   read-path heal remains as a last-resort safety net.

* fix(s3/versioning): address review on #9460

Four fixes addressing review on PR #9460. All four are correctness;
no behavioural change for the happy path.

1. repointLatestBeforeDeletion: discriminate NotFound from transient
   errors when re-fetching the .versions/ entry. Previously any error
   returned rolled=true,nil — a transient filer hiccup at that point
   would cause the caller to skip the post-delete reconciliation AND
   proceed with the blob rm, producing exactly the dangling pointer
   state the PR aims to prevent. NotFound stays "vacuously consistent"
   (directory already gone); other errors surface so the caller aborts
   before removing the blob.

2. Move the singleton .versions/ teardown out of
   repointLatestBeforeDeletion (where it ran BEFORE the blob rm and
   always failed with "non-empty folder") into deleteSpecificObjectVersion
   AFTER the blob rm. Adds a wasSingleton return value so the caller
   knows when to run the teardown. Without this, every singleton-version
   delete in a versioned bucket leaked an empty .versions/ directory.

3. Wrap the list, getEntry, and mkFile calls inside
   repointLatestBeforeDeletion with retryFilerOp so the pre-roll has
   the same transient-failure resilience as the post-roll path. Without
   retries, a single transient blip causes the caller to fall back to
   the legacy non-atomic flow even when the filer recovers immediately.

4. healVersionsPointer in the reconciler: same NotFound-vs-transient
   discrimination on both the .versions/ getEntry and the latest-file
   presence probe. Previously a transient filer error would silently
   evict the candidate from the queue as "healed", leaving the real
   stranded state until a client read happened to surface it.

Also fixes the gemini-flagged consistency nit: the queued-for-reconciler
error log now uses normalizedObject instead of object so it matches the
queue entry's key.

* fix(s3/versioning): short-circuit terminal errors in retryFilerOp

Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound,
gRPC NotFound, context.Canceled, and context.DeadlineExceeded.
retryFilerOp now bails immediately on a terminal error and returns it
unwrapped, so callers like repointLatestBeforeDeletion.getEntry and
updateLatestVersionAfterDeletion.rm see the raw NotFound instead of
paying the ~6.3 s retry-budget delay AND parsing it out of an
"exhausted N retries" wrapper.

errors.Is and status.Code already walk the %w chain so today's call
sites still work, but the delay was real on the hot DELETE path
whenever a key was genuinely absent. Test added covering all five
terminal-error shapes — each must run the wrapped fn exactly once and
return in under 50 ms.
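
A sketch of the predicate as described (the error set is the one named
above; exact code layout assumed):

```go
// Terminal errors end the retry loop immediately and surface unwrapped.
func isRetryableFilerErr(err error) bool {
	if errors.Is(err, filer_pb.ErrNotFound) ||
		errors.Is(err, context.Canceled) ||
		errors.Is(err, context.DeadlineExceeded) {
		return false
	}
	if status.Code(err) == codes.NotFound {
		return false
	}
	return true
}
```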
2026-05-13 10:14:27 -07:00
Chris Lu
de28c4df61 fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) (#9480)
* fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478)

handleFoundEcxFile only checks for .dat in the same disk location as the
EC shards. In a multi-disk volume server an interrupted encode can leave
.ec?? + .ecx on disk B while the source .dat still lives on disk A: the
per-disk loader sees no .dat next to .ecx, mistakes the leftover for a
distributed-EC layout, and mounts the partial shards. The volume server
then heartbeats both a regular replica and an EC shard for the same vid
and the master keeps both.

Sweep the store after per-disk loading and before the cross-disk
reconcile to delete partial EC files when a healthy .dat for the same
(collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for
every pruned shard so master forgets the new-shard message the per-disk
pass already emitted, instead of waiting for the next periodic heartbeat.

* fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478)

Rust port of the same Store-level prune added to weed/storage. The
per-disk EC loader in disk_location.rs only checks for .dat in the same
disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx
on disk B while the source .dat sits on disk A is mounted as if it were
a distributed-EC layout. The volume server then heartbeats both a
regular replica and an EC shard for the same vid.

Sweep the store after per-disk loading and before the cross-disk
reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT
shards when a .dat for the same (collection, vid) exists on a sibling
disk, and remove all on-disk EC artefacts for them. The Rust heartbeat
path already diff-emits deletes from the next ec_volumes snapshot, so no
explicit delete-channel push is needed here.

Tests cover both the issue 9478 layout and a distributed-EC layout with
no .dat anywhere on the store, which must be left alone.

* fix(storage): validate sibling .dat size before deleting partial EC (#9478)

The earlier prune deleted partial EC files whenever any .dat for the
same vid existed on a sibling disk — including a zero-byte shell. A
shell is no more useful than the partial shard it would replace, and
the partial shard might still combine with shards on other servers
in a recoverable distributed-EC layout. Wiping it based on a corrupt
sibling .dat is data loss masquerading as cleanup.

Tighten the check: when the EC's .vif recorded a non-zero source size
in datFileSize, require the sibling .dat to be at least that many
bytes; otherwise fall back to "at least a superblock". The .vif value
is what the encoder wrote at the moment the source was sealed, so a
sibling .dat smaller than that is provably truncated. Carry the size
through indexDatOwners alongside the location.
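
The validation, sketched (the superblock floor constant is an assumption):

```go
// A sibling .dat smaller than the .vif's recorded source size is provably
// truncated; with no recorded size, require at least a superblock.
func siblingDatIsHealthy(datSize, vifDatFileSize int64) bool {
	if vifDatFileSize > 0 {
		return datSize >= vifDatFileSize
	}
	return datSize >= super_block.SuperBlockSize
}
```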

The Rust port had the same gap and an additional bug behind it:
EcVolume::new wasn't reading datFileSize from .vif, so the safety
check always fell back to the superblock floor. Wire datFileSize
through. The existing shard-size calculation in
LocateEcShardNeedleInterval already uses dat_file_size when non-zero,
so populating it also matches Go's behaviour there.

Tests cover the truncated-sibling case in both ports.
2026-05-13 09:25:10 -07:00
Chris Lu
3f1eaf9724 fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD

Successful GET/HEAD object requests never produced a fluent audit entry
because those handlers write the response directly (streaming for GET,
WriteHeader for HEAD) and never reach a PostLog call site. The wiki
advertises GET as an audited verb, so the asymmetry surprises operators
who rely on the log for read-access auditing.

Move the safety net into the track() middleware: tag each request with
an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it,
and emit a single fallback entry after the handler returns when nothing
fired. The recorder's status flows into the fallback so the audit row
still reflects 200/206 vs 404 etc. No double logging for handlers that
already emit (write helpers, error paths, bulk delete).
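
A sketch of the middleware wiring; the context key and fallback emitter are
hypothetical names:

```go
// track(), sketched: tag the request, run the handler, and emit exactly one
// fallback entry when nothing in the chain posted an audit log.
fired := new(atomic.Bool)
r = r.WithContext(context.WithValue(r.Context(), auditTrackedKey, fired))
handler.ServeHTTP(recorder, r) // PostLog / PostAccessLog flip the flag
if !fired.Load() {
	postFallbackAuditEntry(r, recorder.Status) // carries 200/206 vs 404 etc.
}
```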

Refs #9463

* fix(s3/audit): defensive nil checks on audit-tracking helpers

Address PR review: guard against nil request and nil *atomic.Bool stored
under the audit-tracking key. The conditions are unreachable today (the
key is private and we only ever store new(atomic.Bool)), but the checks
are free and keep the helpers safe if a future caller misbehaves.

* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)

test(s3/audit): cover track() fallback wiring + cleanup

Adds two unit tests in weed/s3api/stats_test.go that exercise the
audit-tracking flag set up by track(): one verifies the fallback path
fires when a handler writes the response directly (the GET/HEAD object
regression in #9463), the other verifies the flag is set when a handler
emits PostLog itself so the fallback is skipped.

To make the wiring observable without standing up fluent, PostLog now
marks the audit flag before short-circuiting on a nil Logger; production
behavior is unchanged (no logger, no posting) but the flag stays
consistent.

Also drops two stale comments in s3api_object_handlers.go that still
referenced proxyToFiler — that helper was removed when GET/HEAD started
streaming from volume servers directly.

Stacks on #9467.
2026-05-13 09:24:59 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).
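
The wait-then-dispatch ordering, sketched; rate.Limiter is
golang.org/x/time/rate and the metric is the one above, the rest is
assumed:

```go
// Replay and walk share one budget: wait on the cluster limiter before each
// RPC, and record what the wait cost. A nil limiter means no throttle.
if d.limiter != nil {
	waitStart := time.Now()
	if err := d.limiter.Wait(ctx); err != nil {
		return err // ctx cancelled mid-wait; the RPC is never sent
	}
	stats.S3LifecycleDispatchLimiterWaitSeconds.Observe(time.Since(waitStart).Seconds())
}
// ... send the LifecycleDelete RPC ...
```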

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
2026-05-13 09:24:50 -07:00
Chris Lu
37e505b8fd refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass

Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.

drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.

cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.

Filer load: 16x -> 1x SubscribeMetadata streams per pass.

* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)

The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.

Same regression I fixed in 6796ab6db for the per-shard subscription;
now applied at the shared level. computeGlobalStartTsNs loads every
shard's cursor and picks the minimum, falling back to the cold-start
floor only for shards with no persisted cursor.
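
computeGlobalStartTsNs, sketched (cursor loading assumed):

```go
// Minimum over per-shard cursors; shards with no persisted cursor
// contribute the cold-start floor instead.
func computeGlobalStartTsNs(shards []int, coldStartFloorNs int64) int64 {
	start := int64(math.MaxInt64)
	for _, shard := range shards {
		tsNs := coldStartFloorNs
		if cursor, found := loadCursor(shard); found {
			tsNs = cursor.TsNs
		}
		if tsNs < start {
			start = tsNs
		}
	}
	if start == int64(math.MaxInt64) {
		return coldStartFloorNs // no shards configured
	}
	return start
}
```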
2026-05-13 02:13:11 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    walker-only-only snapshots produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so
walker-only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside the meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard invoked with no -refresh — only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.
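
The save path, sketched (the 5s budget is the one above; persister naming
assumed):

```go
// Persist with a fresh context so a pass-deadline cancellation cannot drop
// the cursor advance, then treat that cancellation as a clean end-of-pass.
saveCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if err := persister.Save(saveCtx, shard, cursor); err != nil {
	return err
}
if errors.Is(drainErr, context.Canceled) || errors.Is(drainErr, context.DeadlineExceeded) {
	drainErr = nil // idle subscription hit the pass boundary; not a shard error
}
```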

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
7b8647e8bc fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476)
The s3tests workflow (.github/workflows/s3tests.yml) backgrounds
`weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'`
and then runs `kill -0 $pid` to confirm the worker stayed alive.
The PR-9475 restore ran dailyrun.Run once and exited cleanly — even
faster when no buckets had lifecycle rules yet ("nothing to run").
The aliveness check then failed and the s3tests job died with
"lifecycle worker died on startup". Caught on
https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401.

Fix:

  - -refresh now drives an inter-pass loop. cadence=0 (default) is
    one-shot, matching the test/s3/lifecycle/ integration-test
    invocation that omits -refresh and expects synchronous return.
    cadence>0 (the CI case) keeps the command alive until -runtime
    expires, running a fresh dailyrun.Run on every tick.
  - Each iteration re-loads bucket configs via
    scheduler.LoadCompileInputs so rules created mid-run (the s3tests
    flow creates rules AFTER the worker starts) get picked up.
  - The "no rules; nothing to run" early return is gone — the
    command stays alive even with an empty initial snapshot, waiting
    for tests to add rules.
  - -dispatch, -checkpoint, -bootstrap-interval stay accepted-but-
    ignored (legacy streaming flags).
2026-05-12 19:19:46 -07:00
Chris Lu
4ce027c2f3 fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475)
fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper

PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the
premise that it was a debug-only tool. It wasn't: the s3tests CI
workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/
integration tests invoke it via `weed shell` to drive lifecycle
expirations on demand. Both started failing with
"unknown command: s3.lifecycle.run-shard".

This PR restores the command with the same flag set so existing
callers (CI scripts and integration tests) work unchanged. The
implementation no longer drives the streaming dispatcher.Pipeline +
scheduler.BucketBootstrapper (deleted) — instead it does one bounded
dailyrun.Run pass through the same daily-replay code path the
production worker exercises. The walker fires for walker-bound
rules just like in the worker.

Obsolete streaming flags (-dispatch / -checkpoint / -refresh /
-bootstrap-interval) are accepted-but-ignored so existing scripts
don't need to drop them.
2026-05-12 18:29:18 -07:00
Chris Lu
ce5768fab1 feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash (#9473)
* feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash

dailyrun.Config.RetentionWindow has been wired since Phase 4b but the
handler never supplied a value, so runShard always fell back to
maxTTL and engine.PromotedHash hashed nothing. The partition-flip
recovery trigger was dormant by design "until the handler plumbs the
real meta-log retention here."

This PR plumbs it via a new admin form field:
  Meta-Log Retention (days) — 0 = unbounded (current behavior).

When set, ParseConfig converts days to a time.Duration on
cfg.MetaLogRetention. The handler passes it as
dailyrun.Config.RetentionWindow, which runShard then feeds to
engine.PromotedHash. Rules whose TTL exceeds the declared window
land in the walk partition; the next time an operator shrinks
retention so a previously replay-eligible rule slips past it,
PromotedHash mismatches → recovery branch fires → walker re-evaluates
the rule across the whole filer tree.

0 stays the default, so existing deployments see no behavior change.

* chore(s3/lifecycle): rephrase days->duration conversion

gemini-code-assist flagged the original form as a compile error,
which it wasn't (time.Duration is a named int64 and supports * with
other time.Durations — the test suite verified the value was
correct). The suggested form is more idiomatic regardless:
days*24 happens in int64 space before the lift to time.Duration,
so the unit is unambiguous.
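
For reference, the settled form:

```go
// days*24 happens in int64 space; the single lift to time.Duration keeps
// the unit unambiguous.
cfg.MetaLogRetention = time.Duration(days*24) * time.Hour
```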
2026-05-12 18:26:52 -07:00
Chris Lu
f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
2026-05-12 18:22:21 -07:00
Chris Lu
43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
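
The restored guard, sketched (loop shape and logger are assumptions):

```go
for _, entry := range whiteList {
	if strings.Contains(entry, "/") {
		_, cidrnet, err := net.ParseCIDR(entry)
		if err != nil {
			glog.Errorf("invalid CIDR %q in white list: %v", entry, err)
			continue // without this, a nil *net.IPNet reaches IsWhiteListed
		}
		whiteListCIDR[entry] = cidrnet
	}
}
```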
2026-05-12 16:00:44 -07:00
Chris Lu
f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu
10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (a helpers file for the
scheduler/configload tests, restoring what the deleted bootstrap_test.go
used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
745e864bda feat(s3/lifecycle): retire algorithm flag, daily_replay is the only path (Phase 5a) (#9465)
feat(s3/lifecycle): remove algorithm flag, daily_replay is the only path (Phase 5a)

With Phase 4b on master the daily_replay path covers every rule kind
and the streaming algorithm has no remaining responsibilities. This
PR retires the algorithm flag from the worker:

  - Drop the "Algorithm" enum field from AdminConfigForm and its
    DefaultValues entry.
  - Drop the if/else routing in Execute — every Execute call now
    routes straight into executeDailyReplay.
  - Drop the streaming-only worker fields (DispatchTick,
    CheckpointTick, RefreshInterval, BootstrapInterval) and their
    matching form fields. None of them are read by the daily_replay
    path; keeping them in the form would suggest tuning knobs that
    don't do anything.
  - Drop AlgorithmStreaming / AlgorithmDailyReplay constants and the
    Config.Algorithm field.

The streaming-path packages (s3lifecycle/scheduler, s3lifecycle/dispatcher)
remain on the tree; they're now reachable only by the
weed shell s3.lifecycle.run-shard debug command and the few helpers
(LoadCompileInputs, FilerStore, FilerSiblingLister) the daily_replay
worker still uses. Phase 5b deletes the dead code.

Tests prune the cadence-default assertions to the single remaining
field (max_runtime_minutes).
2026-05-12 12:39:37 -07:00
Chris Lu
2f682303fb fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464)
Follow-up to PR #9459 (merged before this fix landed). The walker
dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and
"NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay
drain (processMatches in run.go via #9462) use "RPC_ERROR" for the
same condition. Aligning so a single Prometheus query covers all
three delete paths.

Folds nil-response under RPC_ERROR rather than a separate label —
operationally it's the same class of failure (server returned no
usable response).
2026-05-12 12:38:52 -07:00
Parviz Miriyev
2212cc8a5f shell: expose retention flags on mq.topic.configure (#9416)
* shell: expose retention flags on mq.topic.configure

The ConfigureTopicRequest proto already carries a TopicRetention message
({retention_seconds, enabled}), and GetTopicConfigurationResponse / the
admin UI both surface it — but the `mq.topic.configure` shell command
sent Retention: nil unconditionally, leaving CLI and IaC users with no
way to set retention without going through the admin UI. Add three
flags so the shell matches the existing surface:

  -retention <duration>     Go duration string (e.g. 168h, 30m)
  -retentionSeconds <int>   raw seconds (mutually exclusive with -retention)
  -retentionEnabled         bool toggle for retention enforcement

Behavior:
- If none of the retention flags are set, the request omits Retention
  (`nil`) — preserving the prior "leave server-side state alone"
  semantics for callers that only care about partition count.
- Setting any retention flag populates TopicRetention with both fields
  so existing operators keep working when only one is provided.
- -retention and -retentionSeconds together is a hard error (ambiguous);
  negative values are rejected.

The new behavior is detected via flag.FlagSet.Visit so default values
of 0 / false are distinguishable from "not provided".
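
A sketch of the Visit-based detection (the FlagSet variable is an
assumption; flag names are the ones above):

```go
// flag.FlagSet.Visit only visits flags that were explicitly set, so 0 /
// false defaults are distinguishable from user-provided values.
provided := map[string]bool{}
fs.Visit(func(f *flag.Flag) { provided[f.Name] = true })
if provided["retention"] && provided["retentionSeconds"] {
	return fmt.Errorf("-retention and -retentionSeconds are mutually exclusive")
}
retentionTouched := provided["retention"] || provided["retentionSeconds"] || provided["retentionEnabled"]
```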

Help text updated with examples for both setting a TTL and disabling
retention on an existing topic. No proto / server changes needed; this
is a CLI-only patch.

* broker, shell: address review for mq.topic.configure retention

Three follow-up fixes for the review comments on the original commit:

1. Server-side: detect retention changes in the early-return path.
   Previously ConfigureTopic returned without persisting when
   partitionCount + schema were unchanged, even if the request supplied
   a different Retention. Now the early-return branch checks both
   schema and retention, persists either or both when they differ, and
   logs which fields actually changed.

2. Server-side: preserve existing retention when request.Retention is
   nil in the fresh-allocation path. Capture the prior resp.Retention
   before overwriting `resp = &mq_pb.ConfigureTopicResponse{}`, and
   only overwrite the new resp.Retention with `request.Retention` when
   the request actually supplies one. Otherwise carry the previous
   retention forward, so partition-count changes (or any path that
   bypasses the early-return branch) don't accidentally clear retention.

3. CLI: switch the `-retention` / `-retentionSeconds` mutual-exclusion
   check from value-based (`!= 0`) to flag.FlagSet.Visit-based, so an
   explicit `-retention=0 -retentionSeconds=N` is also rejected.

4. CLI: when any retention flag is set, fetch the current
   GetTopicConfiguration and fill in the fields the user didn't
   explicitly provide. This means `-retentionEnabled` alone no longer
   zeros the existing duration, and `-retention 24h` alone preserves
   the existing enabled flag. When the topic doesn't exist yet, the
   GetTopicConfiguration error is treated as "no current state" and
   we proceed with just the user-supplied values.

Build / vet / gofmt / `go test ./weed/shell/... ./weed/mq/broker/...`
clean.
2026-05-12 12:37:09 -07:00
Lars Lehtonen
89608e5499 chore(weed/util/log_buffer): remove unused functions (#9444) 2026-05-12 12:36:16 -07:00
Chris Lu
21054b6c18 volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.

For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.

Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.

The gate also tolerates a nil guard (development / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
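
The fail-closed shape reduces to something like this sketch (helper name
and plumbing are illustrative, not the shipped Guard code):

  import "net"

  // isAdminAuthorized: non-TCP peers (in-process, unix socket) are trusted
  // by construction; with an empty whitelist only loopback TCP peers pass.
  func isAdminAuthorized(whitelist []*net.IPNet, peer net.Addr) bool {
      if peer.Network() != "tcp" {
          return true // shares our OS process; cannot be spoofed remotely
      }
      host, _, err := net.SplitHostPort(peer.String())
      if err != nil {
          return false
      }
      ip := net.ParseIP(host)
      if ip == nil {
          return false
      }
      if len(whitelist) == 0 {
          return ip.IsLoopback() // fail closed: reject off-host callers
      }
      for _, cidr := range whitelist {
          if cidr.Contains(ip) {
              return true
          }
      }
      return false
  }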
2026-05-12 12:35:27 -07:00
Chris Lu
495632730c feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log

Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds the
three baseline questions: how long does a shard take, how many events
did it scan, and what did dispatch produce.

  - S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
    wall-clock per shard. p95 climbing toward MaxRuntime means the
    shard is brushing its budget.
  - S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
    events drainShardEvents processed. Pairs with the duration so a
    spike in events-per-shard correlates with a slow shard.
  - S3LifecycleDispatchCounter (existing, reused): processMatches now
    increments this with the outcome label, so streaming and
    daily_replay paths share one outcome view. Transport errors are
    counted under outcome="TRANSPORT_ERROR".

dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.

Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.

* fix(s3/lifecycle): align dispatch error label + clean test labels

Two PR-9462 review fixes from gemini:

1. processMatches' transport-failure label was "TRANSPORT_ERROR";
   streaming's dispatcher uses "RPC_ERROR" for the same condition
   (see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
   the same Prometheus query covers both delete paths.

2. The dispatch-counter assertion test now deletes its label row
   on exit so the in-process Prometheus registry doesn't accumulate
   per-test state across the suite.
2026-05-12 12:15:20 -07:00
Chris Lu
f954781169 feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config

Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.

Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.

* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker

Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.

Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.

Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.

* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch

Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).

Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.

Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.

* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard

Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.

Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.
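
The first-error-wins loop reduces to this shape (the bucket list and
per-bucket walk function are stand-ins for the real WalkBuckets wiring):

  import (
      "context"
      "log"
  )

  func walkBuckets(ctx context.Context, buckets []string,
      walkOne func(context.Context, string) error) error {
      var firstErr error
      for _, bucket := range buckets {
          if ctx.Err() != nil {
              return ctx.Err() // cancellation short-circuits between buckets
          }
          if err := walkOne(ctx, bucket); err != nil {
              log.Printf("walk bucket %s: %v", bucket, err) // log, keep going
              if firstErr == nil {
                  firstErr = err // first bucket error wins
              }
          }
      }
      return firstErr
  }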

Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.

* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker

Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.

listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.

Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).

* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc

Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:

  - Path = logical key (not the .versions/<file> physical path), so
    bootstrap.Walk's MatchPath uses the user's intended path.
  - VersionID per sibling (version_id or "null").
  - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
    to explicit-null-bare, falling back to newest-by-mtime.
  - NoncurrentIndex rank computed against the latest's position.
  - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
    previous-newer sibling's mtime (legacy derivation).
  - IsDeleteMarker from ExtDeleteMarkerKey.
  - NumVersions = len(siblings).

Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.

expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.

MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.

Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.

* feat(s3/lifecycle): emit MPU init records from FilerListFunc

Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.

isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.

Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.

* feat(s3/lifecycle): wire walker into executeDailyReplay

Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.

Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.

Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.

* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate

Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow that PromotedHash used, so the partition
is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.

With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.

* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment

Two PR-review fixes on 9459:

1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
   add a defensive nil-response check so the walk halts cleanly
   instead. Spotted by coderabbit.

2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
   "stays empty" in fallback mode, which gemini correctly pointed out
   is only true once rules are active. During bootstrap (rules
   compiled but IsActive=false) MaxEffectiveTTL is 0 while
   PromotedHash counts every non-disabled rule, so promoted becomes
   non-empty and the next post-activation run hits the recovery
   branch. That's the intended bootstrap walk — rewrite the comment
   to explain it rather than misstate the invariant.

Test: pins nil-response → error path on WalkerDispatcher.

* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion

Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.

Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.
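
A sketch of the pointerResolved discipline, with types simplified (the
real code also runs the explicit-null-bare check inside the fallback
branch):

  type sibling struct {
      VersionID string
  }

  // resolveLatest assumes siblings are sorted newest-first. A stale pointer
  // (latestID set but matching no scanned sibling) now falls through to the
  // no-pointer branch explicitly instead of keeping latestPos by accident.
  func resolveLatest(siblings []sibling, latestID string) int {
      latestPos, pointerResolved := 0, false
      if latestID != "" {
          for i, s := range siblings {
              if s.VersionID == latestID {
                  latestPos, pointerResolved = i, true
                  break
              }
          }
      }
      if !pointerResolved {
          latestPos = 0 // newest sibling wins (index 0 when newest-first)
      }
      return latestPos
  }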

Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).

* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher

Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.

Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
2026-05-12 11:39:15 -07:00
Chris Lu
69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
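
The rebinding pin comes down to re-running the deny list inside the
transport's DialContext; a sketch under assumed deny rules (isDeniedIP
here is a stand-in for the full loopback/link-local/IMDS/RFC 1918/CGNAT
list):

  import (
      "context"
      "fmt"
      "net"
      "net/http"
      "time"
  )

  func isDeniedIP(ip net.IP) bool {
      return ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast()
  }

  // newPinnedTransport re-resolves the host and re-applies the deny list on
  // every dial, so a name that flips to 127.0.0.1 after validation is refused.
  func newPinnedTransport() *http.Transport {
      d := &net.Dialer{Timeout: 10 * time.Second}
      return &http.Transport{
          DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
              host, port, err := net.SplitHostPort(addr)
              if err != nil {
                  return nil, err
              }
              ips, err := net.DefaultResolver.LookupIP(ctx, "ip", host)
              if err != nil {
                  return nil, err
              }
              for _, ip := range ips {
                  if isDeniedIP(ip) {
                      return nil, fmt.Errorf("refusing internal endpoint %s (%s)", host, ip)
                  }
              }
              // Dial the vetted IP directly so a second resolution can't
              // produce a different answer than the one just checked.
              return d.DialContext(ctx, network, net.JoinHostPort(ips[0].String(), port))
          },
      }
  }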
2026-05-12 10:11:20 -07:00
Chris Lu
5e8f99f40a filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires
a Bearer token in the authorization metadata, signed with the filer
write-signing key. The service refuses to register on a filer that has
no jwt.filer_signing.key set, so the unauthenticated default is gone:
operators who use these RPCs must configure the key and attach a token
on every call.

Bearer scheme matching is case-insensitive (RFC 6750), every handler
nil-checks req before dereferencing it, and tests now cover the
expired-token path.
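
The case-insensitive scheme match is the subtle part; a minimal sketch,
assuming the token travels in the standard authorization metadata key:

  import (
      "errors"
      "strings"

      "google.golang.org/grpc/metadata"
  )

  // bearerFromMD extracts the token with a case-insensitive "Bearer" scheme
  // match, per RFC 6750 ("bearer", "Bearer", "BEARER" are all accepted).
  func bearerFromMD(md metadata.MD) (string, error) {
      vals := md.Get("authorization")
      if len(vals) == 0 {
          return "", errors.New("missing authorization metadata")
      }
      const scheme = "bearer "
      if len(vals[0]) <= len(scheme) || !strings.EqualFold(vals[0][:len(scheme)], scheme) {
          return "", errors.New("authorization is not a Bearer token")
      }
      return vals[0][len(scheme):], nil
  }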
2026-05-12 10:11:08 -07:00
Chris Lu
05ed5c9ae8 filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token
scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar
sibling paths. Match on /-separated path components after path.Clean
normalisation instead.
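
The component-scoped check reduces to this shape (a sketch, not the
filer's exact helper):

  import (
      "path"
      "strings"
  )

  // hasPathPrefix matches on /-separated component boundaries after
  // path.Clean: /tenant1 matches /tenant1 and /tenant1/x, never /tenant1234.
  func hasPathPrefix(p, prefix string) bool {
      p = path.Clean("/" + p)
      prefix = path.Clean("/" + prefix)
      if prefix == "/" {
          return true
      }
      return p == prefix || strings.HasPrefix(p, prefix+"/")
  }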
2026-05-12 10:10:48 -07:00
Chris Lu
18677a8430 fix(storage): refuse to load .vif-only entry as regular volume when .ecx exists (#9448) (#9461)
fix(storage): refuse to load .vif-only entry as regular volume when .ecx exists

Defensive root-cause fix for issue #9448. The detection-side guard in
the EC plugin worker already breaks the infinite re-encode loop, but
the underlying volume-server state — `.vif` preserved next to EC shards
when the source replica is destroyed — can still re-arm a phantom
regular volume via the MountVolume / LoadVolume path.

In loadExistingVolume, the existing `.ecx`-present check is gated on
the caller passing `skipIfEcVolumesExists=true`. The startup-scan path
(concurrentLoadingVolumes) does pass that flag and correctly skips
the `.vif`. The LoadVolume → loadExistingVolume(…, false, …) path
used by VolumeMount does NOT, so it falls through to NewVolume, which
calls v.load with createDatIfMissing=true and creates a phantom
empty `.dat`. The master then reports the volume as regular and EC
detection re-proposes it.

Hoist the `.ecx`-present check so it runs unconditionally for `.vif`
entries: if the EC index is on the disk, the `.vif` belongs to those
EC shards, never to a regular volume that should be resurrected.
Pure OSS clusters never reach this exact state today (OSS deletes
`.vif` unconditionally during Destroy), but the guard hardens the
load path against any future path that leaves the same state.

Test:
- TestLoadExistingVolumeSkipsVifWhenEcxPresent builds the exact
  post-#9448 disk layout (`.vif` + `.ecx`, no `.dat`) and asserts
  loadExistingVolume(skipIfEcVolumesExists=false) returns false,
  does not create a placeholder `.dat`, and does not register a
  phantom volume in l.volumes.
2026-05-12 09:30:42 -07:00
Chris Lu
d221a64262 fix(ec): skip re-encode when EC shards already exist for the volume (#9448) (#9458)
* fix(ec): skip re-encode when EC shards already exist for the volume (#9448)

When an earlier EC encoding succeeded but the post-encode source-delete
left a regular replica behind on one of the servers, the next detection
cycle proposes the same volume again. The new encode tries to redistribute
shards to targets that already have them mounted, the volume server
returns `ec volume %d is mounted; refusing overwrite`, the task fails,
and detection re-queues the volume. The cycle repeats forever — issue
#9448.

The existing `metric.IsECVolume` skip catches the case where the canonical
metric is reported on the EC-shard side of the heartbeat, but when the
master sees BOTH a regular replica AND its EC shards in the same volume
list, the canonical metric we pick is the regular replica and
IsECVolume is false. Add a second guard that checks the topology
directly via `findExistingECShards` (already present and indexed) and
skip the volume when any shards exist, logging a warning that points
the admin at the stuck source.

This breaks the loop. Auto-cleanup of the orphaned replica is left as
follow-up work — deleting a source replica from inside the detector is
only safe with a re-verification step right before the delete, plus a
config opt-in, and is best done in its own change.

* fix(ec): #9448 guard only fires when EC shard set is complete

The first version of the #9448 guard tripped on `len(existingShards) > 0`,
which is broader than necessary. The existing recovery branch in the
encode arm (around the `existingECShards` block, ~line 216) is designed
to fold partial leftover shards from a previously failed encode into
the new task as cleanup sources. Skipping unconditionally on any
existing shards made that branch dead code, regressing the recovery
behavior Gemini flagged in the review of af09e1ec7.

Two corrections:

  1. New helper `countExistingEcShardsForVolume` walks each disk's
     `EcIndexBits` bitmap and ORs the results into a `ShardBits`,
     returning the distinct-shard popcount. This is the right unit:
     a single `VolumeEcShardInformationMessage` can carry several
     shards, so `len(EcShardInfos)` is not the same as the number
     of present shards. Per Gemini's "use helper functions that walk
     the actual shard bitmap" note (sketched after this list).
  2. The guard now fires only when `shardCount >= totalShards`.
     Partial shard sets fall through to the existing recovery branch,
     unchanged.
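
The distinct-shard count described in item 1 looks roughly like this
(ShardBits is a local stand-in for the erasure-coding bitmap type):

  import "math/bits"

  type ShardBits uint32 // bit i set = shard i present somewhere

  // countDistinctShards ORs per-disk EcIndexBits bitmaps into one set and
  // popcounts it. len(infos) is NOT the answer: one info entry can carry
  // several shards on one disk.
  func countDistinctShards(perDiskEcIndexBits []uint32) int {
      var all ShardBits
      for _, b := range perDiskEcIndexBits {
          all |= ShardBits(b)
      }
      return bits.OnesCount32(uint32(all))
  }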

Tests:
  - TestDetectionSkipsWhenECShardsAlreadyExist: complete shards →
    no proposal (the regression test for #9448 itself, unchanged
    intent, rewritten on top of new helpers).
  - TestDetectionAllowsRegularReplicaWhenShardsPartial: partial
    shards → guard does NOT swallow the volume; the encode arm
    still gets a chance.
  - TestCountExistingEcShardsForVolume: the helper walks the
    bitmap correctly even when one info entry packs multiple
    shards on one disk.

The dangerous `volume.delete` hint in the warning is unchanged for
now — it gets fixed in the next commit.

* fix(ec): drop dangerous shell-command hint from #9448 warning

The previous warning told operators to run `volume.delete -volumeId=%d`
in the SeaweedFS shell to clean up the orphaned source replica. That
command is cluster-wide — it deletes every replica of the volume,
including the EC shards, which share the same volume id. Running it
in the state the message describes would cause the data loss the
guard exists to prevent.

Replace it with explicit guidance that the cleanup must be a targeted
VolumeDelete RPC against the source server only, and that the
shell command is the exact wrong thing to use here. The next two
commits add the plumbing and the auto-execution of that targeted
delete so most operators never see this hint at all.

Per Gemini comment on af09e1ec7.

* feat(worker): plumb grpc dial option through ClusterInfo

Add ClusterInfo.GrpcDialOption (optional) and set it in the
erasure_coding plugin handler. Lets the detector make targeted
gRPC calls during detection — used by the follow-up commit to
auto-clean orphan source replicas via VolumeDelete RPCs.

Zero-value safe: existing detectors that don't need RPC access
get a nil DialOption and ignore the field.

* feat(ec): auto-clean orphan source replica via targeted VolumeDelete

Builds on the previous commits: the guard now identifies the
#9448 stuck-source state and a gRPC dial option is available on
ClusterInfo. When both are true, detection auto-cleans the
orphaned regular replica instead of just warning the operator.

New helper `cleanupOrphanSourceReplicas`:

  1. Re-verifies the EC shard set is still complete via
     `countExistingEcShardsForVolume` against the live topology
     snapshot. If the count dropped between detection start and
     the cleanup decision (a volume server going down mid-cycle),
     it aborts — the source replica is the only complete copy and
     deleting it without a healthy shard set would be data loss.
  2. Issues targeted VolumeDelete RPCs to each regular-replica
     server via `operation.WithVolumeServerClient`. That RPC only
     touches the regular volume on the targeted server; EC shards
     live in a separate store path and are not affected. This is
     the safe alternative to the cluster-wide `volume.delete`
     shell command we previously warned against.

If the cleanup partially fails (one replica delete errors, others
succeed), detection logs the failure and continues to skip the
volume. The next detection cycle will try again. We deliberately
don't fall back to a re-encode because that would just collide
with the mounted shards on the targets again.

When no dial option is available the existing warning still
points operators at the safe manual procedure.
2026-05-11 23:12:57 -07:00
Chris Lu
644664bbee feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457)
* feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a)

Replace the local replay-content-hash / max-effective-TTL helpers in
dailyrun with the engine package's canonical versions (ReplayContentHash,
MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface.

Adds PromotedHash to the cursor's recovery triggers: a partition flip
(rule moving between replay and walk because retention shifted) now
fires the rule-change branch alongside RuleSetHash mismatch. The
retentionWindow is set to MaxEffectiveTTL today, which keeps the
promoted set empty and the trigger dormant; Phase 4b will plumb the
real meta-log retention boundary so true scan_only promotions are
detected.

Cursor schema is unchanged — PromotedHash was already persisted as
the zero hash in Phase 2.

* docs(s3/lifecycle): note the one-time cursor rewind on hash format change

gemini-code-assist flagged that swapping localReplayContentHash for
engine.ReplayContentHash changes the persisted RuleSetHash byte layout
(sort order + tagged-field encoding). Phase-2 cursors mismatch on first
post-upgrade run and drop into the rule-change branch.

Going with option 3 (document the intentional one-time rewind). The
rewind is bounded to runNow - maxTTL (not time-zero), self-healing on
the next save, and daily_replay is off by default so the affected
population is limited to early adopters of the algorithm flag. A
migration shim or a hash-compat layer would carry the legacy encoder
forever for one bounded re-scan; not worth it.

Comment in runShard makes the trade explicit so a future reader doesn't
hunt for the "why does my cursor rewind once after upgrade" mystery.

* chore(s3/lifecycle): trim verbose comments in dailyrun

Cut multi-paragraph headers and narration that just described what the
code does. Kept the small WHY notes (per-match skip vs per-rule, the
one-time post-upgrade cursor rewind, scan_only rejection rationale).
Same behavior, ~150 fewer lines of comment.

* fix(s3/lifecycle): persist PromotedHash on the successful runShard save

The comment-trim pass dropped the field alongside a "stays empty in
Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b
turns promoted into a real value — and a save that writes zero would
make the next run falsely detect drift and rewind. Spotted by
gemini-code-assist on PR 9457.

Other save paths (recovery, drain-error) already persisted it; the
success path is the only one that was missing it. Now consistent.
2026-05-11 21:18:19 -07:00
Chris Lu
532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, master heartbeats wrongly reported under the target
disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.
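
The normalization folds to a few lines (a sketch mirroring the described
behavior, not the shipped helper):

  import "strings"

  // normalizeDiskType makes "", "HDD", and "hdd" compare equal: the empty
  // string is the HardDriveType sentinel and folds to "hdd".
  func normalizeDiskType(t string) string {
      if t == "" {
          return "hdd"
      }
      return strings.ToLower(t)
  }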

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
2026-05-11 20:21:50 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills (see the sketch after this list).
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).
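
The metadata-to-limiter step sketches as follows (map keys illustrative;
the real constants live in cluster_rate_limit.go):

  import (
      "strconv"

      "golang.org/x/time/rate"
  )

  func limiterFromMetadata(md map[string]string) *rate.Limiter {
      rps, err := strconv.ParseFloat(md["cluster_deletes_per_second"], 64)
      if err != nil || rps <= 0 {
          return nil // missing / invalid / non-positive: legacy unlimited
      }
      burst, err := strconv.Atoi(md["cluster_deletes_burst"])
      if err != nil || burst <= 0 {
          burst = int(2 * rps) // default burst: 2 x rate
      }
      if burst < 1 {
          burst = 1 // never hand out a bucket that can't refill
      }
      return rate.NewLimiter(rate.Limit(rps), burst)
  }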

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95 — near-zero means the cap
  isn't binding, a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.
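
The divisor rule in one small function (a sketch of the arithmetic, not
the decorator itself):

  // shareDivisor divides the cluster budget across the set that can
  // actually run concurrently, clamped to at least 1.
  func shareDivisor(capableExecutors, maxJobsPerDetection int) int {
      d := capableExecutors
      if maxJobsPerDetection < d {
          d = maxJobsPerDetection
      }
      if d < 1 {
          d = 1
      }
      return d
  }

For s3_lifecycle (maxJobs=1, 4 executors) this yields 1, so the single
active worker gets the full budget; for maxJobs=4 it yields 4 and the
budget splits evenly.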

Tests swap the SharedEvenly case for two pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
3f4cb6d2fb feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine) (#9447)
* feat(s3/lifecycle/engine): daily-replay view surface (Phase 4 engine)

Adds the engine-side API the new daily-replay worker reaches for:
per-view snapshot construction (RulesForShard, RecoveryView), the two
cursor hashes that gate recovery (ReplayContentHash, PromotedHash),
and the cursor sliding-window helper (MaxEffectiveTTL). CurrentSnapshot
is a stub keyed on a package-level atomic that the worker startup wiring
populates.

Views return new *Snapshot instances holding cloned *CompiledAction
values so per-clone active/Mode never leak across partitions. Replay
clones force Mode=ModeEventDriven to rehabilitate any persistent
ModeScanOnly carried over from PriorState; walk and recovery clones
preserve Mode as-is. Disabled actions are excluded from all views.
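
The clone discipline in miniature (types heavily simplified; Mode values
are stand-ins for the engine's constants):

  type CompiledAction struct {
      Mode     string
      Disabled bool
  }

  // replayView hands each view its own copies, so the Mode rewrite below
  // can never leak into the walk or recovery views.
  func replayView(actions []*CompiledAction) []*CompiledAction {
      out := make([]*CompiledAction, 0, len(actions))
      for _, a := range actions {
          if a.Disabled {
              continue // disabled actions are excluded from all views
          }
          c := *a                 // clone, never alias the shared snapshot
          c.Mode = "event_driven" // replay clones force ModeEventDriven
          out = append(out, &c)
      }
      return out
  }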

No production caller is wired here — Phase 4's walker/dailyrun
integration is the follow-up. dailyrun's local helpers
(localReplayContentHash, localMaxEffectiveTTL) become one-line
redirects to these exports.

API surface:
- CurrentSnapshot() *Snapshot — stub until Phase 4 wiring.
- SetCurrentEngine(*Engine) — Phase 4 wiring entry point.
- Snapshot.RulesForShard(shardID, retentionWindow) (replay, walk *Snapshot)
- RecoveryView(s *Snapshot) *Snapshot — force-active over the full set.
- ReplayContentHash(s *Snapshot) [32]byte — partition-independent.
- PromotedHash(s *Snapshot, retentionWindow) [32]byte — partition-flip.
- MaxEffectiveTTL(s *Snapshot) time.Duration — over active replay only.

30 unit tests covering clone isolation, Mode rewrite, partition
membership including the multi-action-kind XML rule split,
RecoveryView activating pre-BootstrapComplete actions,
ReplayContentHash partition-independence, PromotedHash sensitivity to
promotion in either direction, MaxEffectiveTTL aggregation. Build +
race-tests green.

* refactor(s3/lifecycle/engine): consolidate hash helpers; clarify shardID semantics

Addresses PR #9447 review feedback. Three medium-priority items from
gemini, all code-quality refinements (no behavior change):

1. Duplicated sort comparator between ReplayContentHash and
   PromotedHash. Extract sortHashItems shared helper so the two
   hashes use the same ordering by construction — if one drifted, the
   cursor could see a spurious "rule changed" on a no-op snapshot
   rebuild.

2. Duplicated writeField/writeInt closures. Extract hashWriter struct
   holding the sha256 running hash + lenbuf, with method helpers.
   Same allocation profile (one Hash, one tiny stack buffer per
   helper); just deduplicates ~20 lines.

3. shardID parameter on RulesForShard is unused. Per the design's
   open question, every shard sees every rule today (shard filter
   runs at the entry-iteration site, not view construction). Keep
   the parameter for API stability — removing it now would force
   a breaking change when bucket-shard ownership lands — and update
   the doc comment to explain why it's reserved.

go build ./... clean; engine test suite green.
2026-05-11 18:07:54 -07:00
Chris Lu
122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)
* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
  under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
  Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
  replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s
  (sketched after this list); server outcomes (RETRY_LATER / BLOCKED)
  halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
  exist, only the hash gates a future addition.

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
  default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
  current scheduler.Scheduler, daily_replay routes through
  dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
  same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
  4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
  them with one-line redirects; the local versions hash the same
  fields so the cursor stays valid across the swap.

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
  due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
   persisted cursor to lastOK = TsNs of the last event processed,
   including events whose matches were skipped as not-yet-due. Those
   skipped matches would never be re-scanned, so objects under
   long-TTL rules would never expire.

   Track a "stuck" flag in drainShardEvents: the first event with a
   skipped (future-DueTime) match stops cursorAdvanceTo from rising,
   but the loop keeps processing later events to dispatch any that ARE
   due. The persisted cursor sits at the last fully-processed event so
   tomorrow's run re-scans from the skipped event onward and the
   future-due matches get re-evaluated when they age in.

   processMatches now returns (skippedAny, halted, err) so the drain
   loop can distinguish "event fully drained" from "event had pending
   future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
   ActionKind. A replay-eligible kind with Mode != ModeEventDriven
   (e.g. ModeScanOnly via retention promotion) passed the check but
   then got silently ignored by router.Route, which gates dispatch
   on Mode == ModeEventDriven. Reject loudly with the typed error
   so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
   goroutines regardless of cfg.Workers. Add a Workers field to
   dailyrun.Config and gate the goroutine fan-out on a semaphore of
   that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
   snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
   shard. Mid-run Compile would let shards process different rule
   sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
   accepted a `now func() time.Time` and called it multiple times.
   DueTime comparisons would slip as the run wore on. Capture runNow
   once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
   broke out of the loop and returned nil, marking interrupted runs as
   successful. Return ctx.Err() instead so the caller propagates the
   interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
   accepted empty files, mismatched shard_ids, and hash slices shorter
   than 32 bytes (copy() would zero-pad). Each now returns a typed
   error so the run halts and an operator investigates rather than
   silently re-scanning from time zero or persisting a zero-padded
   hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
   guard in runShard was unreachable because drainShardEvents
   initialized lastOK := startTsNs and only ever raised it. Removed
   along with the new cursor-advancement semantics that handle the
   "no events processed" case implicitly.

Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
  actions with Days <= 0 don't compile to a CompiledAction (see
  weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
  empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
  gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
  all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
  the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
  _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
  the new validation rules.
- Existing process-matches tests reshaped for the
  (skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.
2026-05-11 18:07:17 -07:00
Chris Lu
b2d24dd54f volume: require admin auth on BatchDelete (#9438)
Run BatchDelete through checkGrpcAdminAuth like the other destructive
volume-server RPCs (VolumeDelete, DeleteCollection, vacuum, EC, ...),
so a whitelist-configured server denies non-admin callers.
2026-05-11 13:50:48 -07:00
Chris Lu
2b21d19e4c volume: require admin auth on ReadAllNeedles and VolumeNeedleStatus (#9437)
Both RPCs hand out raw needle bytes / cookies. Run them through
checkGrpcAdminAuth like the rest of the volume-server admin handlers.
2026-05-11 13:50:19 -07:00
Chris Lu
46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.

Prerequisite for the daily-replay lifecycle worker (Phase 2+).
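
The read path with its fallback, sketched with assumed names (the
exported helper is SuccessorFromEntryStamp; this simplified form shows
only the precedence):

  import (
      "strconv"
      "time"
  )

  const extNoncurrentSinceNsKey = "Seaweed-X-Amz-Noncurrent-Since-Ns"

  // noncurrentSince prefers the explicit demotion stamp; a missing or zero
  // stamp falls back to the legacy next-newer-sibling mtime derivation.
  func noncurrentSince(extended map[string][]byte, siblingMtime time.Time) time.Time {
      if raw, ok := extended[extNoncurrentSinceNsKey]; ok {
          if ns, err := strconv.ParseInt(string(raw), 10, 64); err == nil && ns > 0 {
              return time.Unix(0, ns)
          }
      }
      return siblingMtime // pre-stamp entries keep working
  }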

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so
  two back-to-back calls can decrease if the wall clock steps
  backward — the prior test was exercising OS clock behavior rather
  than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clean break: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00
Chris Lu
6d12ebeefe fix(mount): fall through to filer when cached dir misses a tracked inode (#9436)
lookupEntry returned ENOENT whenever the metaCache had the parent marked
cached but the child entry was absent. That's only correct when the
kernel has no record of the path either — when inodeToPath still maps
it, the three layers disagree (#9139). In practice this triggers under
bursts of concurrent metadata ops, and after delete/rename events from
another mount drop the local entry without clearing the inode mapping;
the test flake fixed in b94ad8247 was the same shape on a smaller scale.

Trust the filer in that case: fall through to the existing
GetEntry path, which already loudly logs (Warningf with layer state)
when the filer also returns ErrNotFound, and otherwise serves the live
entry. Drop the Warningf from the cached-dir miss branch; it fires
thousands of times under 16-task rclone imports while the real error
path downstream covers the genuine-drift signal.
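
A minimal sketch of the corrected decision, with hypothetical helpers
standing in for metaCache, inodeToPath, and the filer client:

  package sketch

  import "errors"

  type Entry struct{ Name string }

  var ErrNotFound = errors.New("not found")

  // lookupEntry answers ENOENT only when the cache miss is corroborated
  // by the kernel holding no inode for the path; otherwise the filer is
  // the tie-breaker and serves (or loudly refuses) the live entry.
  func lookupEntry(path string, cachedDirMiss, inodeTracked bool,
      filerGet func(string) (*Entry, error)) (*Entry, error) {
      if cachedDirMiss && !inodeTracked {
          return nil, ErrNotFound // all three layers agree: gone
      }
      return filerGet(path) // layers disagree: trust the filer
  }
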
2026-05-11 11:50:37 -07:00
Chris Lu
514ba7a233 fix(master): route ec shard vids to NewEcVids on initial subscribe (#9435)
* fix(master): route ec shard vids to NewEcVids on initial subscribe

ToVolumeLocations appended EC shard volume IDs to NewVids, so a freshly
subscribing master client registered them in the regular-volume map via
addLocation instead of addEcLocation until the next heartbeat-driven
delta arrived. Append to NewEcVids to match the incremental path.

Fixes #9429

* fix(master): dedupe ec vids in initial subscribe snapshot

dn.GetEcShards returns per-(vid, diskId) entries, so a single EC volume
spread across multiple physical disks on one DataNode emitted the same
vid multiple times in NewEcVids. Dedupe so the snapshot carries each
vid once.
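
A minimal sketch of the snapshot-side dedupe (hypothetical helper; the
real code walks the topology's shard messages):

  package sketch

  // dedupeVids collapses the per-(vid, diskId) entries GetEcShards
  // yields into one vid each before appending to NewEcVids.
  func dedupeVids(shardVids []uint32) []uint32 {
      seen := make(map[uint32]struct{}, len(shardVids))
      out := make([]uint32, 0, len(shardVids))
      for _, vid := range shardVids {
          if _, dup := seen[vid]; dup {
              continue
          }
          seen[vid] = struct{}{}
          out = append(out, vid)
      }
      return out
  }
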
2026-05-11 10:56:26 -07:00
Chris Lu
6001d65206 perf(volume): stream-count the gzip size when no Content-MD5 is set (#9433)
ParseUpload runs util.DecompressData on every gzipped multipart upload
just to record OriginalDataSize. The decompress materializes the full
uncompressed slice via bytes.Buffer.ReadFrom inside util.GunzipStream;
for a 64 MiB chunk that's a ~128 MiB heap spike per call (geometric
growth). On 6-way concurrent UploadPartCopy the spike dominated the
remaining heap profile after #9420/#9421/#9422/#9424/#9425.

When no Content-MD5 verification is requested the uncompressed bytes
aren't needed — only the length is. Stream the gunzip through
io.Discard and count: the pooled gzip.Reader's working set replaces
the materialized slice.
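
A minimal sketch of the counting shape; the real path reuses a pooled
gzip.Reader rather than allocating one per call:

  package sketch

  import (
      "bytes"
      "compress/gzip"
      "io"
  )

  // uncompressedSize streams the gunzip output into io.Discard and
  // keeps only the running byte count; the uncompressed payload never
  // exists as a slice.
  func uncompressedSize(gzipped []byte) (int64, error) {
      zr, err := gzip.NewReader(bytes.NewReader(gzipped))
      if err != nil {
          return 0, err
      }
      defer zr.Close()
      // io.Copy uses a small scratch buffer internally; the working
      // set is the gzip window, not the payload size.
      return io.Copy(io.Discard, zr)
  }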

Unlike the previous attempt in #9426, the size still comes from the
real bytes, not from a client-set header.

  TotalAlloc per call, 4 MiB uncompressed body:
    materialize (was, still runs when MD5 is set):  ~16.8 MiB
    stream-count (no MD5):                             ~28 KiB

Refs #6541, #9426 (reverted in #9432).
2026-05-11 10:55:56 -07:00
Chris Lu
a64483885c fix(pb): skip Unix-socket gRPC registration on Windows (#9430) (#9434)
The same-host gRPC fast path registered /tmp/...sock paths and dialed
them with net.Listen("unix", ...) / net.Dial("unix", ...). Those paths
are POSIX-only, so on Windows the listener failed at startup and every
local gRPC call routed through the registered port lost its transport.

Gate RegisterLocalGrpcSocket to return early on Windows. ServeGrpcOnLocalSocket
and resolveLocalGrpcSocket already short-circuit when no socket is registered,
so all same-host RPCs fall back to TCP without touching any callsite.
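
A minimal sketch of the gate (hypothetical signature; the real
RegisterLocalGrpcSocket takes the repo's own arguments):

  package sketch

  import "runtime"

  func registerLocalGrpcSocket(register func()) {
      if runtime.GOOS == "windows" {
          // Unix domain sockets under /tmp are POSIX-only. Leaving the
          // socket unregistered makes the existing short-circuits route
          // every same-host RPC over TCP.
          return
      }
      register()
  }
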
2026-05-11 10:25:37 -07:00
Chris Lu
ac65c6c2ca revert(volume): drop X-Seaweedfs-Original-Size hint (#9432)
Reverts #9426. The header had the volume server record OriginalDataSize
from a value set by the multipart upstream — a client-controlled
metadata field. On a volume server that isn't JWT-protected, a caller
can lie and the needle stores the lie; bounds-checking the value
doesn't change the trust shape, only the magnitude of the lie. Derive
the size from the bytes again.

The optimization only fired on multipart Content-Encoding: gzip parts
(the s3 chunk-copy fast path), a narrow case that doesn't justify the
trust dependency. A future change can attack the same heap profile by
stream-decompressing to count bytes instead of materializing the
uncompressed slice — no client-trust surface.

Refs https://github.com/seaweedfs/seaweedfs/pull/9426#issuecomment-4417862793
2026-05-11 10:08:55 -07:00
Chris Lu
b456628a7a ui(s3_lifecycle): plain-English labels for cadence fields
Dispatch Tick / Cursor Checkpoint Tick / Engine Refresh / Bootstrap Re-walk
are internal terms — operators tuning the form had to read the descriptions
to guess what each field meant. Renames the visible labels and the section
blurb; underlying field names are unchanged so stored configs still load.
2026-05-10 23:46:12 -07:00
Chris Lu
8efa32258a feat(volume): X-Seaweedfs-Original-Size hint skips redundant gunzip (#9426)
* feat(volume): X-Seaweedfs-Original-Size hint skips redundant gunzip

The full-chunk gzip pass-through (#9425) fixed source-volume
decompression but moved the cost to the destination volume:
parseUpload still ran util.DecompressData on the forwarded gzipped
bytes, just to learn the uncompressed length so it could record
OriginalDataSize in the needle metadata. For 6-way concurrent 64 MiB
UploadPartCopy that decompress-and-discard pass dominated the
remaining heap profile after the streaming chain landed (~297 MiB
inuse via bytes.Buffer.ReadFrom inside util.GunzipStream).

Add an X-Seaweedfs-Original-Size header on the multipart part. When
the upstream sets it (the s3 chunk-copy fast path always knows the
uncompressed size — it's the source chunk's logical size) and no
Content-MD5 verification is requested (which would require
decompressed bytes to compute against), parseUpload uses the hint
directly and skips the decompress.

Header is X-* prefixed (not Seaweed-*) so it doesn't get auto-stored
as a needle pair by PairNamePrefix.

Backward compatible:
- old s3 servers don't set the header, parseUpload decompresses as
  before
- new s3 servers talking to old volumes: header is ignored, volume
  decompresses
- bad header values (non-numeric, negative, garbage) fall back to the
  existing decompress path

End-to-end repro impact (512 MiB src, 6 parallel UploadPartCopy,
post-#9420/#9421/#9422/#9424/#9425 baseline):

  RSS, round 2:        1149 MiB → 594 MiB
  heap inuse_space:     545 MiB → 349 MiB
  HeapSys:             1.35 GiB → 777 MiB
  TotalAlloc cum:        ~9 GiB → 3.5 GiB

Total reduction from pre-#9420 baseline: 3134 → 594 MiB (-81%).

Test exercises five matrix cases (hint+no-MD5,
hint+part-MD5, hint+req-MD5, no-hint, garbage-hint) and bounds
allocation per case so a regression that re-introduces the
unconditional decompress fails the hint-present-no-MD5 case.

* review: bound X-Seaweedfs-Original-Size by sizeLimit and uint32

CodeQL on PR 9426 traced a new taint flow: the strconv.Atoi(hint)
value flows into pu.OriginalDataSize -> originalSize -> the existing
uint32(originalSize) cast in volume_server_handlers_write.go:73. The
cast was always there but its input was previously bounded by the
ParseUpload read path (capped at sizeLimit). Adding a user-controlled
hint bypassed that bound, so a malicious header could overflow the
uint32 silently.

Bound the hint at parse time by sizeLimit (the largest needle this
volume will accept anyway) and by math.MaxUint32 (belt-and-suspenders
in case sizeLimit is configured > 4 GiB).
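
A minimal sketch of the parse-time bound (hypothetical helper name):

  package sketch

  import (
      "math"
      "strconv"
  )

  // parseSizeHint honors the header only when it parses and fits both
  // the volume's sizeLimit and uint32; anything else reports !ok and
  // the caller falls back to the decompress path.
  func parseSizeHint(hint string, sizeLimit int64) (int64, bool) {
      v, err := strconv.ParseInt(hint, 10, 64)
      if err != nil || v < 0 {
          return 0, false // non-numeric, negative, garbage
      }
      if v > sizeLimit || v > math.MaxUint32 {
          return 0, false // would overflow the downstream uint32 cast
      }
      return v, true
  }
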
2026-05-10 15:57:07 -07:00
Chris Lu
9a70bbfcc6 feat(s3api): full-chunk gzip pass-through skips volume-side decompress (#9427)
Building on the io.Pipe streaming chunk copy: when a copy operation
covers an entire source chunk (the common case for Harbor's
part-size = chunk-size assemble pattern), ask the source volume for
compressed bytes via Accept-Encoding: gzip and forward them to the
destination as-is.

This trades a Range fetch (where the volume decompresses the chunk
internally to satisfy the byte range) for a full-chunk fetch that
returns whatever wire bytes the chunk is stored as. For gzipped
chunks the source volume avoids the decompression entirely; we never
allocate a chunk-sized decompress buffer.

Implementation: build the source GET directly instead of going
through ReadUrlAsStream, because that helper auto-decompresses gzip
responses (which would defeat the point). Trust the response's
Content-Encoding header over caller hints — for partial ranges the
volume always returns raw bytes regardless of how the chunk is
stored, so labeling those as gzip would corrupt subsequent reads.
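
A minimal sketch of the direct fetch (hypothetical name). Setting
Accept-Encoding by hand matters because Go's transport only
transparently gunzips a response when it added that header itself:

  package sketch

  import (
      "context"
      "io"
      "net/http"
  )

  // fetchWireBytes returns the body exactly as stored, plus the
  // encoding the volume claims, which the caller trusts over its own
  // hint since partial ranges come back raw regardless of storage.
  func fetchWireBytes(ctx context.Context, url string) (io.ReadCloser, string, error) {
      req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
      if err != nil {
          return nil, "", err
      }
      req.Header.Set("Accept-Encoding", "gzip") // manual => no auto-gunzip
      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          return nil, "", err
      }
      return resp.Body, resp.Header.Get("Content-Encoding"), nil
  }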

End-to-end repro impact (512 MiB src, 6 parallel UploadPartCopy):
  + #9420/#9421/#9422       : 2236 MiB
  + io.Pipe streaming       : 1521 MiB
  + this commit             : 1149 MiB  (round 2 RSS, perfectly flat)

Round 3 now completes (was hitting volume-full before, since
chunks took up uncompressed space on disk; we now store the gzipped
chunks the volume gives us, which fit in the test's 8 GiB volume
budget).

Heap inuse_space (after force GC):
  before all: ~1.5 GiB
  this PR:    266 MiB

Volume-side bytes.Buffer.ReadFrom inuse:
  before:     611 MiB
  streaming:  571 MiB
  this PR:    297 MiB (now in destination-volume parseUpload's
                       size-hint decompression — separate
                       optimization opportunity for a hint header)
2026-05-10 14:55:59 -07:00
Chris Lu
4a04594826 feat(s3api): stream chunk copy via io.Pipe to cut peak working set (#9424)
* fix: cap pool retention so chunk-copy buffers don't hoard memory

Two pool-retention sites kept the runaway-RSS pattern in #6541 visible
even after #9420 and #9421:

* weed/util/buffer_pool: SyncPoolPutBuffer dropped a buffer back into
  sync.Pool regardless of how big it had grown. After a 64 MiB chunk
  upload through volume.PostHandler -> needle.ParseUpload, the pool
  hoarded a 64 MiB byte array per cached entry for the rest of the
  process's lifetime. Cap retention at 4 MiB; oversized buffers are
  dropped so GC can reclaim the backing array (see the sketch after
  this list).

* weed/s3api/...copy.go: uploadChunkData left UploadOption.BytesBuffer
  unset, so operation.upload_content fell back to the package-global
  valyala/bytebufferpool. That pool also retains high-water buffers
  forever, and concurrent UploadPartCopy filled it with one chunk-sized
  buffer per concurrent upload. Provide a fresh per-call bytes.Buffer
  pre-sized to chunk + multipart framing; it's GC'd as soon as the
  upload returns.
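
A minimal sketch of the capped put (hypothetical names; the cap value
is the one the commit picks):

  package sketch

  import (
      "bytes"
      "sync"
  )

  const maxRetainedCap = 4 << 20 // 4 MiB retention cap

  var pool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

  // putBuffer drops any buffer that grew past the cap instead of
  // returning it, so GC can reclaim the high-water backing array
  // rather than the pool pinning it for the process lifetime.
  func putBuffer(b *bytes.Buffer) {
      if b == nil || b.Cap() > maxRetainedCap {
          return
      }
      b.Reset()
      pool.Put(b)
  }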

Tests:
- weed/util/buffer_pool/sync_pool_test.go: pin the cap (oversized
  buffers don't round-trip), the inverse (right-sized buffers do), and
  nil-safety.
- weed/s3api/...copy_chunk_upload_test.go: extract newChunkUploadOption
  and pin that BytesBuffer is always non-nil and pre-sized, and that
  each call gets a distinct buffer.

* feat(s3api): stream chunk copy via io.Pipe to cut peak working set

Final piece for #6541. The buffered chunk-copy path holds two
chunk-sized buffers per copy in flight (download buffer + multipart-
encoded upload buffer). Under concurrent UploadPartCopy that put a
floor on RSS at concurrency × 2 × chunk_size — about 768 MiB for the
6-way / 64 MiB Harbor-style assemble repro, even after the previous
pool/retention fixes.

Replace the buffered path with an io.Pipe between the source GET and
the destination POST: ReadUrlAsStream pumps data into the pipe via a
multipart.Writer, the http.Client reads from the pipe end and POSTs
the body. In-flight per copy is now ~32 KiB (pipe hand-off + http
buffers), regardless of chunk size.
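
A minimal sketch of the hand-off (hypothetical names; the real producer
is ReadUrlAsStream pumping into the multipart.Writer):

  package sketch

  import (
      "io"
      "mime/multipart"
      "net/http"
  )

  func streamCopy(srcURL, dstURL string) error {
      pr, pw := io.Pipe()
      mw := multipart.NewWriter(pw)

      go func() { // producer: source GET -> multipart framing -> pipe
          part, err := mw.CreateFormFile("file", "chunk")
          if err != nil {
              pw.CloseWithError(err)
              return
          }
          resp, err := http.Get(srcURL)
          if err != nil {
              pw.CloseWithError(err)
              return
          }
          defer resp.Body.Close()
          if _, err := io.Copy(part, resp.Body); err != nil {
              pw.CloseWithError(err)
              return
          }
          // Close the multipart writer first (emits the trailer), then
          // hand its result to the pipe so the consumer sees EOF/error.
          pw.CloseWithError(mw.Close())
      }()

      // Consumer: the POST body *is* the pipe; nothing is buffered
      // beyond the pipe hand-off and http's own transport buffers.
      resp, err := http.Post(dstURL, mw.FormDataContentType(), pr)
      if err != nil {
          pr.Close() // fail the producer's pending writes too
          return err
      }
      defer resp.Body.Close()
      _, err = io.Copy(io.Discard, resp.Body)
      return err
  }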

The streaming path is gated by canStreamCopyChunk: only used when no
in-transit transformation is needed (no per-chunk CipherKey, no SSE).
SSE-C / SSE-KMS / SSE-S3 paths still go through the buffered path,
which already handles re-encryption correctly.

Benchmarks (Apple M4, httptest source/dest, B/op = bytes per copy):

  Buffered  1 MiB:   6.0 MB B/op,  443 MB/s
  Streamed  1 MiB:   374 KB B/op,  727 MB/s
  Buffered  8 MiB:    56 MB B/op,  559 MB/s
  Streamed  8 MiB:   379 KB B/op, 1138 MB/s
  Buffered 64 MiB:   455 MB B/op,  718 MB/s
  Streamed 64 MiB:   304 KB B/op, 1387 MB/s

End-to-end repro (512 MiB src, 6 parallel UploadPartCopy):
  pre-#9420 RSS round 2: 3134 MiB
  + #9420/#9421/#9422  : 2236 MiB
  + this PR            : 1521 MiB
  heap inuse_space     :  350 MiB (was 1422 / 1187 MiB)
  HeapSys (MemStats)   : 1.74 GiB (was 2.49 GiB)

* review: surface shouldRetry, add int32 guard, drop redundant drains

Address review on PR 9424:

* coderabbit (HIGH, line 122): ReadUrlAsStream can set shouldRetry=true
  with readErr=nil. Before this fix, that fell through to mw.Close()
  and the destination POST succeeded against a possibly-truncated
  multipart body. Mirror downloadChunkData's explicit check and
  surface shouldRetry as a producer error so the dst POST aborts.
* gemini (line 98): chunk size is int64 but ReadUrlAsStream takes int.
  Reject sizes above MaxInt32 up front so the int(size) cast can't
  truncate to a negative value on 32-bit platforms — the same guard
  downloadChunkData uses.
* gemini (line 151): util_http.CloseResponse already drains the body
  (io.Copy(io.Discard, ...) inside the helper) before closing, so the
  manual io.Copy drains we added are redundant. Drop them.

* review: cancel source GET when destination POST fails

Address coderabbit review (line 165 / second pass on PR 9424): when
the POST leg fails or returns an error status, closing pipeReader
only fails the producer's *writes*. ReadUrlAsStream's own read loop
runs under the parent ctx, so it keeps draining the source body in
the background until EOF — wasting source-volume bandwidth and CPU
on a copy that's already failed.

Wrap streamCopyChunkRange in a child context cancelled on return.
ReadUrlAsStream checks ctx.Done() per 256 KiB tick, so the in-flight
read aborts on the next iteration once the function returns. The POST
also moves to streamCtx so the in-flight request can be cancelled the
same way if the producer fails first.

Defer-cancel runs after both legs return, so the success path still
sends EOF cleanly through pipeWriter.Close before cancellation.
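
A minimal sketch of the cancellation shape (hypothetical names):

  package sketch

  import (
      "context"
      "io"
      "net/http"
  )

  func streamCopyChunkRange(ctx context.Context, dstURL string,
      produce func(context.Context, io.Writer) error) error {
      streamCtx, cancel := context.WithCancel(ctx)
      defer cancel() // fires after both legs return; the success path
                     // has already pushed EOF through the pipe by then

      pr, pw := io.Pipe()
      go func() { pw.CloseWithError(produce(streamCtx, pw)) }()

      req, err := http.NewRequestWithContext(streamCtx, http.MethodPost, dstURL, pr)
      if err != nil {
          pr.Close()
          return err
      }
      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          pr.Close() // unblock a pending producer write; the deferred
          return err // cancel aborts its read loop at the next ctx check
      }
      defer resp.Body.Close()
      _, err = io.Copy(io.Discard, resp.Body)
      return err
  }
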
2026-05-10 14:29:39 -07:00