13866 Commits

Author SHA1 Message Date
Chris Lu
01b3e4a71c template 4.26 2026-05-17 23:12:04 -07:00
Chris Lu
6cab199400 fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527)
* fix(iceberg): dial filer gRPC address verbatim in plugin worker

dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress,
whose single-port fallback adds +10000 to any host:port — so when the admin
forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker,
the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent
28888 and dispatched jobs fail with connection refused.

Drop the conversion; the address is already dialable. Tests that produced fake
filer addresses in dual-port form now return host:grpcPort to match the new
contract.

* test(ec): use renamed detection_interval_minutes field

The admin_runtime.detection_interval_seconds field was renamed to
detection_interval_minutes back in May. This integration test was not
updated, so the unknown JSON field was silently ignored and the scheduler
fell back to the default detection interval (17 min for erasure_coding),
which exceeds the test's 5-minute wait and times out.

Switch to detection_interval_minutes: 1 — local run completes in ~120s.
2026-05-17 23:03:00 -07:00
Chris Lu
136eb1b7c8 4.26 2026-05-17 21:05:25 -07:00
Chris Lu
c11ff6657b fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525)
* fix(ec): mirror EC sidecars onto every shard-bearing disk at startup

In a multi-disk volume server, ec.balance and ec.rebuild can land shards
on a disk that does not also hold the matching .ecx / .ecj / .vif index
files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks
already loads those shards by pointing the EcVolume at the sibling
disk's index files; reads work, but any failure on the index-owning
disk silently disables every shard on the other disk, even though those
shards are physically fine.

This change adds mirrorEcMetadataToShardDisks, a startup pass that
physically replicates .ecx / .ecj / .vif onto each disk that holds
shards but is missing them. Each copy is atomic (tmp + fsync + rename)
and idempotent (a destination that already has the sidecar is
preserved). After mirroring, the cross-disk reconciler prefers the
local IdxDirectory so the EcVolume mounts self-contained; the
cross-disk virtual mount remains as a fallback for volumes whose mirror
failed (read-only target, out of space, partial copy on a previous
boot).

The same-disk invariant the EC lifecycle (encode / decode / balance /
vacuum / repair) was already documented as promising is now actually
restored at boot, so a future failure of one disk in a split-shards
layout no longer takes the other disk's shards with it.

Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj /
.vif from dir1) and idempotency (an existing destination .ecx is not
overwritten with the owner's copy).

* fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check

hasAllEcSidecarsLocally checked only the modern destination path
(IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk
that still had a legacy .ecx in its data dir (written before -dir.idx
was set) would report "not present" and the mirror would write a
second copy to IdxDirectory, leaving two .ecx files on disk.

Matches HasEcxFileOnDisk's open-with-fallback contract: check the
modern path first, then the opposite directory. Factored the
exists-and-not-a-dir check into a small statRegular helper so the
fallback ladder stays readable.

* rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup

Port of the Go fix (commit 088e26ea6) to the Rust volume server.
Adds Store::mirror_ec_metadata_to_shard_disks, called from
add_location / load_new_volumes before the cross-disk orphan
reconciler. Physically copies .ecx / .ecj / .vif from the disk that
owns the index files onto every disk holding shards but missing
sidecars, so each shard-bearing disk ends up self-contained.

The reconciler now prefers the local idx_directory when the mirror
has installed a .ecx there; the cross-disk virtual mount remains as
the fallback for volumes whose mirror failed (read-only target, out
of space, partial copy on a previous boot). Adds ec_local_ecx_path
helper shared between reconcile and mirror to detect the post-mirror
fast path.

Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when
-dir.idx is configured and the destination still has a legacy .ecx
in its data dir, that's recognized so the mirror does not write a
duplicate copy into idx_directory.

Tests cover the two key cases: orphan layout (dir0 receives the
sidecars from dir1) and idempotency (a pre-existing destination .ecx
is not overwritten).

* trim verbose comments on EC mirror code

Comments now lead with the WHY (non-obvious constraints, the
post-mirror fast path, why local copies are authoritative) and drop
restate-the-code blocks, headers, and section dividers. Behavior is
unchanged; all existing tests still pass on both the Go volume
server and the seaweed-volume Rust port.

* drop github issue refs from added comments

Two stray "#9212" references slipped into comments I added on the
cross-disk reconciler call site. The git log carries the issue
history; comments stand on their own.

* test(ec): accept rebuild on either disk after sidecar mirror

TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must
land at the disk-0 path. With the boot-time sidecar mirror, every
shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now
picks whichever disk hosts the most shards — disk 1 in this layout
after the deletion. The shard can legitimately rebuild on either
disk; the test now accepts both and uses the chosen path for the
subsequent mount + read verification.
2026-05-17 19:55:15 -07:00
Chris Lu
6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
2026-05-17 19:13:09 -07:00
Chris Lu
ff6f9fd90a iam: honor configured credential store for IAM API policies and propagate to S3 caches (fixes #9518) (#9522)
* iamapi: route managed policies through credential manager (fixes #9518)

CreatePolicy via the IAM API wrote straight to the filer
/etc/iam/policies.json, ignoring any non-filer credential store. When
credential.postgres was configured, policies created via the IAM API
landed only in the filer while the Admin UI wrote to postgres,
producing a split-brain where ListPolicies/GetPolicy never saw the
Admin UI's policies and vice versa.

GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies
from credentialManager and persist Create/Update/Delete as a delta
against the store. Inline user/group policies still live in the legacy
policies.json file (no credential-store API for them yet). Pre-existing
managed policies in the legacy file are merged on read so deployments
don't lose data, and re-persisted to the store on the next write so
the legacy file is drained over time.

* credential: route IAM API inline policies through credential manager

Extends the #9518 fix to user-inline and group-inline policies so the
IAM API never writes the legacy /etc/iam/policies.json bundle directly.
The previous patch only routed managed policies; this one finishes the
job for the other two policy types.

- Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional
  interfaces, mirroring the existing user-inline ones, and matching
  Put/Get/Delete/List/LoadAll wrappers on CredentialManager.
- Implement group-inline storage in memory (new map), filer_etc (new
  field on PoliciesCollection, reusing the legacy file under policyMu),
  and postgres (new group_inline_policies table with ON DELETE CASCADE
  off the groups FK).
- Wire the new methods through PropagatingCredentialStore so wrapped
  stores still delegate correctly.
- IamS3ApiConfigure.PutPolicies now applies managed + user-inline +
  group-inline as deltas through the credential manager; the legacy
  /etc/iam/policies.json file is never written when a credential
  manager is wired up. GetPolicies still reads the legacy bundle once
  as a fallback so unmigrated data is picked up and re-persisted into
  the store on the next write.

* credential: propagate SaveConfiguration writes to running S3 caches

Postgres (and any non-filer) credential stores never fired the S3 IAM
cache invalidation path on bulk identity / group updates. The
PropagatingCredentialStore had explicit Put/Remove handlers for
single-entity calls (CreateUser, PutPolicy, etc.) but inherited
SaveConfiguration unchanged from the embedded store, so the bulk path
the IAM API takes at the end of every handler was silent. Inline-policy
changes recompute identity.Actions and persist via SaveConfiguration,
so until restart the cached Actions on each S3 server stayed stale and
authorization decisions used the pre-change view.

Override SaveConfiguration to snapshot the prior user / group lists,
delegate the save, then fan out PutIdentity / PutGroup for what's in
the new config and RemoveIdentity / RemoveGroup for what got pruned.
Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes.

* iamapi: drain legacy policies.json after authoritative credential-store writes

Review pointed out a resurrection bug: GetPolicies still reads
/etc/iam/policies.json as a one-way migration fallback, but PutPolicies
in the credential-manager path never wrote that file, so legacy-only
entries reappeared on the next read even after the IAM API "deleted"
them. PutPolicies now overwrites the bundle with an empty {} after a
successful credential-store write, unless the store is filer_etc
(which owns the bundle as its own inline-policy backing — clearing it
would wipe filer_etc's data). Also wraps the filer read, JSON
unmarshal, and marshal errors with context per the other review
comments.
2026-05-17 13:15:27 -07:00
Chris Lu
b4289abb0a admin: convert filer address to gRPC form before dispatch (#9523)
The master returns each registered filer in pb.ServerAddress dual-port
form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's
plugin context builder forwarded that string verbatim as
filer_grpc_address, so workers calling grpc.DialContext on it failed
every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port".

Run each entry through pb.ServerAddress.ToGrpcAddress before populating
ClusterContext.FilerGrpcAddresses.

The lifecycle integration test now pins filer.port.grpc to a value that
breaks the FILER_PORT+10000 assumption, and a new dispatch test drives
the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and
asserts the dispatched job both reaches the filer and deletes the
backdated object.
2026-05-17 11:33:54 -07:00
Chris Lu
2a41e76101 fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range

The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.

Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.

Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.

* log(ec): include disk_id in EC unmount/delete/refusal log lines

The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:

  UnmountEcShards 4121.1                              -> add disk_id
  ec volume video-recordings_4121 shard delete [1 5]  -> add per-loc disk_id
  volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
  ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids

ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.

Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).

* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server

The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
2026-05-17 11:31:37 -07:00
Chris Lu
bf9110ebd3 fix(ec): mount falls back to sibling-disk .ecx (fixes #9519) (#9521)
* fix(ec): mount falls back to sibling-disk .ecx (fixes #9519)

MountEcShards iterated DiskLocations and on each disk called LoadEcShard
with that disk's IdxDirectory as the .ecx home. When ec.balance lands the
.ec?? shard on disk A but the .ecx on sibling disk B of the same volume
server, NewEcVolume ENOENTs the .ecx and returns "cannot open ec volume
index ...". That error is not os.ErrNotExist, so the per-disk continue
branch did not engage and the mount loop bailed before trying any other
disk.

The startup reconciliation in reconcileEcShardsAcrossDisks already
handles this layout for orphan shards discovered on boot (issue #9212).
This change mirrors the same primitive on the mount path: look up the
.ecx owner across all DiskLocations once and route NewEcVolume at that
directory whenever the disk being mounted does not own its own copy of
the .ecx. Same-disk mounts are unaffected because HasEcxFileOnDisk keeps
LocalIdxDirectory in play.

Adds a regression test that plants the index files on a sibling disk
AFTER NewStore returns (so the startup reconcile is a no-op for that
vid) and verifies MountEcShards succeeds; also pins the same-disk
baseline against accidental re-routing.

* fix(ec): skip redundant stats in cross-disk .ecx lookup (review)

Two follow-ups from gemini-code-assist on #9521:

1. MountEcShards: when findEcxIdxDirForVolume already returned a path
   that lives on this disk's IdxDirectory or Directory, the disk owns
   the .ecx — skip the HasEcxFileOnDisk stat and use the local idx dir
   directly. Only re-check when the disk's directories are neither, so
   the duplicate-.ecx-on-multiple-disks edge case is still honored.

2. findEcxIdxDirForVolume: hoist the seen map across the location loop
   so a shared IdxDirectory (one -dir.idx paired with several -dir
   entries) is only stat'd once per call.

Both are I/O optimizations; behavior is unchanged. Existing
cross-disk and same-disk regression tests still pass.

* docs(ec): drop issue/PR references from cross-disk mount comments

Comments and test docstrings stand on their own; the issue number
adds nothing a reader can act on and goes stale across forks. Keep
the description of *what* the layout is and *why* the fallback
exists, just without the reference.
2026-05-17 11:10:37 -07:00
Chris Lu
d51454adf4 rust(seaweed-volume): distributed EC read across peer servers (#9516)
* feat(seaweed-volume): distributed EC read across peer servers

EcVolume::read_ec_shard_needle previously errored with NotFound when
any interval's shard wasn't local. In an RS(10,4)-across-N deployment
each server holds one shard, so every read needed >=9 peer fetches and
post-EC GETs returned 404 on volumes whose shards lived on more than
one server.

Mirror of weed/storage/store_ec.go's readOneEcShardInterval ->
readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain:

  * server/store_ec.rs (new): entry point
    read_ec_shard_needle_distributed. Snapshots locate-needle + local
    reads under the Store sync lock, drops the lock, then async-fetches
    missing intervals via the peer's VolumeEcShardRead RPC. Falls back
    to Reed-Solomon reconstruction (read every other shard at the same
    (shard_offset, size) and run rs.reconstruct) when the direct peer
    read fails. Refreshes the per-EcVolume shard_locations cache from
    the master's LookupEcVolume RPC using Go's freshness thresholds
    (11s / 7min / 37min).
  * erasure_coding/ec_volume.rs: shard_locations now sits behind a
    std::sync::RwLock so the read path can refresh the map without
    holding the Store write lock. Adds shard_locations_refresh_time
    (Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's
    ShardLocationsLock / ShardLocationsRefreshTime fields. set/get
    helpers updated for interior mutability.
  * server/handlers.rs: GET handler now tries the local-only fast
    path first, then falls through to the distributed path on
    NotFound.

* review: address PR 9516 feedback on distributed EC read

Five of the six PR-review comments addressed; the sixth (JWT on
outgoing peer gRPC) is deferred with an explicit TODO because the
crate-wide outgoing-JWT signing surface doesn't exist yet — adding it
in this one call site would split the credential plumbing across
peer paths that already lack it (copy_file_from_source, batch_delete,
…). Revisit when an outgoing-JWT helper lands.

Fixed in this commit:

  * Handlers: drop the two-tier (local-first, then distributed)
    read in handlers.rs. read_ec_shard_needle_distributed already
    does the local-first pass under the same store read lock; the
    redundant outer attempt re-read local intervals twice for any
    needle that spanned mixed-locality shards.
  * Scanner snapshot: replace inline locate-needle math with
    `ecv.locate_needle(needle_id)`. Same routine the local-only
    read path uses, so byte-identical on shard-size + interval
    boundaries.
  * EcVolume::set_shard_locations also advances
    shard_locations_refresh_time so the staleness check honors
    callers that populate the cache directly without going through
    the master LookupEcVolume RPC.
  * parse_grpc_address moved from grpc_server.rs into
    grpc_client.rs as `pub` and is reused by both grpc_server.rs
    and the new store_ec module. Single source of truth for the
    HTTP↔gRPC port-offset convention.
  * Reconstruction (recover_one_remote_ec_shard_interval) now seeds
    bufs from locally-mounted survivor shards BEFORE the remote
    fan-out. Previously the fan-out was remote-only, so when the
    shard_locations cache was cold or the master lookup failed,
    reconstruction errored even though enough siblings were on
    local disk to recover the missing interval.

* review: tighten parse_grpc_address; atomic shard-locations cache swap

Two follow-up findings from the PR 9516 review round 2:

  * `parse_grpc_address` now validates BOTH port components in the
    dotted form (`host:port.grpcPort`) — previously a non-numeric
    HTTP port like `host:abc.18080` slipped through and tripped a
    less-useful downstream URI parse error. The implicit form
    (`host:port` → port + 10000) also gains an overflow check so
    inputs like `host:60000` (which silently wrap past u16) are
    rejected here instead of producing an opaque connection
    failure later. Six unit tests cover each rejection path.

  * `EcVolume::set_shard_locations` no longer bumps the per-volume
    refresh timestamp. The previous fix introduced a freshness
    race: a multi-shard population that inserts shard-by-shard
    would flip `needs_refresh == false` on the first write, letting
    a concurrent reader observe a half-populated map already
    marked "fresh" and return NotFound for the not-yet-inserted
    shards. Added `EcVolume::replace_shard_locations(map)` for the
    atomic bulk swap; `write_back_shard_locations` in the
    distributed-read path uses it so the cache transitions
    old → fresh in a single observable step.
2026-05-16 20:44:28 -07:00
Chris Lu
f892b445b3 helm(admin): support secretExtraEnvironmentVars (refs #9511) (#9513)
* helm(admin): support secretExtraEnvironmentVars

The admin statefulset only honored extraEnvironmentVars, forcing the
OIDC client secret (and any other sensitive WEED_* value) to be inlined
as plain text in values.yaml — not GitOps-friendly. The filer chart has
had secretExtraEnvironmentVars for this exact case; mirror that pattern
on admin so secrets can be projected via valueFrom.secretKeyRef.

Surfaced by an enterprise OIDC deployment (issue #9511) where the only
workaround was hardcoding WEED_ADMIN_OIDC_CLIENT_SECRET in values.yaml.

* helm(admin): sort secretExtraEnvironmentVars keys for stable output

Helm/Go template map iteration is non-deterministic, so the env entries
could shuffle between renders and trigger spurious StatefulSet rollouts
in GitOps tooling (ArgoCD/Flux). Sort the keys with sortAlpha, mirroring
the extraEnvironmentVars block immediately above.

Flagged by gemini-code-assist and coderabbitai on PR #9513.
2026-05-15 13:19:05 -07:00
Chris Lu
62821964dd filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509) (#9514)
PR #9442 made the filer refuse to register the IAM gRPC service unless
jwt.filer_signing.key was set in security.toml, which broke the admin
UI Users/Groups/Policies pages for every deployment that ships without
a security.toml — weed mini, plain Helm, vanilla weed filer. The Users
tab returns Unimplemented and the page is unusable. Issues #9504,
#9505 and #9509 all trace to this gap.

The rest of the filer's gRPC surface is unauthenticated by default;
treat IAM the same way. The service now always registers, and the
auth gate is a no-op when no signing key is configured. When the key
is set, every RPC still requires an admin-signed Bearer token, matching
the post-#9442 behaviour. Operators who expose the filer gRPC port
beyond a trusted network should set the key on both filer and admin.

The admin client (IamGrpcStore.withIamClient) already skips attaching
the authorization metadata when its key is empty, so no changes there.
2026-05-15 13:15:20 -07:00
Konstantin Lebedev
7d1b16fbcd fix: ListBucketsHandler for pathStyleDomains (#9510) 2026-05-15 13:12:55 -07:00
Chris Lu
2ed95d7ea9 helm: decouple JWT signing from cert-manager mTLS (fixes #9506) (#9508)
* helm(security): decouple JWT signing from cert-manager mTLS

The filer needs jwt.filer_signing.key to register the IAM gRPC service the
Admin UI Users tab calls (PR #9442). The chart only rendered security.toml
under enableSecurity, which also pulls in cert-manager for mTLS — much heavier
than the Admin UI needs. Operators on Helm without cert-manager have no way
to flip the JWT key on, so the Users tab fails with Unimplemented after
upgrading past 4.24.

Introduce seaweedfs.securityConfigEnabled, true when enableSecurity OR any
explicit jwtSigning toggle (volumeRead/filerWrite/filerRead) is set. The
configmap renders under that helper; the [grpc.*]/[https.*] sections inside
stay gated on enableSecurity. Each pod template splits the security-config
mount onto the helper and keeps the cert volume mounts on enableSecurity.

volumeWrite is intentionally excluded from the helper trigger because it
defaults to true; including it would silently start mounting security.toml on
every fresh install. With this change, enableSecurity=false + defaults
renders nothing (unchanged), enableSecurity=true renders the full toml
(unchanged), and enableSecurity=false + filerWrite=true renders just the
[jwt.*] sections so the Admin UI works without mTLS.

Fixes #9506.

* helm(security): trim verbose comments

* helm(security): handle null securityConfig in helper

Address review feedback: (.Values.global.seaweedfs.securityConfig).jwtSigning
errored if a user explicitly set securityConfig: null in their values. Drop
into intermediate $sec/$jwt with default dict at each step so a missing or
nulled-out parent is tolerated.

* helm(ci): cover IAM gRPC decoupling (issue #9506)

Five regression assertions exercised against the rendered chart so a
future change cannot silently re-couple jwt.filer_signing to mTLS:

1. defaults render no security-config ConfigMap (preserves baseline)
2. filerWrite=true alone renders [jwt.filer_signing] with no [grpc.*]
3. filerWrite=true mounts security-config on filer + admin without
   pulling in cert volumes — the actual fix for the Admin UI Users tab
4. enableSecurity=true still produces the full toml with [grpc.master]
5. securityConfig=null and securityConfig.jwtSigning=null both render
   cleanly (gemini-code-assist review nit, applied chart-wide)

Patch a pre-existing direct-access in filer-statefulset.yaml that
crashed on securityConfig=null, surfaced by the new null assertion.

* helm(ci): drop issue numbers from comments

* helm(ci): install pyyaml; assert [jwt.signing] in mTLS path

Address coderabbit review:

- The new IAM gRPC test block uses `import yaml` but ran before the
  later `pip install pyyaml -q` step that the security+S3 block
  performs. CI happens to pass because the runner image carries
  PyYAML, but make the dependency explicit so a future runner change
  cannot silently break the regression test.

- The enableSecurity=true assertion only checked for [grpc.master].
  Also assert [jwt.signing] so a refactor that drops the volume-side
  JWT stanza from the mTLS path fails the test instead of slipping
  through.
2026-05-14 23:43:24 -07:00
Chris Lu
bfb2661fec fix(tests): make 32-bit GOARCH tests build and run (#9507)
fix(tests): make 32-bit GOARCH tests build and run (#9503)

verifyTestFilerClient had bare int64 atomic counters after a map header,
so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on
linux/386. Switch to atomic.Int64, which the stdlib guarantees is
8-byte aligned on all platforms.

rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to
t.Errorf, where it default-promoted to int and overflowed 32-bit int.
Bind it to a typed uint32 const used in both the comparison and the
error message.
2026-05-14 20:55:37 -07:00
Chris Lu
7acba59a5c 4.25 4.25 2026-05-14 12:16:26 -07:00
Chris Lu
2c1482f7a6 fix(ec): clear cross-server stale EC shards before re-distribute (#9478) (#9499)
* fix(ec): clear cross-server stale EC shards before re-distribute (#9478)

A previous failed encode leaves partial .ec?? shards mounted on
destination volume servers that are not the .dat owner. PR #9480 only
prunes when the .dat sits on a sibling disk of the SAME store, so the
cross-server case stays stuck: every retry trips
volume_grpc_copy.go:570's "ec volume %d is mounted; refusing overwrite"
guard and the scheduler loops.

Detection already lists existing EC shards as CleanupECShards sources;
plumb the shard ids through (ActiveTopology.GetECShardLocations,
TaskSourceSpec, TaskSource.shard_ids) and have the EC worker call
VolumeEcShardsUnmount + VolumeEcShardsDelete on each destination after
the local shard set is generated and before distributeEcShards. Skip
EC-shard sources in getReplicas so the post-encode VolumeDelete step
does not target destination-only nodes.

Integration test mounts a partial shard subset, asserts the
mounted-volume refusal, runs cleanupStaleEcShards, and asserts the
next ReceiveFile lands.

* chore(ec): tighten code comments in stale-shard cleanup

Drop issue-number refs from code comments and shorten the docstrings
on cleanupStaleEcShards / unmountAndDeleteEcShards / getReplicas plus
the new test file. Behavior unchanged.

* fix(ec): skip empty-ShardIds locations; dedupe getReplicas by node

GetECShardLocations dropped entries where ecShardMatchesCollection saw a
phantom info record with EcIndexBits=0 — without ShardIds, getReplicas
misread the resulting source as a regular replica and would have called
VolumeDelete on a destination-only node.

getReplicas now dedupes by Node since VolumeDelete is server-wide;
per-disk source rows on the same server collapse to one call.

* refactor(ec): use MaxShardCount and ShardBits in collectShardIdsForDisk

Drop the literal 32 bit-iteration bound for erasure_coding.MaxShardCount
and treat the EcIndexBits union as a ShardBits so Count() drives the
slice preallocation. Keeps the helper aligned with the rest of the EC
code and survives any future expansion of the shard-count ceiling.
2026-05-14 11:57:45 -07:00
Chris Lu
e56a3ee4a2 ci(s3-lifecycle): split into per-test matrix jobs
Each test now runs against a fresh `weed mini`, so per-collection TTL
volume budget no longer leaks across tests and exhausts the pool.
2026-05-14 11:47:24 -07:00
Chris Lu
db2d975b80 ci(docker): tag latest in unified release instead of rebuilding (#9500)
The separate container_latest.yml workflow rebuilt the latest image from
scratch on every tag push (full multi-arch build + QEMU + trivy gate),
which is slow and frequently fails — leaving `latest` stranded on the
prior release (e.g. 4.23 after 4.24 shipped, #9497).

Drop the rebuild. The unified release workflow already publishes the
exact same content as `<tag>` and `<tag>_large_disk`, so just re-tag
those manifests with `crane tag` on both GHCR and Docker Hub once
copy-to-dockerhub completes. Seconds, not hours, and no QEMU.

Move the trivy scan into the unified workflow as report-only: SARIF
still uploads to GitHub Security for visibility, but vuln findings
no longer block the release.

container_latest.yml stays as a workflow_dispatch-only manual fallback.

Refs #9497.
2026-05-14 11:26:28 -07:00
Chris Lu
c47eab1a5d admin: attach admin-signed Bearer token on filer IAM gRPC calls (#9498)
* admin: attach admin-signed Bearer token on filer IAM gRPC calls

PR #9442 added Bearer-JWT enforcement on the filer's IAM gRPC service
but didn't update its only production client, IamGrpcStore. The admin
UI Users/Groups pages went through that client and started failing in
4.24 with either Unimplemented (filer refuses to register the service
when jwt.filer_signing.key is empty) or Unauthenticated (the client
sent no token). Issues #9495 and #9496 both trace to this gap.

Plumb jwt.filer_signing.key into IamGrpcStore via a new SetAdminSigning
hook called from the admin server, and append a freshly minted Bearer
token to outgoing metadata on every call. The mint helper
security.GenJwtForFilerAdmin existed since #9442 but had no production
caller; this wires it up.

Add an integration test alongside grpc_store.go that runs a real
IamGrpcServer over a real grpc.Server listener and exercises the store
end-to-end: matching key succeeds, wrong key returns Unauthenticated,
no key returns Unauthenticated. Without the client-side token attach
the success path fails, so the regression cannot land again.

* address review: include adminSigningExpiresAfterSec in mu comment
2026-05-14 10:51:04 -07:00
Chris Lu
4bac9985b4 fix(build): pin apache/thrift to v0.22.0 for 32-bit GOARCH
thrift v0.23.0 uses math.MaxUint32 as an untyped int constant in
lib/go/thrift/framed_transport.go:206, which overflows int on 32-bit
targets (openbsd/arm, linux/arm, freebsd/arm, netbsd/arm) and breaks
the release binary builds.
4.24
2026-05-13 20:56:32 -07:00
Chris Lu
1dfea8a502 4.24 2026-05-13 19:56:22 -07:00
Chris Lu
3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu
0dde6a8c84 refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout (#9494)
* refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout

"Per-Run Time Limit (minutes)" duplicated the admin scheduler's
"Execution Timeout (s)" — both are wall-clock caps on the same
Execute call, stacked via context.WithTimeout. Whichever was shorter
won. Under defaults the scheduler's 90s timeout always clobbered the
worker's 60-min cap, so the "Per-Run Time Limit" knob was effectively
dead unless an operator also raised Execution Timeout, and operators
had to keep two values in agreement.

Remove the worker-side knob and declare a sane scheduler default on
the handler descriptor:

- WorkerConfigForm: nil (was: one section with one field)
- Config.MaxRuntime removed; ParseConfig drops max_runtime_minutes
- Handler no longer wraps ctx in context.WithTimeout(MaxRuntime);
  runCtx is just the ctx the scheduler passes
- AdminRuntimeDefaults.ExecutionTimeoutSeconds = 3600 (1h) and
  JobTypeMaxRuntimeSeconds = 3600 — the scheduler's global 90s
  default would otherwise kill every real run

Tests:
- TestParseConfigDefaults loses the MaxRuntime check; new
  TestParseConfigIgnoresWorkerValues documents the contract
- TestDescriptor_WorkerConfigFormIsAbsent pins that the form is gone
  so a future re-add forces a conscious revisit
- TestDescriptor_AdminRuntimeDefaultsBoundExecutionTimeout pins the
  1h default with a comment about the 90s scheduler floor

* fix(s3/lifecycle): no per-pass timeout by default

Lifecycle is a scheduled batch — its natural duration is "as long as
today's events take." The 1h default ExecutionTimeoutSeconds from the
previous commit was still a footgun: too low truncates legitimate
large-bucket passes; too high makes the value meaningless.

Set both ExecutionTimeoutSeconds and JobTypeMaxRuntimeSeconds to
math.MaxInt32 (~68 years) to say "no timeout in practice" in a
code-review-readable way. Operators who genuinely want a wall-clock
cap can set one in the admin UI; the scheduler's context.WithTimeout
machinery is unchanged (we just hand it an effectively-infinite
duration).

Note: the scheduler floors ExecutionTimeout at 90s
(defaultScheduledExecutionTimeout in weed/admin/plugin/plugin_scheduler.go),
so 0 doesn't mean "unlimited" — it clamps back to 90s. A literal
math.MaxInt32 is the way to express the intent without touching the
shared scheduler code.

Test updated to pin math.MaxInt32 and document the rationale so a
future tighter cap fails the test and forces conscious revisit.
2026-05-13 19:29:06 -07:00
Chris Lu
e9bcb8f4ad docs(s3/lifecycle): refresh DESIGN.md as-built (#9491)
* docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages

DESIGN.md was written as a phased implementation plan ("Phase 2 will
ship X, Phase 4 will ship Y"). All phases are now merged, plus the
post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially
changed the worker model (single subscription, walker throttle,
observability gauges). Rewrite the doc in present tense describing
what's actually there.

Net changes vs the prior plan-style doc:
- Algorithm pseudo-code reflects the single-subscription fan-out plus
  walkedThisPass within-pass guard.
- Walker invocation table replaces the implicit "two distinct calls"
  prose with three call sites (recovery / steady-state / empty-replay)
  and their throttle gates.
- New section on the subscription model (one Reader, ShardPredicate,
  fan-out by ev.ShardID).
- New section on cursor.LastWalkedNs and the WalkerInterval throttle.
- Observability section: gauges, heartbeat tokens, what each means.
- "Implementation history" table maps phases to merged PRs.
- "Future work" lists the four optimizations we deferred (long-lived
  subscription, bucket-coordinated walker, per-bucket lag metric,
  filer meta-log retention).

Drop the "Phase N — ..." narrative from the bottom; the PR history
table is the durable artifact now.

Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for
the operator-facing docs. README explains the sync workflow with the
external seaweedfs.wiki.git repo. Five pages:

- Home.md — landing page, supported rule shapes, what the worker does
- Operator-Guide.md — config knobs, when to change each, walker
  interval recommendations by cluster size
- Monitoring.md — Prometheus metric reference + heartbeat token table
  + suggested PromQL alerts
- Troubleshooting.md — stuck cursor, walker stuck, failure outcomes,
  cursor schema for manual inspection
- Architecture.md — high-level overview for newcomers; sits between
  Home.md (operator) and DESIGN.md (developer)

* docs(s3/lifecycle): address PR review feedback on docs

Coderabbit + gemini findings on #9491:

- Monitoring.md: clarify the "matches all dispatched" phrasing; note
  that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value
  (shouldn't appear in healthy systems); filter PromQL alerts to
  ignore zero-valued gauges so fresh-install heartbeats don't trip.
- Operator-Guide.md, Troubleshooting.md: clarify weed shell -master
  format as host:http_port.grpc_port (SeaweedFS ServerAddress).
- Troubleshooting.md: pause the s3_lifecycle job in the admin UI
  before manually editing a cursor file, otherwise the worker's
  save races with the operator's edit.
- Architecture.md, Home.md, Operator-Guide.md, Monitoring.md,
  Troubleshooting.md, DESIGN.md: add language tags (`text`) to
  fenced code blocks for markdownlint MD040 compliance.
- DESIGN.md: standardize on the S3 spec rule names
  (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`,
  `AbortIncompleteMultipartUpload`) and add a one-line note mapping
  them to the engine's `ActionKind*` constants.
- README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the
  sync workflow so the `cp` commands' repo-root-relative paths work
  whether the operator's shell is at the repo root or at
  docs/wiki/s3-lifecycle/.
- Home.md: was lagging the wiki-repo merged version (had the older
  pre-merge content). Re-sync from the wiki repo so source matches.

* docs(s3/lifecycle): remove wiki pages from PR

The wiki pages belong in seaweedfs.wiki.git, not the main repo. The
source-of-truth concern that motivated adding them here is real but
the cost — every code-review touchpoint requires reviewers to load
operator-facing pages too — outweighs it. The wiki pages are already
pushed locally (~/dev/seaweedfs.wiki); they'll publish on the
operator-side workflow.

This PR remains scoped to DESIGN.md (the developer-facing reference
that does belong with the code).

* docs(s3/lifecycle): drop Implementation history section

git log is the durable record of what shipped when; the prose table
duplicates it and goes stale faster than commit metadata.

* docs(s3/lifecycle): soften 'exactly once per run' in Goal

The prior phrasing overstated the guarantee versus the failure model
documented later in the same file. Reword to: 'process due objects
each pass; retryable/blocked outcomes get retried from the cursor on
later runs.' Surfaces the head-of-line-blocking semantics up front so
the rest of the doc reads consistently.

Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those
pages live in the wiki repo, not the main repo.
2026-05-13 17:06:14 -07:00
Chris Lu
813f1351f8 feat(s3/lifecycle): enable scheduler by default (#9492)
S3 lifecycle is a standard bucket feature — operators set
PutBucketLifecycleConfiguration through the S3 API expecting the
configured expirations to actually fire. With the prior default
(scheduler enabled=false), buckets with lifecycle XML silently
retained data past their declared expiration until an operator
noticed and turned the scheduler on.

The failure mode of enabled-by-default is "worker runs every day
and fast-exits on buckets with no lifecycle rules" — cheap.
The failure mode of disabled-by-default is "data lingers, looks
like it expired, doesn't" — bad. Enabled-by-default matches both
the AWS S3 default behavior and the operator's natural mental
model.

Operators who want the worker off can still disable it via the
admin UI; once a persisted config exists, this descriptor default
no longer applies (the persisted Enabled state wins).

Test pins the choice so a future flip to false fails loud.
2026-05-13 16:57:10 -07:00
dependabot[bot]
453c735d02 build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 in /test/kafka (#9489)
build(deps): bump github.com/go-git/go-billy/v5 in /test/kafka

Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.8.0 to 5.9.0.
- [Release notes](https://github.com/go-git/go-billy/releases)
- [Commits](https://github.com/go-git/go-billy/compare/v5.8.0...v5.9.0)

---
updated-dependencies:
- dependency-name: github.com/go-git/go-billy/v5
  dependency-version: 5.9.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-05-13 14:45:44 -07:00
Chris Lu
d5e54f217d feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":

- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
  `now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
  can confirm WalkerInterval is actually being honored. A stuck value
  means the scheduler isn't invoking the worker, the throttle is too
  long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
  gauges and the persisted state stay aligned (gauges only advance on
  successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
  `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
  Existing tokens stay positional-stable; new ones append at the end.
  Marker `cold` distinguishes "not started" from "0s caught up."

Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).

Stacked on #9485.
2026-05-13 14:18:35 -07:00
Chris Lu
bbc075b353 feat(s3/lifecycle): plumb WalkerInterval through worker admin config (#9485)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now); TestRunShard_ColdStartDoesNot
DoubleWalk still pins the integration behavior end-to-end.

* feat(s3/lifecycle): plumb WalkerInterval through worker admin config

#9484 added cfg.WalkerInterval to dailyrun.Config but left the worker
side wired to zero — operators couldn't actually use the throttle
without recompiling. Add the admin-schema knob:

- New constant WalkerIntervalMinutesAdminKey = "walker_interval_minutes"
  follows the MetaLogRetentionDaysAdminKey pattern (Int64, minutes
  unit, 0 = unbounded / fire every pass).
- New Config.WalkerInterval populated in ParseConfig from
  adminValues; negative / zero stay at zero so the prior "fire every
  pass" semantics keep the in-repo integration tests and the s3tests
  sub-minute driver working unchanged.
- handler.go: admin form field with operator-facing label and
  description, default in DefaultValues, value forwarded to
  dailyrun.Run via cfg.WalkerInterval.

Tests cover the default-zero, positive, and negative cases — same
shape as the MetaLogRetention tests so the parsing contract stays
consistent.

Stacked on #9484; rebase after that lands.
2026-05-13 14:09:31 -07:00
Chris Lu
c6582228b8 feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. Recovery walker fires also
update LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now); TestRunShard_ColdStartDoesNot
DoubleWalk still pins the integration behavior end-to-end.
2026-05-13 14:09:13 -07:00
Lars Lehtonen
75c807b586 chore(weed/mq/kafka/protocol): remove unused functions and variables (#9488) 2026-05-13 13:59:24 -07:00
Chris Lu
d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu
79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. clean and orphan-only categories double-counted when a directory
   had no pointer and at least one orphan: incremented both. Make them
   mutually exclusive so report totals sum to versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
2026-05-13 10:48:58 -07:00
Chris Lu
e025ec2334 fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends (#9483)
* fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends

NewSortedFileNeedleMap never initialised the inherited
baseNeedleMapper.indexFileOffset, so it stayed at zero. The compact and
leveldb constructors both seed it from stat().Size(); this one did not.

When a read-only or noWriteCanDelete volume processed a Delete, the
inherited appendToIndexFile wrote the tombstone via WriteAt at
indexFileOffset=0 and advanced 16 bytes at a time. Every delete since
the SortedFileNeedleMap path was first exercised for writes (#5633,
which switched .sdx to O_RDWR) overwrote the front of .idx with
tombstones for unrelated keys.

Net effect on disk: the first N entries of .idx become sequentially
written tombstones with .dat offsets at the tail, while the original
Put records that lived in slots [0..N) are gone. fsck sees ~N phantom
orphans whose keys' runtime needle-map entries are already tombstoned
in .sdx, so the volume server replies 304 to every purge and the
orphan count is stable across retries.

Stat the .idx at open and seed indexFileOffset = size. The .idx now
grows on delete instead of being clobbered. Add a regression test that
populates .idx via the compact map, opens it as a SortedFileNeedleMap,
deletes one needle, and verifies the file grew by one entry, the
original Put records survived, and the tombstone landed at the tail.

The fix only stops further corruption. .idx files already damaged by
this bug have to be rebuilt from .dat before the next restart, or
isSortedFileFresh will regenerate .sdx from the bad .idx and propagate
the damage.

Refs #9479

* test: harden SortedFileNeedleMap regression assertions

Address PR review (gemini-code-assist, coderabbit):

- Close the writer before t.Fatalf in the Put loop so a failing seed
  doesn't leak the .idx file descriptor.
- Verify offset and size of preserved Put records, not just the key.
  Front-overwrite damage would clobber all three fields, but a key-only
  check would miss a different regression that corrupted offset/size
  while leaving the key intact.
2026-05-13 10:22:01 -07:00
Chris Lu
f5a4bfb514 fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete

deleteSpecificObjectVersion did two non-atomic filer ops: rm the version
blob, then update the .versions/ pointer. Step 2 failures were silently
logged and the client got 204 OK, so any transient blip (filer timeout,
process restart between RPCs, lock contention) left the .versions/
directory naming a missing file. Subsequent GETs paid the 10-retry
self-heal cost and returned NoSuchKey — surfacing as "Storage not found"
to Veeam, which is what triggered this investigation.

Three changes:

1. Pre-roll the pointer for the singleton / multi-version-deleting-latest
   cases. The pointer is repointed (multi) or cleared (singleton) before
   the blob rm. A failure between leaves a recoverable orphan blob —
   pointer is consistent, GETs succeed or correctly miss without
   entering the stale-pointer self-heal path.

2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion
   with bounded retries (~6.3s worst case). When retries are exhausted
   the function now returns a non-nil error instead of swallowing it.
   The caller logs at Error level and queues the path for the
   reconciler.

3. Background reconciler drains stranded .versions/ pointer-to-missing
   states off the hot path. Bounded in-memory queue with capped retries;
   read-path heal remains as a last-resort safety net.

* fix(s3/versioning): address review on #9460

Four fixes addressing review on PR #9460. All four are correctness;
no behavioural change for the happy path.

1. repointLatestBeforeDeletion: discriminate NotFound from transient
   errors when re-fetching the .versions/ entry. Previously any error
   returned rolled=true,nil — a transient filer hiccup at that point
   would cause the caller to skip the post-delete reconciliation AND
   proceed with the blob rm, producing exactly the dangling pointer
   state the PR aims to prevent. NotFound stays "vacuously consistent"
   (directory already gone); other errors surface so the caller aborts
   before removing the blob.

2. Move the singleton .versions/ teardown out of
   repointLatestBeforeDeletion (where it ran BEFORE the blob rm and
   always failed with "non-empty folder") into deleteSpecificObjectVersion
   AFTER the blob rm. Adds a wasSingleton return value so the caller
   knows when to run the teardown. Without this, every singleton-version
   delete in a versioned bucket leaked an empty .versions/ directory.

3. Wrap the list, getEntry, and mkFile calls inside
   repointLatestBeforeDeletion with retryFilerOp so the pre-roll has
   the same transient-failure resilience as the post-roll path. Without
   retries, a single transient blip causes the caller to fall back to
   the legacy non-atomic flow even when the filer recovers immediately.

4. healVersionsPointer in the reconciler: same NotFound-vs-transient
   discrimination on both the .versions/ getEntry and the latest-file
   presence probe. Previously a transient filer error would silently
   evict the candidate from the queue as "healed", leaving the real
   stranded state until a client read happened to surface it.

Also fixes the gemini-flagged consistency nit: the queued-for-reconciler
error log now uses normalizedObject instead of object so it matches the
queue entry's key.

* fix(s3/versioning): short-circuit terminal errors in retryFilerOp

Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound,
gRPC NotFound, context.Canceled, and context.DeadlineExceeded.
retryFilerOp now bails immediately on a terminal error and returns it
unwrapped, so callers like repointLatestBeforeDeletion.getEntry and
updateLatestVersionAfterDeletion.rm see the raw NotFound instead of
paying the ~6.3 s retry-budget delay AND parsing it out of an
"exhausted N retries" wrapper.

errors.Is and status.Code already walk the %w chain so today's call
sites still work, but the delay was real on the hot DELETE path
whenever a key was genuinely absent. Test added covering all five
terminal-error shapes — each must run the wrapped fn exactly once and
return in under 50 ms.
2026-05-13 10:14:27 -07:00
Chris Lu
de28c4df61 fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) (#9480)
* fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478)

handleFoundEcxFile only checks for .dat in the same disk location as the
EC shards. In a multi-disk volume server an interrupted encode can leave
.ec?? + .ecx on disk B while the source .dat still lives on disk A: the
per-disk loader sees no .dat next to .ecx, mistakes the leftover for a
distributed-EC layout, and mounts the partial shards. The volume server
then heartbeats both a regular replica and an EC shard for the same vid
and the master keeps both.

Sweep the store after per-disk loading and before the cross-disk
reconcile to delete partial EC files when a healthy .dat for the same
(collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for
every pruned shard so master forgets the new-shard message the per-disk
pass already emitted, instead of waiting for the next periodic heartbeat.

* fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478)

Rust port of the same Store-level prune added to weed/storage. The
per-disk EC loader in disk_location.rs only checks for .dat in the same
disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx
on disk B while the source .dat sits on disk A is mounted as if it were
a distributed-EC layout. The volume server then heartbeats both a
regular replica and an EC shard for the same vid.

Sweep the store after per-disk loading and before the cross-disk
reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT
shards when a .dat for the same (collection, vid) exists on a sibling
disk, and remove all on-disk EC artefacts for them. The Rust heartbeat
path already diff-emits deletes from the next ec_volumes snapshot, so no
explicit delete-channel push is needed here.

Tests cover both the issue 9478 layout and a distributed-EC layout with
no .dat anywhere on the store, which must be left alone.

* fix(storage): validate sibling .dat size before deleting partial EC (#9478)

The earlier prune deleted partial EC files whenever any .dat for the
same vid existed on a sibling disk — including a zero-byte shell. A
shell is no more useful than the partial shard it would replace, and
the partial shard might still combine with shards on other servers
in a recoverable distributed-EC layout. Wiping it based on a corrupt
sibling .dat is data loss masquerading as cleanup.

Tighten the check: when the EC's .vif recorded a non-zero source size
in datFileSize, require the sibling .dat to be at least that many
bytes; otherwise fall back to "at least a superblock". The .vif value
is what the encoder wrote at the moment the source was sealed, so a
sibling .dat smaller than that is provably truncated. Carry the size
through indexDatOwners alongside the location.

The Rust port had the same gap and an additional bug behind it:
EcVolume::new wasn't reading datFileSize from .vif, so the safety
check always fell back to the superblock floor. Wire datFileSize
through. The existing shard-size calculation in
LocateEcShardNeedleInterval already uses dat_file_size when non-zero,
so populating it also matches Go's behaviour there.

Tests cover the truncated-sibling case in both ports.
2026-05-13 09:25:10 -07:00
Chris Lu
3f1eaf9724 fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD

Successful GET/HEAD object requests never produced a fluent audit entry
because those handlers write the response directly (streaming for GET,
WriteHeader for HEAD) and never reach a PostLog call site. The wiki
advertises GET as an audited verb, so the asymmetry surprises operators
who rely on the log for read-access auditing.

Move the safety net into the track() middleware: tag each request with
an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it,
and emit a single fallback entry after the handler returns when nothing
fired. The recorder's status flows into the fallback so the audit row
still reflects 200/206 vs 404 etc. No double logging for handlers that
already emit (write helpers, error paths, bulk delete).

Refs #9463

* fix(s3/audit): defensive nil checks on audit-tracking helpers

Address PR review: guard against nil request and nil *atomic.Bool stored
under the audit-tracking key. The conditions are unreachable today (the
key is private and we only ever store new(atomic.Bool)), but the checks
are free and keep the helpers safe if a future caller misbehaves.

* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)

test(s3/audit): cover track() fallback wiring + cleanup

Adds two unit tests in weed/s3api/stats_test.go that exercise the
audit-tracking flag set up by track(): one verifies the fallback path
fires when a handler writes the response directly (the GET/HEAD object
regression in #9463), the other verifies the flag is set when a handler
emits PostLog itself so the fallback is skipped.

To make the wiring observable without standing up fluent, PostLog now
marks the audit flag before short-circuiting on a nil Logger; production
behavior is unchanged (no logger, no posting) but the flag stays
consistent.

Also drops two stale comments in s3api_object_handlers.go that still
referenced proxyToFiler — that helper was removed when GET/HEAD started
streaming from volume servers directly.

Stacks on #9467.
2026-05-13 09:24:59 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
2026-05-13 09:24:50 -07:00
dependabot[bot]
31c7996671 build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 (#9482) 2026-05-13 09:19:48 -07:00
Chris Lu
37e505b8fd refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass

Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.

drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.

cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.

Filer load: 16x -> 1x SubscribeMetadata streams per pass.

* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)

The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.

Same regression I fixed in 6796ab6db for the per-shard subscription;
now applied at the shared level. computeGlobalStartTsNs loads every
shard's cursor and picks the minimum, falling back to the cold-start
floor only for shards with no persisted cursor.
2026-05-13 02:13:11 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    walker-only-only snapshots produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard invoked with no -refresh — only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
7b8647e8bc fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476)
The s3tests workflow (.github/workflows/s3tests.yml) backgrounds
`weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'`
and then runs `kill -0 $pid` to confirm the worker stayed alive.
The PR-9475 restore ran dailyrun.Run once and exited cleanly — even
faster when no buckets had lifecycle rules yet ("nothing to run").
The aliveness check then failed and the s3tests job died with
"lifecycle worker died on startup". Caught on
https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401.

Fix:

  - -refresh now drives an inter-pass loop. cadence=0 (default) is
    one-shot, matching the test/s3/lifecycle/ integration-test
    invocation that omits -refresh and expects synchronous return.
    cadence>0 (the CI case) keeps the command alive until -runtime
    expires, running a fresh dailyrun.Run on every tick.
  - Each iteration re-loads bucket configs via
    scheduler.LoadCompileInputs so rules created mid-run (the s3tests
    flow creates rules AFTER the worker starts) get picked up.
  - The "no rules; nothing to run" early return is gone — the
    command stays alive even with an empty initial snapshot, waiting
    for tests to add rules.
  - -dispatch, -checkpoint, -bootstrap-interval stay accepted-but-
    ignored (legacy streaming flags).
2026-05-12 19:19:46 -07:00
Chris Lu
4ce027c2f3 fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475)
fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper

PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the
premise that it was a debug-only tool. It wasn't: the s3tests CI
workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/
integration tests invoke it via `weed shell` to drive lifecycle
expirations on demand. Both started failing with
"unknown command: s3.lifecycle.run-shard".

This PR restores the command with the same flag set so existing
callers (CI scripts and integration tests) work unchanged. The
implementation no longer drives the streaming dispatcher.Pipeline +
scheduler.BucketBootstrapper (deleted) — instead it does one bounded
dailyrun.Run pass through the same daily-replay code path the
production worker exercises. The walker fires for walker-bound
rules just like in the worker.

Obsolete streaming flags (-dispatch / -checkpoint / -refresh /
-bootstrap-interval) are accepted-but-ignored so existing scripts
don't need to drop them.
2026-05-12 18:29:18 -07:00
Chris Lu
ce5768fab1 feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash (#9473)
* feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash

dailyrun.Config.RetentionWindow has been wired since Phase 4b but the
handler never supplied a value, so runShard always fell back to
maxTTL and engine.PromotedHash hashed nothing. The partition-flip
recovery trigger was dormant by design "until the handler plumbs the
real meta-log retention here."

This PR plumbs it via a new admin form field:
  Meta-Log Retention (days) — 0 = unbounded (current behavior).

When set, ParseConfig converts days to a time.Duration on
cfg.MetaLogRetention. The handler passes it as
dailyrun.Config.RetentionWindow, which runShard then feeds to
engine.PromotedHash. Rules whose TTL exceeds the declared window
land in the walk partition; the next time an operator shrinks
retention so a previously replay-eligible rule slips past it,
PromotedHash mismatches → recovery branch fires → walker re-evaluates
the rule across the whole filer tree.

0 stays the default, so existing deployments see no behavior change.

* chore(s3/lifecycle): rephrase days->duration conversion

gemini-code-assist flagged the original form as a compile error,
which it wasn't (time.Duration is a named int64 and supports * with
other time.Durations — the test suite verified the value was
correct). The suggested form is more idiomatic regardless:
days*24 happens in int64 space before the lift to time.Duration,
so the unit is unambiguous.
2026-05-12 18:26:52 -07:00
Chris Lu
f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
2026-05-12 18:22:21 -07:00
Chris Lu
43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
2026-05-12 16:00:44 -07:00
Chris Lu
f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu
10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (the scheduler/configload
tests' helper file the deleted bootstrap_test.go used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
745e864bda feat(s3/lifecycle): retire algorithm flag, daily_replay is the only path (Phase 5a) (#9465)
feat(s3/lifecycle): remove algorithm flag, daily_replay is the only path (Phase 5a)

With Phase 4b on master the daily_replay path covers every rule kind
and the streaming algorithm has no remaining responsibilities. This
PR retires the algorithm flag from the worker:

  - Drop the "Algorithm" enum field from AdminConfigForm and its
    DefaultValues entry.
  - Drop the if/else routing in Execute — every Execute call now
    routes straight into executeDailyReplay.
  - Drop the streaming-only worker fields (DispatchTick,
    CheckpointTick, RefreshInterval, BootstrapInterval) and their
    matching form fields. None of them are read by the daily_replay
    path; keeping them in the form would suggest tuning knobs that
    don't do anything.
  - Drop AlgorithmStreaming / AlgorithmDailyReplay constants and the
    Config.Algorithm field.

The streaming-path packages (s3lifecycle/scheduler, s3lifecycle/dispatcher)
remain on the tree; they're now reachable only by the
weed shell s3.lifecycle.run-shard debug command and the few helpers
(LoadCompileInputs, FilerStore, FilerSiblingLister) the daily_replay
worker still uses. Phase 5b deletes the dead code.

Tests prune the cadence-default assertions to the single remaining
field (max_runtime_minutes).
2026-05-12 12:39:37 -07:00