seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-21 09:11:29 +00:00

Author	SHA1	Message	Date
Chris Lu	01b3e4a71c	template 4.26	2026-05-17 23:12:04 -07:00
Chris Lu	6cab199400	fix(iceberg): dial filer gRPC address verbatim in plugin worker (#9527 ) * fix(iceberg): dial filer gRPC address verbatim in plugin worker dialFiler was running its address argument through pb.ServerAddress.ToGrpcAddress, whose single-port fallback adds +10000 to any host:port — so when the admin forwards ClusterContext.FilerGrpcAddresses (already host:grpcPort) to the worker, the iceberg handler turns the real gRPC port (e.g. 18888) into a non-existent 28888 and dispatched jobs fail with connection refused. Drop the conversion; the address is already dialable. Tests that produced fake filer addresses in dual-port form now return host:grpcPort to match the new contract. * test(ec): use renamed detection_interval_minutes field The admin_runtime.detection_interval_seconds field was renamed to detection_interval_minutes back in May. This integration test was not updated, so the unknown JSON field was silently ignored and the scheduler fell back to the default detection interval (17 min for erasure_coding), which exceeds the test's 5-minute wait and times out. Switch to detection_interval_minutes: 1 — local run completes in ~120s.	2026-05-17 23:03:00 -07:00
Chris Lu	136eb1b7c8	4.26	2026-05-17 21:05:25 -07:00
Chris Lu	c11ff6657b	fix(ec): mirror EC sidecars onto every shard-bearing disk at startup (#9525 ) * fix(ec): mirror EC sidecars onto every shard-bearing disk at startup In a multi-disk volume server, ec.balance and ec.rebuild can land shards on a disk that does not also hold the matching .ecx / .ecj / .vif index files. The orphan-shard reconciler in reconcileEcShardsAcrossDisks already loads those shards by pointing the EcVolume at the sibling disk's index files; reads work, but any failure on the index-owning disk silently disables every shard on the other disk, even though those shards are physically fine. This change adds mirrorEcMetadataToShardDisks, a startup pass that physically replicates .ecx / .ecj / .vif onto each disk that holds shards but is missing them. Each copy is atomic (tmp + fsync + rename) and idempotent (a destination that already has the sidecar is preserved). After mirroring, the cross-disk reconciler prefers the local IdxDirectory so the EcVolume mounts self-contained; the cross-disk virtual mount remains as a fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). The same-disk invariant the EC lifecycle (encode / decode / balance / vacuum / repair) was already documented as promising is now actually restored at boot, so a future failure of one disk in a split-shards layout no longer takes the other disk's shards with it. Tests cover the orphan-layout mirror (dir0 receives the .ecx / .ecj / .vif from dir1) and idempotency (an existing destination .ecx is not overwritten with the owner's copy). * fix(ec): handle legacy pre-dir.idx sidecar layout in mirror skip-check hasAllEcSidecarsLocally checked only the modern destination path (IdxDirectory for .ecx/.ecj, Directory for .vif). A destination disk that still had a legacy .ecx in its data dir (written before -dir.idx was set) would report "not present" and the mirror would write a second copy to IdxDirectory, leaving two .ecx files on disk. Matches HasEcxFileOnDisk's open-with-fallback contract: check the modern path first, then the opposite directory. Factored the exists-and-not-a-dir check into a small statRegular helper so the fallback ladder stays readable. * rust(seaweed-volume): mirror EC sidecars onto shard-bearing disks at startup Port of the Go fix (commit `088e26ea6`) to the Rust volume server. Adds Store::mirror_ec_metadata_to_shard_disks, called from add_location / load_new_volumes before the cross-disk orphan reconciler. Physically copies .ecx / .ecj / .vif from the disk that owns the index files onto every disk holding shards but missing sidecars, so each shard-bearing disk ends up self-contained. The reconciler now prefers the local idx_directory when the mirror has installed a .ecx there; the cross-disk virtual mount remains as the fallback for volumes whose mirror failed (read-only target, out of space, partial copy on a previous boot). Adds ec_local_ecx_path helper shared between reconcile and mirror to detect the post-mirror fast path. Mirrors the Go-side fallback in hasAllEcSidecarsLocally: when -dir.idx is configured and the destination still has a legacy .ecx in its data dir, that's recognized so the mirror does not write a duplicate copy into idx_directory. Tests cover the two key cases: orphan layout (dir0 receives the sidecars from dir1) and idempotency (a pre-existing destination .ecx is not overwritten). * trim verbose comments on EC mirror code Comments now lead with the WHY (non-obvious constraints, the post-mirror fast path, why local copies are authoritative) and drop restate-the-code blocks, headers, and section dividers. Behavior is unchanged; all existing tests still pass on both the Go volume server and the seaweed-volume Rust port. * drop github issue refs from added comments Two stray "#9212" references slipped into comments I added on the cross-disk reconciler call site. The git log carries the issue history; comments stand on their own. * test(ec): accept rebuild on either disk after sidecar mirror TestEcLifecycleAcrossMultipleDisks asserted the rebuilt shard 9 must land at the disk-0 path. With the boot-time sidecar mirror, every shard-bearing disk owns its own .ecx, so VolumeEcShardsRebuild now picks whichever disk hosts the most shards — disk 1 in this layout after the deletion. The shard can legitimately rebuild on either disk; the test now accepts both and uses the chosen path for the subsequent mount + read verification.	2026-05-17 19:55:15 -07:00
Chris Lu	6b94701213	mini: quieter startup with a docker-compose-style progress board (#9524 ) * mini: quieter startup with a docker-compose-style progress board Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition. Also folds in a few cleanups uncovered while making this readable: - route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them - generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key - demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings - drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner - only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool) - use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY - reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin * mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run. startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. Forward the error through a buffered channel and select on it alongside ctx.Done(). * ci(s3-proxy-signature): match weed mini's new progress-board ready line The readiness probe grepped for "S3 (gateway\|service).*(started\|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases.	2026-05-17 19:13:09 -07:00
Chris Lu	ff6f9fd90a	iam: honor configured credential store for IAM API policies and propagate to S3 caches (fixes #9518 ) (#9522 ) * iamapi: route managed policies through credential manager (fixes #9518) CreatePolicy via the IAM API wrote straight to the filer /etc/iam/policies.json, ignoring any non-filer credential store. When credential.postgres was configured, policies created via the IAM API landed only in the filer while the Admin UI wrote to postgres, producing a split-brain where ListPolicies/GetPolicy never saw the Admin UI's policies and vice versa. GetPolicies/PutPolicies on IamS3ApiConfigure now load managed policies from credentialManager and persist Create/Update/Delete as a delta against the store. Inline user/group policies still live in the legacy policies.json file (no credential-store API for them yet). Pre-existing managed policies in the legacy file are merged on read so deployments don't lose data, and re-persisted to the store on the next write so the legacy file is drained over time. * credential: route IAM API inline policies through credential manager Extends the #9518 fix to user-inline and group-inline policies so the IAM API never writes the legacy /etc/iam/policies.json bundle directly. The previous patch only routed managed policies; this one finishes the job for the other two policy types. - Add GroupInlinePolicyStore + GroupInlinePoliciesLoader optional interfaces, mirroring the existing user-inline ones, and matching Put/Get/Delete/List/LoadAll wrappers on CredentialManager. - Implement group-inline storage in memory (new map), filer_etc (new field on PoliciesCollection, reusing the legacy file under policyMu), and postgres (new group_inline_policies table with ON DELETE CASCADE off the groups FK). - Wire the new methods through PropagatingCredentialStore so wrapped stores still delegate correctly. - IamS3ApiConfigure.PutPolicies now applies managed + user-inline + group-inline as deltas through the credential manager; the legacy /etc/iam/policies.json file is never written when a credential manager is wired up. GetPolicies still reads the legacy bundle once as a fallback so unmigrated data is picked up and re-persisted into the store on the next write. * credential: propagate SaveConfiguration writes to running S3 caches Postgres (and any non-filer) credential stores never fired the S3 IAM cache invalidation path on bulk identity / group updates. The PropagatingCredentialStore had explicit Put/Remove handlers for single-entity calls (CreateUser, PutPolicy, etc.) but inherited SaveConfiguration unchanged from the embedded store, so the bulk path the IAM API takes at the end of every handler was silent. Inline-policy changes recompute identity.Actions and persist via SaveConfiguration, so until restart the cached Actions on each S3 server stayed stale and authorization decisions used the pre-change view. Override SaveConfiguration to snapshot the prior user / group lists, delegate the save, then fan out PutIdentity / PutGroup for what's in the new config and RemoveIdentity / RemoveGroup for what got pruned. Reuses the existing SeaweedS3IamCache RPCs, no protobuf changes. * iamapi: drain legacy policies.json after authoritative credential-store writes Review pointed out a resurrection bug: GetPolicies still reads /etc/iam/policies.json as a one-way migration fallback, but PutPolicies in the credential-manager path never wrote that file, so legacy-only entries reappeared on the next read even after the IAM API "deleted" them. PutPolicies now overwrites the bundle with an empty {} after a successful credential-store write, unless the store is filer_etc (which owns the bundle as its own inline-policy backing — clearing it would wipe filer_etc's data). Also wraps the filer read, JSON unmarshal, and marshal errors with context per the other review comments.	2026-05-17 13:15:27 -07:00
Chris Lu	b4289abb0a	admin: convert filer address to gRPC form before dispatch (#9523 ) The master returns each registered filer in pb.ServerAddress dual-port form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's plugin context builder forwarded that string verbatim as filer_grpc_address, so workers calling grpc.DialContext on it failed every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port". Run each entry through pb.ServerAddress.ToGrpcAddress before populating ClusterContext.FilerGrpcAddresses. The lifecycle integration test now pins filer.port.grpc to a value that breaks the FILER_PORT+10000 assumption, and a new dispatch test drives the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and asserts the dispatched job both reaches the filer and deletes the backdated object.	2026-05-17 11:33:54 -07:00
Chris Lu	2a41e76101	fix(ec): blanket-clean every destination over the full shard range (#9512 ) * fix(ec): blanket-clean every destination over the full shard range The previous cleanup pass walked t.sources only, with the shard ids the topology had reported at detection time. In the wild, a destination can end up with EC shards mounted that the topology snapshot didn't list — shards on a sibling disk that hadn't heartbeated, or shards left over from a concurrent attempt's mount step. FindEcVolume still returns true, so the next ReceiveFile trips the mounted-volume guard. Cleanup now unions t.sources (with ShardIds) and t.targets and issues unmount + delete over [0..totalShards-1] on each. Both RPCs are idempotent on missing shards, so the wider sweep is free. Two new tests cover the gap: shards mounted beyond what t.sources lists, and a target-only destination with no source row. * log(ec): include disk_id in EC unmount/delete/refusal log lines The current logs identify the volume and shard but leave disk_id off, which makes the cross-server cleanup story hard to follow when multiple disks of one server hold pieces of the same volume: UnmountEcShards 4121.1 -> add disk_id ec volume video-recordings_4121 shard delete [1 5] -> add per-loc disk_id volume server X:Y deletes ec shards from 4121 [...] -> add disk_id ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids ReceiveFile's refusal now names the disk_ids actually holding the mount so operators can see whether the next cleanup pass needs to target a sibling disk. Added Store.FindEcVolumeDiskIds / Store::find_ec_volume_disk_ids as the supporting primitive. Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard, heartbeat delete log in diff_ec_shard_delta_messages, refusal in the ReceiveFile handler). * test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server The plugin-worker EC tests boot a fake volume server that embeds UnimplementedVolumeServerServer. After the worker started calling VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the default Unimplemented response surfaced as fourteen "method not implemented" errors and TestErasureCodingExecutionEncodesShards failed. Both RPCs are no-ops here — nothing on the fake server has mounted state or persisted shard files to remove.	2026-05-17 11:31:37 -07:00
Chris Lu	bf9110ebd3	fix(ec): mount falls back to sibling-disk .ecx (fixes #9519 ) (#9521 ) * fix(ec): mount falls back to sibling-disk .ecx (fixes #9519) MountEcShards iterated DiskLocations and on each disk called LoadEcShard with that disk's IdxDirectory as the .ecx home. When ec.balance lands the .ec?? shard on disk A but the .ecx on sibling disk B of the same volume server, NewEcVolume ENOENTs the .ecx and returns "cannot open ec volume index ...". That error is not os.ErrNotExist, so the per-disk continue branch did not engage and the mount loop bailed before trying any other disk. The startup reconciliation in reconcileEcShardsAcrossDisks already handles this layout for orphan shards discovered on boot (issue #9212). This change mirrors the same primitive on the mount path: look up the .ecx owner across all DiskLocations once and route NewEcVolume at that directory whenever the disk being mounted does not own its own copy of the .ecx. Same-disk mounts are unaffected because HasEcxFileOnDisk keeps LocalIdxDirectory in play. Adds a regression test that plants the index files on a sibling disk AFTER NewStore returns (so the startup reconcile is a no-op for that vid) and verifies MountEcShards succeeds; also pins the same-disk baseline against accidental re-routing. * fix(ec): skip redundant stats in cross-disk .ecx lookup (review) Two follow-ups from gemini-code-assist on #9521: 1. MountEcShards: when findEcxIdxDirForVolume already returned a path that lives on this disk's IdxDirectory or Directory, the disk owns the .ecx — skip the HasEcxFileOnDisk stat and use the local idx dir directly. Only re-check when the disk's directories are neither, so the duplicate-.ecx-on-multiple-disks edge case is still honored. 2. findEcxIdxDirForVolume: hoist the seen map across the location loop so a shared IdxDirectory (one -dir.idx paired with several -dir entries) is only stat'd once per call. Both are I/O optimizations; behavior is unchanged. Existing cross-disk and same-disk regression tests still pass. * docs(ec): drop issue/PR references from cross-disk mount comments Comments and test docstrings stand on their own; the issue number adds nothing a reader can act on and goes stale across forks. Keep the description of what the layout is and why the fallback exists, just without the reference.	2026-05-17 11:10:37 -07:00
Chris Lu	d51454adf4	rust(seaweed-volume): distributed EC read across peer servers (#9516 ) * feat(seaweed-volume): distributed EC read across peer servers EcVolume::read_ec_shard_needle previously errored with NotFound when any interval's shard wasn't local. In an RS(10,4)-across-N deployment each server holds one shard, so every read needed >=9 peer fetches and post-EC GETs returned 404 on volumes whose shards lived on more than one server. Mirror of weed/storage/store_ec.go's readOneEcShardInterval -> readRemoteEcShardInterval -> recoverOneRemoteEcShardInterval chain: * server/store_ec.rs (new): entry point read_ec_shard_needle_distributed. Snapshots locate-needle + local reads under the Store sync lock, drops the lock, then async-fetches missing intervals via the peer's VolumeEcShardRead RPC. Falls back to Reed-Solomon reconstruction (read every other shard at the same (shard_offset, size) and run rs.reconstruct) when the direct peer read fails. Refreshes the per-EcVolume shard_locations cache from the master's LookupEcVolume RPC using Go's freshness thresholds (11s / 7min / 37min). * erasure_coding/ec_volume.rs: shard_locations now sits behind a std::sync::RwLock so the read path can refresh the map without holding the Store write lock. Adds shard_locations_refresh_time (Mutex<Option<Instant>>) for the staleness heuristic. Mirrors Go's ShardLocationsLock / ShardLocationsRefreshTime fields. set/get helpers updated for interior mutability. * server/handlers.rs: GET handler now tries the local-only fast path first, then falls through to the distributed path on NotFound. * review: address PR 9516 feedback on distributed EC read Five of the six PR-review comments addressed; the sixth (JWT on outgoing peer gRPC) is deferred with an explicit TODO because the crate-wide outgoing-JWT signing surface doesn't exist yet — adding it in this one call site would split the credential plumbing across peer paths that already lack it (copy_file_from_source, batch_delete, …). Revisit when an outgoing-JWT helper lands. Fixed in this commit: * Handlers: drop the two-tier (local-first, then distributed) read in handlers.rs. read_ec_shard_needle_distributed already does the local-first pass under the same store read lock; the redundant outer attempt re-read local intervals twice for any needle that spanned mixed-locality shards. * Scanner snapshot: replace inline locate-needle math with `ecv.locate_needle(needle_id)`. Same routine the local-only read path uses, so byte-identical on shard-size + interval boundaries. * EcVolume::set_shard_locations also advances shard_locations_refresh_time so the staleness check honors callers that populate the cache directly without going through the master LookupEcVolume RPC. * parse_grpc_address moved from grpc_server.rs into grpc_client.rs as `pub` and is reused by both grpc_server.rs and the new store_ec module. Single source of truth for the HTTP↔gRPC port-offset convention. * Reconstruction (recover_one_remote_ec_shard_interval) now seeds bufs from locally-mounted survivor shards BEFORE the remote fan-out. Previously the fan-out was remote-only, so when the shard_locations cache was cold or the master lookup failed, reconstruction errored even though enough siblings were on local disk to recover the missing interval. * review: tighten parse_grpc_address; atomic shard-locations cache swap Two follow-up findings from the PR 9516 review round 2: * `parse_grpc_address` now validates BOTH port components in the dotted form (`host:port.grpcPort`) — previously a non-numeric HTTP port like `host:abc.18080` slipped through and tripped a less-useful downstream URI parse error. The implicit form (`host:port` → port + 10000) also gains an overflow check so inputs like `host:60000` (which silently wrap past u16) are rejected here instead of producing an opaque connection failure later. Six unit tests cover each rejection path. * `EcVolume::set_shard_locations` no longer bumps the per-volume refresh timestamp. The previous fix introduced a freshness race: a multi-shard population that inserts shard-by-shard would flip `needs_refresh == false` on the first write, letting a concurrent reader observe a half-populated map already marked "fresh" and return NotFound for the not-yet-inserted shards. Added `EcVolume::replace_shard_locations(map)` for the atomic bulk swap; `write_back_shard_locations` in the distributed-read path uses it so the cache transitions old → fresh in a single observable step.	2026-05-16 20:44:28 -07:00
Chris Lu	f892b445b3	helm(admin): support secretExtraEnvironmentVars (refs #9511 ) (#9513 ) * helm(admin): support secretExtraEnvironmentVars The admin statefulset only honored extraEnvironmentVars, forcing the OIDC client secret (and any other sensitive WEED_* value) to be inlined as plain text in values.yaml — not GitOps-friendly. The filer chart has had secretExtraEnvironmentVars for this exact case; mirror that pattern on admin so secrets can be projected via valueFrom.secretKeyRef. Surfaced by an enterprise OIDC deployment (issue #9511) where the only workaround was hardcoding WEED_ADMIN_OIDC_CLIENT_SECRET in values.yaml. * helm(admin): sort secretExtraEnvironmentVars keys for stable output Helm/Go template map iteration is non-deterministic, so the env entries could shuffle between renders and trigger spurious StatefulSet rollouts in GitOps tooling (ArgoCD/Flux). Sort the keys with sortAlpha, mirroring the extraEnvironmentVars block immediately above. Flagged by gemini-code-assist and coderabbitai on PR #9513.	2026-05-15 13:19:05 -07:00
Chris Lu	62821964dd	filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509 ) (#9514 ) PR #9442 made the filer refuse to register the IAM gRPC service unless jwt.filer_signing.key was set in security.toml, which broke the admin UI Users/Groups/Policies pages for every deployment that ships without a security.toml — weed mini, plain Helm, vanilla weed filer. The Users tab returns Unimplemented and the page is unusable. Issues #9504, #9505 and #9509 all trace to this gap. The rest of the filer's gRPC surface is unauthenticated by default; treat IAM the same way. The service now always registers, and the auth gate is a no-op when no signing key is configured. When the key is set, every RPC still requires an admin-signed Bearer token, matching the post-#9442 behaviour. Operators who expose the filer gRPC port beyond a trusted network should set the key on both filer and admin. The admin client (IamGrpcStore.withIamClient) already skips attaching the authorization metadata when its key is empty, so no changes there.	2026-05-15 13:15:20 -07:00
Konstantin Lebedev	7d1b16fbcd	fix: ListBucketsHandler for pathStyleDomains (#9510 )	2026-05-15 13:12:55 -07:00
Chris Lu	2ed95d7ea9	helm: decouple JWT signing from cert-manager mTLS (fixes #9506 ) (#9508 ) * helm(security): decouple JWT signing from cert-manager mTLS The filer needs jwt.filer_signing.key to register the IAM gRPC service the Admin UI Users tab calls (PR #9442). The chart only rendered security.toml under enableSecurity, which also pulls in cert-manager for mTLS — much heavier than the Admin UI needs. Operators on Helm without cert-manager have no way to flip the JWT key on, so the Users tab fails with Unimplemented after upgrading past 4.24. Introduce seaweedfs.securityConfigEnabled, true when enableSecurity OR any explicit jwtSigning toggle (volumeRead/filerWrite/filerRead) is set. The configmap renders under that helper; the [grpc.]/[https.] sections inside stay gated on enableSecurity. Each pod template splits the security-config mount onto the helper and keeps the cert volume mounts on enableSecurity. volumeWrite is intentionally excluded from the helper trigger because it defaults to true; including it would silently start mounting security.toml on every fresh install. With this change, enableSecurity=false + defaults renders nothing (unchanged), enableSecurity=true renders the full toml (unchanged), and enableSecurity=false + filerWrite=true renders just the [jwt.] sections so the Admin UI works without mTLS. Fixes #9506. helm(security): trim verbose comments * helm(security): handle null securityConfig in helper Address review feedback: (.Values.global.seaweedfs.securityConfig).jwtSigning errored if a user explicitly set securityConfig: null in their values. Drop into intermediate $sec/$jwt with default dict at each step so a missing or nulled-out parent is tolerated. * helm(ci): cover IAM gRPC decoupling (issue #9506) Five regression assertions exercised against the rendered chart so a future change cannot silently re-couple jwt.filer_signing to mTLS: 1. defaults render no security-config ConfigMap (preserves baseline) 2. filerWrite=true alone renders [jwt.filer_signing] with no [grpc.] 3. filerWrite=true mounts security-config on filer + admin without pulling in cert volumes — the actual fix for the Admin UI Users tab 4. enableSecurity=true still produces the full toml with [grpc.master] 5. securityConfig=null and securityConfig.jwtSigning=null both render cleanly (gemini-code-assist review nit, applied chart-wide) Patch a pre-existing direct-access in filer-statefulset.yaml that crashed on securityConfig=null, surfaced by the new null assertion. helm(ci): drop issue numbers from comments * helm(ci): install pyyaml; assert [jwt.signing] in mTLS path Address coderabbit review: - The new IAM gRPC test block uses `import yaml` but ran before the later `pip install pyyaml -q` step that the security+S3 block performs. CI happens to pass because the runner image carries PyYAML, but make the dependency explicit so a future runner change cannot silently break the regression test. - The enableSecurity=true assertion only checked for [grpc.master]. Also assert [jwt.signing] so a refactor that drops the volume-side JWT stanza from the mTLS path fails the test instead of slipping through.	2026-05-14 23:43:24 -07:00
Chris Lu	bfb2661fec	fix(tests): make 32-bit GOARCH tests build and run (#9507 ) fix(tests): make 32-bit GOARCH tests build and run (#9503) verifyTestFilerClient had bare int64 atomic counters after a map header, so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on linux/386. Switch to atomic.Int64, which the stdlib guarantees is 8-byte aligned on all platforms. rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to t.Errorf, where it default-promoted to int and overflowed 32-bit int. Bind it to a typed uint32 const used in both the comparison and the error message.	2026-05-14 20:55:37 -07:00
Chris Lu	7acba59a5c	4.25 4.25	2026-05-14 12:16:26 -07:00
Chris Lu	2c1482f7a6	fix(ec): clear cross-server stale EC shards before re-distribute (#9478 ) (#9499 ) * fix(ec): clear cross-server stale EC shards before re-distribute (#9478) A previous failed encode leaves partial .ec?? shards mounted on destination volume servers that are not the .dat owner. PR #9480 only prunes when the .dat sits on a sibling disk of the SAME store, so the cross-server case stays stuck: every retry trips volume_grpc_copy.go:570's "ec volume %d is mounted; refusing overwrite" guard and the scheduler loops. Detection already lists existing EC shards as CleanupECShards sources; plumb the shard ids through (ActiveTopology.GetECShardLocations, TaskSourceSpec, TaskSource.shard_ids) and have the EC worker call VolumeEcShardsUnmount + VolumeEcShardsDelete on each destination after the local shard set is generated and before distributeEcShards. Skip EC-shard sources in getReplicas so the post-encode VolumeDelete step does not target destination-only nodes. Integration test mounts a partial shard subset, asserts the mounted-volume refusal, runs cleanupStaleEcShards, and asserts the next ReceiveFile lands. * chore(ec): tighten code comments in stale-shard cleanup Drop issue-number refs from code comments and shorten the docstrings on cleanupStaleEcShards / unmountAndDeleteEcShards / getReplicas plus the new test file. Behavior unchanged. * fix(ec): skip empty-ShardIds locations; dedupe getReplicas by node GetECShardLocations dropped entries where ecShardMatchesCollection saw a phantom info record with EcIndexBits=0 — without ShardIds, getReplicas misread the resulting source as a regular replica and would have called VolumeDelete on a destination-only node. getReplicas now dedupes by Node since VolumeDelete is server-wide; per-disk source rows on the same server collapse to one call. * refactor(ec): use MaxShardCount and ShardBits in collectShardIdsForDisk Drop the literal 32 bit-iteration bound for erasure_coding.MaxShardCount and treat the EcIndexBits union as a ShardBits so Count() drives the slice preallocation. Keeps the helper aligned with the rest of the EC code and survives any future expansion of the shard-count ceiling.	2026-05-14 11:57:45 -07:00
Chris Lu	e56a3ee4a2	ci(s3-lifecycle): split into per-test matrix jobs Each test now runs against a fresh `weed mini`, so per-collection TTL volume budget no longer leaks across tests and exhausts the pool.	2026-05-14 11:47:24 -07:00
Chris Lu	db2d975b80	ci(docker): tag latest in unified release instead of rebuilding (#9500 ) The separate container_latest.yml workflow rebuilt the latest image from scratch on every tag push (full multi-arch build + QEMU + trivy gate), which is slow and frequently fails — leaving `latest` stranded on the prior release (e.g. 4.23 after 4.24 shipped, #9497). Drop the rebuild. The unified release workflow already publishes the exact same content as `<tag>` and `<tag>_large_disk`, so just re-tag those manifests with `crane tag` on both GHCR and Docker Hub once copy-to-dockerhub completes. Seconds, not hours, and no QEMU. Move the trivy scan into the unified workflow as report-only: SARIF still uploads to GitHub Security for visibility, but vuln findings no longer block the release. container_latest.yml stays as a workflow_dispatch-only manual fallback. Refs #9497.	2026-05-14 11:26:28 -07:00
Chris Lu	c47eab1a5d	admin: attach admin-signed Bearer token on filer IAM gRPC calls (#9498 ) * admin: attach admin-signed Bearer token on filer IAM gRPC calls PR #9442 added Bearer-JWT enforcement on the filer's IAM gRPC service but didn't update its only production client, IamGrpcStore. The admin UI Users/Groups pages went through that client and started failing in 4.24 with either Unimplemented (filer refuses to register the service when jwt.filer_signing.key is empty) or Unauthenticated (the client sent no token). Issues #9495 and #9496 both trace to this gap. Plumb jwt.filer_signing.key into IamGrpcStore via a new SetAdminSigning hook called from the admin server, and append a freshly minted Bearer token to outgoing metadata on every call. The mint helper security.GenJwtForFilerAdmin existed since #9442 but had no production caller; this wires it up. Add an integration test alongside grpc_store.go that runs a real IamGrpcServer over a real grpc.Server listener and exercises the store end-to-end: matching key succeeds, wrong key returns Unauthenticated, no key returns Unauthenticated. Without the client-side token attach the success path fails, so the regression cannot land again. * address review: include adminSigningExpiresAfterSec in mu comment	2026-05-14 10:51:04 -07:00
Chris Lu	4bac9985b4	fix(build): pin apache/thrift to v0.22.0 for 32-bit GOARCH thrift v0.23.0 uses math.MaxUint32 as an untyped int constant in lib/go/thrift/framed_transport.go:206, which overflows int on 32-bit targets (openbsd/arm, linux/arm, freebsd/arm, netbsd/arm) and breaks the release binary builds. 4.24	2026-05-13 20:56:32 -07:00
Chris Lu	1dfea8a502	4.24	2026-05-13 19:56:22 -07:00
Chris Lu	3a8389cd68	fix(ec): verify full shard set before deleting source volume (#9490 ) (#9493 ) * fix(ec): verify full shard set before deleting source volume (#9490) Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned — even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at next restart. - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry. - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo. - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs. The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in Completed state. * fix(ec): trim doc comments on the new shard-verification path Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites). * fix(ec): drop issue-number anchors from new comments Issue references age poorly — the why behind each comment already stands on its own. * fix(ec): parametrize RequireFullShardSet on totalShards Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio. * test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented — the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end. * fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently. * fix(ec/decode): verify regenerated .dat before deleting EC shards After mountDecodedVolume succeeds, the previous code immediately unmounts and deletes every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end. * fix(ec): address PR review nits in verification + fake server - Drop unused ServerShardInventory.Sizes field. - Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow, this is for clarity). - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process.	2026-05-13 19:29:24 -07:00
Chris Lu	0dde6a8c84	refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout (#9494 ) * refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout "Per-Run Time Limit (minutes)" duplicated the admin scheduler's "Execution Timeout (s)" — both are wall-clock caps on the same Execute call, stacked via context.WithTimeout. Whichever was shorter won. Under defaults the scheduler's 90s timeout always clobbered the worker's 60-min cap, so the "Per-Run Time Limit" knob was effectively dead unless an operator also raised Execution Timeout, and operators had to keep two values in agreement. Remove the worker-side knob and declare a sane scheduler default on the handler descriptor: - WorkerConfigForm: nil (was: one section with one field) - Config.MaxRuntime removed; ParseConfig drops max_runtime_minutes - Handler no longer wraps ctx in context.WithTimeout(MaxRuntime); runCtx is just the ctx the scheduler passes - AdminRuntimeDefaults.ExecutionTimeoutSeconds = 3600 (1h) and JobTypeMaxRuntimeSeconds = 3600 — the scheduler's global 90s default would otherwise kill every real run Tests: - TestParseConfigDefaults loses the MaxRuntime check; new TestParseConfigIgnoresWorkerValues documents the contract - TestDescriptor_WorkerConfigFormIsAbsent pins that the form is gone so a future re-add forces a conscious revisit - TestDescriptor_AdminRuntimeDefaultsBoundExecutionTimeout pins the 1h default with a comment about the 90s scheduler floor * fix(s3/lifecycle): no per-pass timeout by default Lifecycle is a scheduled batch — its natural duration is "as long as today's events take." The 1h default ExecutionTimeoutSeconds from the previous commit was still a footgun: too low truncates legitimate large-bucket passes; too high makes the value meaningless. Set both ExecutionTimeoutSeconds and JobTypeMaxRuntimeSeconds to math.MaxInt32 (~68 years) to say "no timeout in practice" in a code-review-readable way. Operators who genuinely want a wall-clock cap can set one in the admin UI; the scheduler's context.WithTimeout machinery is unchanged (we just hand it an effectively-infinite duration). Note: the scheduler floors ExecutionTimeout at 90s (defaultScheduledExecutionTimeout in weed/admin/plugin/plugin_scheduler.go), so 0 doesn't mean "unlimited" — it clamps back to 90s. A literal math.MaxInt32 is the way to express the intent without touching the shared scheduler code. Test updated to pin math.MaxInt32 and document the rationale so a future tighter cap fails the test and forces conscious revisit.	2026-05-13 19:29:06 -07:00
Chris Lu	e9bcb8f4ad	docs(s3/lifecycle): refresh DESIGN.md as-built (#9491 ) * docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages DESIGN.md was written as a phased implementation plan ("Phase 2 will ship X, Phase 4 will ship Y"). All phases are now merged, plus the post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially changed the worker model (single subscription, walker throttle, observability gauges). Rewrite the doc in present tense describing what's actually there. Net changes vs the prior plan-style doc: - Algorithm pseudo-code reflects the single-subscription fan-out plus walkedThisPass within-pass guard. - Walker invocation table replaces the implicit "two distinct calls" prose with three call sites (recovery / steady-state / empty-replay) and their throttle gates. - New section on the subscription model (one Reader, ShardPredicate, fan-out by ev.ShardID). - New section on cursor.LastWalkedNs and the WalkerInterval throttle. - Observability section: gauges, heartbeat tokens, what each means. - "Implementation history" table maps phases to merged PRs. - "Future work" lists the four optimizations we deferred (long-lived subscription, bucket-coordinated walker, per-bucket lag metric, filer meta-log retention). Drop the "Phase N — ..." narrative from the bottom; the PR history table is the durable artifact now. Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for the operator-facing docs. README explains the sync workflow with the external seaweedfs.wiki.git repo. Five pages: - Home.md — landing page, supported rule shapes, what the worker does - Operator-Guide.md — config knobs, when to change each, walker interval recommendations by cluster size - Monitoring.md — Prometheus metric reference + heartbeat token table + suggested PromQL alerts - Troubleshooting.md — stuck cursor, walker stuck, failure outcomes, cursor schema for manual inspection - Architecture.md — high-level overview for newcomers; sits between Home.md (operator) and DESIGN.md (developer) * docs(s3/lifecycle): address PR review feedback on docs Coderabbit + gemini findings on #9491: - Monitoring.md: clarify the "matches all dispatched" phrasing; note that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value (shouldn't appear in healthy systems); filter PromQL alerts to ignore zero-valued gauges so fresh-install heartbeats don't trip. - Operator-Guide.md, Troubleshooting.md: clarify weed shell -master format as host:http_port.grpc_port (SeaweedFS ServerAddress). - Troubleshooting.md: pause the s3_lifecycle job in the admin UI before manually editing a cursor file, otherwise the worker's save races with the operator's edit. - Architecture.md, Home.md, Operator-Guide.md, Monitoring.md, Troubleshooting.md, DESIGN.md: add language tags (`text`) to fenced code blocks for markdownlint MD040 compliance. - DESIGN.md: standardize on the S3 spec rule names (`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`, `AbortIncompleteMultipartUpload`) and add a one-line note mapping them to the engine's `ActionKind` constants. - README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the sync workflow so the `cp` commands' repo-root-relative paths work whether the operator's shell is at the repo root or at docs/wiki/s3-lifecycle/. - Home.md: was lagging the wiki-repo merged version (had the older pre-merge content). Re-sync from the wiki repo so source matches. docs(s3/lifecycle): remove wiki pages from PR The wiki pages belong in seaweedfs.wiki.git, not the main repo. The source-of-truth concern that motivated adding them here is real but the cost — every code-review touchpoint requires reviewers to load operator-facing pages too — outweighs it. The wiki pages are already pushed locally (~/dev/seaweedfs.wiki); they'll publish on the operator-side workflow. This PR remains scoped to DESIGN.md (the developer-facing reference that does belong with the code). * docs(s3/lifecycle): drop Implementation history section git log is the durable record of what shipped when; the prose table duplicates it and goes stale faster than commit metadata. * docs(s3/lifecycle): soften 'exactly once per run' in Goal The prior phrasing overstated the guarantee versus the failure model documented later in the same file. Reword to: 'process due objects each pass; retryable/blocked outcomes get retried from the cursor on later runs.' Surfaces the head-of-line-blocking semantics up front so the rest of the doc reads consistently. Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those pages live in the wiki repo, not the main repo.	2026-05-13 17:06:14 -07:00
Chris Lu	813f1351f8	feat(s3/lifecycle): enable scheduler by default (#9492 ) S3 lifecycle is a standard bucket feature — operators set PutBucketLifecycleConfiguration through the S3 API expecting the configured expirations to actually fire. With the prior default (scheduler enabled=false), buckets with lifecycle XML silently retained data past their declared expiration until an operator noticed and turned the scheduler on. The failure mode of enabled-by-default is "worker runs every day and fast-exits on buckets with no lifecycle rules" — cheap. The failure mode of disabled-by-default is "data lingers, looks like it expired, doesn't" — bad. Enabled-by-default matches both the AWS S3 default behavior and the operator's natural mental model. Operators who want the worker off can still disable it via the admin UI; once a persisted config exists, this descriptor default no longer applies (the persisted Enabled state wins). Test pins the choice so a future flip to false fails loud.	2026-05-13 16:57:10 -07:00
dependabot[bot]	453c735d02	build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 in /test/kafka (#9489 ) build(deps): bump github.com/go-git/go-billy/v5 in /test/kafka Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.8.0 to 5.9.0. - [Release notes](https://github.com/go-git/go-billy/releases) - [Commits](https://github.com/go-git/go-billy/compare/v5.8.0...v5.9.0) --- updated-dependencies: - dependency-name: github.com/go-git/go-billy/v5 dependency-version: 5.9.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2026-05-13 14:45:44 -07:00
Chris Lu	d5e54f217d	feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486 ) Operator visibility was the last item on the daily-replay must-have list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing ever set it — leftover from the streaming worker that got deleted. Wire it up and add a parallel one for the walker so a single PromQL query answers "is this thing working?": - `cursor_min_ts_ns{shard}` set after each cursor save. Operators read `now - cursor_min_ts_ns` as the per-shard replay lag. - `daily_run_last_walked_ns{shard}` new — set in parallel so operators can confirm WalkerInterval is actually being honored. A stuck value means the scheduler isn't invoking the worker, the throttle is too long, or the walker is failing. - saveCursorAndPublish wraps every Save call site in runShard so the gauges and the persisted state stay aligned (gauges only advance on successful saves). - Enhance the `daily_run: status=... duration=...` heartbeat with `cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep. Existing tokens stay positional-stable; new ones append at the end. Marker `cold` distinguishes "not started" from "0s caught up." Tests pin the summary line: cold-start state, max-across-shards selection, and partial-fill (some shards drained, others walked). Stacked on #9485.	2026-05-13 14:18:35 -07:00
Chris Lu	bbc075b353	feat(s3/lifecycle): plumb WalkerInterval through worker admin config (#9485 ) * feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval The steady-state and empty-replay walker fired on every dailyrun.Run invocation, which is fine when Run is called at the bucket-walk cadence the operator intends (e.g., once per hour or once per day), but catastrophic when a fast driver like the s3tests CI workflow or the admin worker scheduler invokes Run at multi-second cadence — each tick ran a full subtree scan per shard, crushing the filer. Decouple walker cadence from Run() invocation cadence: persist LastWalkedNs in the per-shard cursor and fire the steady-state / empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval. Cold-start and recovery walker fires (RecoveryView) stay unconditional since those are bounded events that must run when their trigger condition (no cursor, hash mismatch) is met. Recovery walker fires also update LastWalkedNs so the subsequent steady-state pass doesn't double-walk. cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the in-repo integration tests and s3tests fast driver continue to work unchanged. Production deployments should set this to the walk cost budget (typically 1h-24h depending on cluster size). Cursor file is back-compat: last_walked_ns is omitempty, so cursor files written before this change decode as LastWalkedNs=0, which walkerDue treats as "never walked steady-state" → walker fires next pass to establish the anchor (same path a cold-start cursor takes). No version bump. Operator surface for WalkerInterval is the dailyrun.Config struct; plumbing through worker.tasks.s3_lifecycle.Config and the admin schema is a follow-up. * fix(s3/lifecycle): suppress walker double-fire within a single pass Two gemini-code-assist findings: 1. walkerDue with interval=0 returned true even when lastWalkedNs == runNow.UnixNano() — the cold-start / recovery branch already fired the walker this pass, and the steady-state fall-through fired it again. RecoveryView is a superset of every per-shard partition, so the second walk added zero coverage and burned a full subtree scan. Add a within-pass guard at the front of walkerDue: if the cursor's LastWalkedNs equals runNow's UnixNano, the walker already ran this pass — skip. 2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue instead of the local lastWalkedNs variable the rest of runShard threads through. Trivially equal at this point in the function, but the inconsistency would mask a future bug if any code above the branch ever sets lastWalkedNs. Test updates: TestWalkerDue gains the within-pass guard case plus a companion "earlier same pass still fires" sanity check. TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration: cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly once, not twice. * fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard Two coderabbit findings: 1. validate() now rejects negative cfg.WalkerInterval. A typo like -1h previously fell through walkerDue's `interval <= 0` branch and silently re-enabled "walk every pass" — the exact behavior the throttle was added to prevent. The admin-config parser already clamps negative input to zero, but callers using dailyrun.Config directly (tests, embedders) now get a loud error instead. 2. Within-pass double-fire suppression moves out of walkerDue and into runShard's walkedThisPass local flag. walkerDue's equality check (lastWalkedNs == runNow.UnixNano) was correct in production (each pass freezes runNow at time.Now().UTC, no collisions) but fragile in tests that inject the same runNow across distinct passes — the test would see false suppression. Separating the concerns also makes walkerDue answer one question (persisted-state throttle) and runShard another (within-pass call-site dedup). walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval pins the new validation. TestWalkerDue's within-pass cases move out (the function is pure throttle now); TestRunShard_ColdStartDoesNot DoubleWalk still pins the integration behavior end-to-end. * feat(s3/lifecycle): plumb WalkerInterval through worker admin config #9484 added cfg.WalkerInterval to dailyrun.Config but left the worker side wired to zero — operators couldn't actually use the throttle without recompiling. Add the admin-schema knob: - New constant WalkerIntervalMinutesAdminKey = "walker_interval_minutes" follows the MetaLogRetentionDaysAdminKey pattern (Int64, minutes unit, 0 = unbounded / fire every pass). - New Config.WalkerInterval populated in ParseConfig from adminValues; negative / zero stay at zero so the prior "fire every pass" semantics keep the in-repo integration tests and the s3tests sub-minute driver working unchanged. - handler.go: admin form field with operator-facing label and description, default in DefaultValues, value forwarded to dailyrun.Run via cfg.WalkerInterval. Tests cover the default-zero, positive, and negative cases — same shape as the MetaLogRetention tests so the parsing contract stays consistent. Stacked on #9484; rebase after that lands.	2026-05-13 14:09:31 -07:00
Chris Lu	c6582228b8	feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484 ) * feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval The steady-state and empty-replay walker fired on every dailyrun.Run invocation, which is fine when Run is called at the bucket-walk cadence the operator intends (e.g., once per hour or once per day), but catastrophic when a fast driver like the s3tests CI workflow or the admin worker scheduler invokes Run at multi-second cadence — each tick ran a full subtree scan per shard, crushing the filer. Decouple walker cadence from Run() invocation cadence: persist LastWalkedNs in the per-shard cursor and fire the steady-state / empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval. Cold-start and recovery walker fires (RecoveryView) stay unconditional since those are bounded events that must run when their trigger condition (no cursor, hash mismatch) is met. Recovery walker fires also update LastWalkedNs so the subsequent steady-state pass doesn't double-walk. cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the in-repo integration tests and s3tests fast driver continue to work unchanged. Production deployments should set this to the walk cost budget (typically 1h-24h depending on cluster size). Cursor file is back-compat: last_walked_ns is omitempty, so cursor files written before this change decode as LastWalkedNs=0, which walkerDue treats as "never walked steady-state" → walker fires next pass to establish the anchor (same path a cold-start cursor takes). No version bump. Operator surface for WalkerInterval is the dailyrun.Config struct; plumbing through worker.tasks.s3_lifecycle.Config and the admin schema is a follow-up. * fix(s3/lifecycle): suppress walker double-fire within a single pass Two gemini-code-assist findings: 1. walkerDue with interval=0 returned true even when lastWalkedNs == runNow.UnixNano() — the cold-start / recovery branch already fired the walker this pass, and the steady-state fall-through fired it again. RecoveryView is a superset of every per-shard partition, so the second walk added zero coverage and burned a full subtree scan. Add a within-pass guard at the front of walkerDue: if the cursor's LastWalkedNs equals runNow's UnixNano, the walker already ran this pass — skip. 2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue instead of the local lastWalkedNs variable the rest of runShard threads through. Trivially equal at this point in the function, but the inconsistency would mask a future bug if any code above the branch ever sets lastWalkedNs. Test updates: TestWalkerDue gains the within-pass guard case plus a companion "earlier same pass still fires" sanity check. TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration: cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly once, not twice. * fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard Two coderabbit findings: 1. validate() now rejects negative cfg.WalkerInterval. A typo like -1h previously fell through walkerDue's `interval <= 0` branch and silently re-enabled "walk every pass" — the exact behavior the throttle was added to prevent. The admin-config parser already clamps negative input to zero, but callers using dailyrun.Config directly (tests, embedders) now get a loud error instead. 2. Within-pass double-fire suppression moves out of walkerDue and into runShard's walkedThisPass local flag. walkerDue's equality check (lastWalkedNs == runNow.UnixNano) was correct in production (each pass freezes runNow at time.Now().UTC, no collisions) but fragile in tests that inject the same runNow across distinct passes — the test would see false suppression. Separating the concerns also makes walkerDue answer one question (persisted-state throttle) and runShard another (within-pass call-site dedup). walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval pins the new validation. TestWalkerDue's within-pass cases move out (the function is pure throttle now); TestRunShard_ColdStartDoesNot DoubleWalk still pins the integration behavior end-to-end.	2026-05-13 14:09:13 -07:00
Lars Lehtonen	75c807b586	chore(weed/mq/kafka/protocol): remove unused functions and variables (#9488 )	2026-05-13 13:59:24 -07:00
Chris Lu	d5c0a7b153	fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487 ) * fix(master): include GrpcPort in LookupEcVolume response LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume builds Location with only Url / PublicUrl / DataCenter, so callers fall back to ServerToGrpcAddress (httpPort + 10000). On any deployment where that convention does not hold — multi-disk integration tests, custom port layouts — EC reads dial the wrong port and quietly degrade to parity recovery. * fix(volume/ec): probe every DiskLocation when serving local shard reads reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the DiskLocation that physically owns it, so a multi-disk volume server can hold shards for the same vid in two separate ecVolumes — one per disk — with .ecx on whichever disk owned the original .dat. The read path only consulted the single EcVolume FindEcVolume picked, so requests for shards on the sibling disk fell through to errShardNotLocal and then to remote/loopback recovery. Walk all DiskLocations after the first probe in both readLocalEcShardInterval and the VolumeEcShardRead gRPC handler; the latter also covers the loopback that recoverOneRemoteEcShardInterval falls back to when a peer dial fails. * test(volume/ec): cover the multi-disk EC lifecycle end-to-end Two integration tests against a real volume server with two data dirs: TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read -> drop .dat -> stop -> redistribute shards across disks -> restart -> verify reconcileEcShardsAcrossDisks attached the orphan shards and reads still work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild pulls input from both disks -> reads still work. TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478 reproducer at the cluster level: seed a healthy .dat on disk 0, plant the on-disk footprint of an interrupted EC encode on disk 1, restart, and assert pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0. Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous run's volume.log is rotated to volume.log.previous so a startup regression on the second run does not lose the first run's diagnostics. * review: trim verbose comments * review: drop racy fast-path, use locked findEcShard directly gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent mount / unmount could race with it. findEcShard already walks every DiskLocation under the right lock, so the fast-path adds nothing but the race. Collapse both call sites to a single locked call. Also note in RestartVolumeServer why the log-rotation error is swallowed: absence on first call is benign; anything else surfaces in the next os.Create in startVolume.	2026-05-13 13:56:20 -07:00
Chris Lu	79859fc21d	feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468 ) * feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd Three diagnostic additions on top of #9460, all aimed at making the next production incident faster to triage than the one we just spent hours on. 1. [versioning-heal] grep prefix on every heal-related log line, with a small fixed event vocabulary (produced / surfaced / healed / enqueue / drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed / teardown_failed / queue_full). One grep gives operators a single event stream across the produce-to-drain lifecycle. 2. Escalate the "scanned N>0 entries but no valid latest" case in updateLatestVersionAfterDeletion from V(1) Infof to a Warning that names the orphan entries it saw. This is the listing-after-rm inconsistency signature that pinned down 259064a8's failure — it should not be invisible at default log levels. 3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]` that walks .versions/ directories under a prefix and reports the stranded population. With -heal it clears the latest-version pointer in place on stranded directories so subsequent reads return a clean NoSuchKey instead of replaying the 10-retry self-heal loop. * fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry Address PR review: 1. s3.versions.audit walked only the first 1024-entry page of each .versions/ directory, false-positiving "stranded" on large dirs. Loop until the page returns < 1024 entries, advancing startName. 2. clean and orphan-only categories double-counted when a directory had no pointer and at least one orphan: incremented both. Make them mutually exclusive so report totals sum to versionsDirs. 3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep, non-interruptible by ctx. A server shutdown / client disconnect would wait out the budget per in-flight delete. Thread ctx through deleteSpecificObjectVersion -> repointLatestBeforeDeletion / updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context(); gRPC lifecycle handlers pass the stream ctx. New test pins the behavior: cancelling ctx mid-backoff returns ctx.Err() in <500ms instead of blocking ~6.3s. * fix(s3/versioning): clearStale outcome + escape grep-able log fields Two coderabbit follow-ups: 1. Successful pointer clear should suppress `produced`. updateLatestVersionAfterDeletion's transient-rm fallback called clearStaleLatestVersionPointer best-effort, then unconditionally returned retryErr. The caller (deleteSpecificObjectVersion) saw the error and emitted `event=produced` + enqueued the reconciler, even though clearStaleLatestVersionPointer had just driven the pointer to consistency and the next reader would get NoSuchKey via the clean-miss path. Make clearStaleLatestVersionPointer return cleared bool; on success the caller returns nil so neither produced nor the reconciler enqueue fires. Concurrent-writer aborts, re-scan errors, and CAS mismatches still report false so genuinely stranded state keeps surfacing. 2. Escape user-controlled fields in heal log lines. versioningHealInfof / Warningf / Errorf interpolated raw bucket / key / filename / err text into a single-space-separated line. An S3 key (or error string from gRPC) containing whitespace, newlines, or `event=...` could split one event into multiple tokens and spoof fake fields downstream. Sanitize each arg in the helper: safe values pass through; anything with whitespace, quotes, control chars, or backslashes is replaced with its strconv.Quote form. No caller changes — the format strings remain unchanged. Tests pin both behaviors: sanitization table covers the field boundary cases; an end-to-end shape test confirms a key containing `event=spoof` stays inside a single quoted token.	2026-05-13 10:48:58 -07:00
Chris Lu	e025ec2334	fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends (#9483 ) * fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends NewSortedFileNeedleMap never initialised the inherited baseNeedleMapper.indexFileOffset, so it stayed at zero. The compact and leveldb constructors both seed it from stat().Size(); this one did not. When a read-only or noWriteCanDelete volume processed a Delete, the inherited appendToIndexFile wrote the tombstone via WriteAt at indexFileOffset=0 and advanced 16 bytes at a time. Every delete since the SortedFileNeedleMap path was first exercised for writes (#5633, which switched .sdx to O_RDWR) overwrote the front of .idx with tombstones for unrelated keys. Net effect on disk: the first N entries of .idx become sequentially written tombstones with .dat offsets at the tail, while the original Put records that lived in slots [0..N) are gone. fsck sees ~N phantom orphans whose keys' runtime needle-map entries are already tombstoned in .sdx, so the volume server replies 304 to every purge and the orphan count is stable across retries. Stat the .idx at open and seed indexFileOffset = size. The .idx now grows on delete instead of being clobbered. Add a regression test that populates .idx via the compact map, opens it as a SortedFileNeedleMap, deletes one needle, and verifies the file grew by one entry, the original Put records survived, and the tombstone landed at the tail. The fix only stops further corruption. .idx files already damaged by this bug have to be rebuilt from .dat before the next restart, or isSortedFileFresh will regenerate .sdx from the bad .idx and propagate the damage. Refs #9479 * test: harden SortedFileNeedleMap regression assertions Address PR review (gemini-code-assist, coderabbit): - Close the writer before t.Fatalf in the Put loop so a failing seed doesn't leak the .idx file descriptor. - Verify offset and size of preserved Put records, not just the key. Front-overwrite damage would clobber all three fields, but a key-only check would miss a different regression that corrupted offset/size while leaving the key intact.	2026-05-13 10:22:01 -07:00
Chris Lu	f5a4bfb514	fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460 ) * fix(s3/versioning): repair dangling latest-version pointer after partial delete deleteSpecificObjectVersion did two non-atomic filer ops: rm the version blob, then update the .versions/ pointer. Step 2 failures were silently logged and the client got 204 OK, so any transient blip (filer timeout, process restart between RPCs, lock contention) left the .versions/ directory naming a missing file. Subsequent GETs paid the 10-retry self-heal cost and returned NoSuchKey — surfacing as "Storage not found" to Veeam, which is what triggered this investigation. Three changes: 1. Pre-roll the pointer for the singleton / multi-version-deleting-latest cases. The pointer is repointed (multi) or cleared (singleton) before the blob rm. A failure between leaves a recoverable orphan blob — pointer is consistent, GETs succeed or correctly miss without entering the stale-pointer self-heal path. 2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion with bounded retries (~6.3s worst case). When retries are exhausted the function now returns a non-nil error instead of swallowing it. The caller logs at Error level and queues the path for the reconciler. 3. Background reconciler drains stranded .versions/ pointer-to-missing states off the hot path. Bounded in-memory queue with capped retries; read-path heal remains as a last-resort safety net. * fix(s3/versioning): address review on #9460 Four fixes addressing review on PR #9460. All four are correctness; no behavioural change for the happy path. 1. repointLatestBeforeDeletion: discriminate NotFound from transient errors when re-fetching the .versions/ entry. Previously any error returned rolled=true,nil — a transient filer hiccup at that point would cause the caller to skip the post-delete reconciliation AND proceed with the blob rm, producing exactly the dangling pointer state the PR aims to prevent. NotFound stays "vacuously consistent" (directory already gone); other errors surface so the caller aborts before removing the blob. 2. Move the singleton .versions/ teardown out of repointLatestBeforeDeletion (where it ran BEFORE the blob rm and always failed with "non-empty folder") into deleteSpecificObjectVersion AFTER the blob rm. Adds a wasSingleton return value so the caller knows when to run the teardown. Without this, every singleton-version delete in a versioned bucket leaked an empty .versions/ directory. 3. Wrap the list, getEntry, and mkFile calls inside repointLatestBeforeDeletion with retryFilerOp so the pre-roll has the same transient-failure resilience as the post-roll path. Without retries, a single transient blip causes the caller to fall back to the legacy non-atomic flow even when the filer recovers immediately. 4. healVersionsPointer in the reconciler: same NotFound-vs-transient discrimination on both the .versions/ getEntry and the latest-file presence probe. Previously a transient filer error would silently evict the candidate from the queue as "healed", leaving the real stranded state until a client read happened to surface it. Also fixes the gemini-flagged consistency nit: the queued-for-reconciler error log now uses normalizedObject instead of object so it matches the queue entry's key. * fix(s3/versioning): short-circuit terminal errors in retryFilerOp Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound, gRPC NotFound, context.Canceled, and context.DeadlineExceeded. retryFilerOp now bails immediately on a terminal error and returns it unwrapped, so callers like repointLatestBeforeDeletion.getEntry and updateLatestVersionAfterDeletion.rm see the raw NotFound instead of paying the ~6.3 s retry-budget delay AND parsing it out of an "exhausted N retries" wrapper. errors.Is and status.Code already walk the %w chain so today's call sites still work, but the delay was real on the hot DELETE path whenever a key was genuinely absent. Test added covering all five terminal-error shapes — each must run the wrapped fn exactly once and return in under 50 ms.	2026-05-13 10:14:27 -07:00
Chris Lu	de28c4df61	fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478 ) (#9480 ) * fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) handleFoundEcxFile only checks for .dat in the same disk location as the EC shards. In a multi-disk volume server an interrupted encode can leave .ec?? + .ecx on disk B while the source .dat still lives on disk A: the per-disk loader sees no .dat next to .ecx, mistakes the leftover for a distributed-EC layout, and mounts the partial shards. The volume server then heartbeats both a regular replica and an EC shard for the same vid and the master keeps both. Sweep the store after per-disk loading and before the cross-disk reconcile to delete partial EC files when a healthy .dat for the same (collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for every pruned shard so master forgets the new-shard message the per-disk pass already emitted, instead of waiting for the next periodic heartbeat. * fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478) Rust port of the same Store-level prune added to weed/storage. The per-disk EC loader in disk_location.rs only checks for .dat in the same disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx on disk B while the source .dat sits on disk A is mounted as if it were a distributed-EC layout. The volume server then heartbeats both a regular replica and an EC shard for the same vid. Sweep the store after per-disk loading and before the cross-disk reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT shards when a .dat for the same (collection, vid) exists on a sibling disk, and remove all on-disk EC artefacts for them. The Rust heartbeat path already diff-emits deletes from the next ec_volumes snapshot, so no explicit delete-channel push is needed here. Tests cover both the issue 9478 layout and a distributed-EC layout with no .dat anywhere on the store, which must be left alone. * fix(storage): validate sibling .dat size before deleting partial EC (#9478) The earlier prune deleted partial EC files whenever any .dat for the same vid existed on a sibling disk — including a zero-byte shell. A shell is no more useful than the partial shard it would replace, and the partial shard might still combine with shards on other servers in a recoverable distributed-EC layout. Wiping it based on a corrupt sibling .dat is data loss masquerading as cleanup. Tighten the check: when the EC's .vif recorded a non-zero source size in datFileSize, require the sibling .dat to be at least that many bytes; otherwise fall back to "at least a superblock". The .vif value is what the encoder wrote at the moment the source was sealed, so a sibling .dat smaller than that is provably truncated. Carry the size through indexDatOwners alongside the location. The Rust port had the same gap and an additional bug behind it: EcVolume::new wasn't reading datFileSize from .vif, so the safety check always fell back to the superblock floor. Wire datFileSize through. The existing shard-size calculation in LocateEcShardNeedleInterval already uses dat_file_size when non-zero, so populating it also matches Go's behaviour there. Tests cover the truncated-sibling case in both ports.	2026-05-13 09:25:10 -07:00
Chris Lu	3f1eaf9724	fix(s3/audit): emit audit log for successful GET/HEAD (#9467 ) * fix(s3/audit): emit audit log for successful GET/HEAD Successful GET/HEAD object requests never produced a fluent audit entry because those handlers write the response directly (streaming for GET, WriteHeader for HEAD) and never reach a PostLog call site. The wiki advertises GET as an audited verb, so the asymmetry surprises operators who rely on the log for read-access auditing. Move the safety net into the track() middleware: tag each request with an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it, and emit a single fallback entry after the handler returns when nothing fired. The recorder's status flows into the fallback so the audit row still reflects 200/206 vs 404 etc. No double logging for handlers that already emit (write helpers, error paths, bulk delete). Refs #9463 * fix(s3/audit): defensive nil checks on audit-tracking helpers Address PR review: guard against nil request and nil atomic.Bool stored under the audit-tracking key. The conditions are unreachable today (the key is private and we only ever store new(atomic.Bool)), but the checks are free and keep the helpers safe if a future caller misbehaves. test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469) test(s3/audit): cover track() fallback wiring + cleanup Adds two unit tests in weed/s3api/stats_test.go that exercise the audit-tracking flag set up by track(): one verifies the fallback path fires when a handler writes the response directly (the GET/HEAD object regression in #9463), the other verifies the flag is set when a handler emits PostLog itself so the fallback is skipped. To make the wiring observable without standing up fluent, PostLog now marks the audit flag before short-circuiting on a nil Logger; production behavior is unchanged (no logger, no posting) but the flag stays consistent. Also drops two stale comments in s3api_object_handlers.go that still referenced proxyToFiler — that helper was removed when GET/HEAD started streaming from volume servers directly. Stacks on #9467.	2026-05-13 09:24:59 -07:00
Chris Lu	d5372f9eb7	feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471 ) Phase 4b shipped the walker without plugging it into the cluster rate.Limiter that processMatches honors. A walker hitting a large bucket on the recovery branch could burst LifecycleDelete RPCs past the cluster_deletes_per_second cap that streaming-replay respects. WalkerDispatcher now takes a *rate.Limiter and waits on it before each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds just like processMatches does. The handler passes the same limiter to both paths so replay + walk share one budget; nil disables throttling (unchanged default). Tests pin: the limiter actually delays a dispatch when the burst token is drained, and a ctx cancellation in Limiter.Wait surfaces as an error without sending the RPC.	2026-05-13 09:24:50 -07:00
dependabot[bot]	31c7996671	build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 (#9482 )	2026-05-13 09:19:48 -07:00
Chris Lu	37e505b8fd	refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481 ) * refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards) even though the same gRPC stream could serve every shard in a worker process. Replace with one SubscribeMetadata stream covering all shards in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and a fan-out goroutine routes events to per-shard channels by ev.ShardID. drainShardEvents now reads from a passed-in channel; shards whose persisted cursor is fresher than the global floor (runNow - maxTTL) filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader when the first ev.TsNs > runNow arrives — meta-log order means the rest of the stream is past the pass boundary too. cfg.Workers no longer gates shard concurrency: with the shared subscription, every shard goroutine must be live to drain its channel, or the fan-out stalls. The field is retained for back-compat and ignored. Dispatch throttling still goes through cfg.Limiter. Filer load: 16x -> 1x SubscribeMetadata streams per pass. * fix(s3/lifecycle): shared subscription floor is min(per-shard cursor) The shared subscription used runNow - maxTTL as its starting TsNs, but that's the cold-start floor. For shards whose persisted cursor sits below the floor — exactly the case a rule with TTL == maxTTL produces, where a pending event's PUT TsNs ends up at runNow - maxTTL — events that the per-shard drain still needs are filtered out before the Reader even forwards them. Same regression I fixed in `6796ab6db` for the per-shard subscription; now applied at the shared level. computeGlobalStartTsNs loads every shard's cursor and picks the minimum, falling back to the cold-start floor only for shards with no persisted cursor.	2026-05-13 02:13:11 -07:00
Chris Lu	b1d59b04a8	fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477 ) * fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466: 1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output. 2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs. * fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path) runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because: - ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty. - The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function. Move the walker invocation INTO the empty-replay branch so walker- only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests. * fix(s3/lifecycle): cold-start walker covers pre-existing objects runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151). * fix(s3/lifecycle): restore walker tag and null-version state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): parallelize shell shard sweeps Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard Two CI bugs surfaced after PR #9466 deleted the streaming worker: 1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx. 2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes. * fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log. * fix(s3/lifecycle): trust persisted cursor; never bump past pending events The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 00:19:05 -07:00
Chris Lu	7b8647e8bc	fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476 ) The s3tests workflow (.github/workflows/s3tests.yml) backgrounds `weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'` and then runs `kill -0 $pid` to confirm the worker stayed alive. The PR-9475 restore ran dailyrun.Run once and exited cleanly — even faster when no buckets had lifecycle rules yet ("nothing to run"). The aliveness check then failed and the s3tests job died with "lifecycle worker died on startup". Caught on https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401. Fix: - -refresh now drives an inter-pass loop. cadence=0 (default) is one-shot, matching the test/s3/lifecycle/ integration-test invocation that omits -refresh and expects synchronous return. cadence>0 (the CI case) keeps the command alive until -runtime expires, running a fresh dailyrun.Run on every tick. - Each iteration re-loads bucket configs via scheduler.LoadCompileInputs so rules created mid-run (the s3tests flow creates rules AFTER the worker starts) get picked up. - The "no rules; nothing to run" early return is gone — the command stays alive even with an empty initial snapshot, waiting for tests to add rules. - -dispatch, -checkpoint, -bootstrap-interval stay accepted-but- ignored (legacy streaming flags).	2026-05-12 19:19:46 -07:00
Chris Lu	4ce027c2f3	fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475 ) fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the premise that it was a debug-only tool. It wasn't: the s3tests CI workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/ integration tests invoke it via `weed shell` to drive lifecycle expirations on demand. Both started failing with "unknown command: s3.lifecycle.run-shard". This PR restores the command with the same flag set so existing callers (CI scripts and integration tests) work unchanged. The implementation no longer drives the streaming dispatcher.Pipeline + scheduler.BucketBootstrapper (deleted) — instead it does one bounded dailyrun.Run pass through the same daily-replay code path the production worker exercises. The walker fires for walker-bound rules just like in the worker. Obsolete streaming flags (-dispatch / -checkpoint / -refresh / -bootstrap-interval) are accepted-but-ignored so existing scripts don't need to drop them.	2026-05-12 18:29:18 -07:00
Chris Lu	ce5768fab1	feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash (#9473 ) * feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash dailyrun.Config.RetentionWindow has been wired since Phase 4b but the handler never supplied a value, so runShard always fell back to maxTTL and engine.PromotedHash hashed nothing. The partition-flip recovery trigger was dormant by design "until the handler plumbs the real meta-log retention here." This PR plumbs it via a new admin form field: Meta-Log Retention (days) — 0 = unbounded (current behavior). When set, ParseConfig converts days to a time.Duration on cfg.MetaLogRetention. The handler passes it as dailyrun.Config.RetentionWindow, which runShard then feeds to engine.PromotedHash. Rules whose TTL exceeds the declared window land in the walk partition; the next time an operator shrinks retention so a previously replay-eligible rule slips past it, PromotedHash mismatches → recovery branch fires → walker re-evaluates the rule across the whole filer tree. 0 stays the default, so existing deployments see no behavior change. * chore(s3/lifecycle): rephrase days->duration conversion gemini-code-assist flagged the original form as a compile error, which it wasn't (time.Duration is a named int64 and supports * with other time.Durations — the test suite verified the value was correct). The suggested form is more idiomatic regardless: days*24 happens in int64 space before the lift to time.Duration, so the unit is unambiguous.	2026-05-12 18:26:52 -07:00
Chris Lu	f51468cf73	Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474 ) Revert "master: bind heartbeat claims to the connecting peer (#9443)" This reverts commit `f28c7ce6df`. The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects every hostname-based deployment. In docker-compose / k8s the volume server is started with -ip=<service-name> and the gRPC peer surfaces as the container/pod IP, so the two never match and every heartbeat fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`. The master therefore never learns about any volume, growth fails, and fio writes against the mount return EIO. After the #9440 revert merged (`43a8c4fdc`), the e2e workflow is still failing for this reason; see https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 . Reverting to unblock e2e. A narrower re-do should accept the heartbeat when heartbeat.Ip resolves (DNS) to the peer address, so the spoof hardening can return without breaking hostname-based clusters.	2026-05-12 18:22:21 -07:00
Chris Lu	43a8c4fdca	Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472 ) * Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)" This reverts commit `21054b6c18`. The fail-closed gate broke any multi-host cluster: in compose / k8s / remote-host deployments the master's IP isn't loopback, so every master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute, vacuum, scrub, ...) is rejected with PermissionDenied unless the operator manually configures -whiteList. The e2e workflow has been failing since `10cc06333` with `not authorized: 172.18.0.2` on AllocateVolume; downstream symptom is fio fsync EIO because zero volumes can be grown. The gate's intent was to lock down destructive admin tooling, but the same RPCs are the master's normal mechanism for growing and managing volumes. Reverting to restore cluster-internal operation; a narrower re-do should distinguish operator/admin callers from the master peer (e.g. trust IPs resolved from -master) before going back in. * security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic The revert in the previous commit also rolled back an unrelated bug fix that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error but did not continue, so the nil *net.IPNet was stored in whiteListCIDR and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on the next gRPC admin check. Restore the continue. Orthogonal to the fail-closed semantics this PR is reverting.	2026-05-12 16:00:44 -07:00
Chris Lu	f28c7ce6df	master: bind heartbeat claims to the connecting peer (#9443 ) SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on the wire. Three changes tighten that: - Reject heartbeats whose Ip does not match the gRPC peer's source address. Loopback peers are still trusted; operators behind a proxy can opt out with -master.allowUntrustedHeartbeat. - Track which (ip, port) first claimed a volume id or an ec shard slot and drop foreign re-claims. Non-EC volume claims are bounded by the replica copy count so legitimate replicas still register. EC ownership is keyed by (vid, shard_id) so the same vid can legitimately be split across many peers as long as their EcIndexBits are disjoint; rejected bits are cleared from the bitmap and the parallel ShardSizes array is compacted in lock-step. - Maintain reverse indexes owner -> volumes and owner -> ec shard slots so disconnect cleanup is O(M) in what that peer held rather than O(N) over the whole map. Bindings are also released when a heartbeat reports that the peer no longer holds an id, either via explicit Deleted{Volumes,EcShards} entries or by omitting it from a full snapshot. Without this, a planned rebalance that moved a vid or an ec shard from peer A to peer B would leave B's heartbeats permanently filtered out until A disconnected, breaking ec encode/decode flows that delete shards on the source as soon as the move completes. The (vid -> owners) binding still does not track which replica slot each peer occupies, so the first N claims under the copy count win; strict per-slot mapping is a follow-up.	2026-05-12 15:38:52 -07:00
Chris Lu	10cc06333b	cluster: restrict Ping RPC to known peers of the requested type (#9445 ) Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master.	2026-05-12 13:00:52 -07:00
Chris Lu	5004b4e542	feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466 ) * feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) Phase 5a (PR #9465) retired the algorithm flag and made daily_replay the only execution path. The streaming-side code (scheduler.Scheduler, scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher, dispatcher.FilerPersister, and their tests) has had no in-tree caller since then. This PR deletes it. Net change: ~4800 lines removed, ~130 added (the scheduler/configload tests' helper file the deleted bootstrap_test.go used to host). Removed: - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test, scheduler,scheduler_test,pipeline_fanout_test, refresh_default,refresh_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test, dispatcher_helpers_test,edge_cases_test,multi_shard_test, pipeline,pipeline_test,pipeline_helpers_test,toproto_test, dispatch_ticks_default,dispatch_ticks_s3tests}.go - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go (FilerPersister deleted; FilerStore tests don't need their own file) - weed/shell/command_s3_lifecycle_run_shard{,_test}.go (debug-only shell command that only ever wrapped the streaming pipeline; the production worker now exercises the same path every daily run) Trimmed: - dispatcher/filer_persister.go down to FilerStore + NewFilerStoreClient — the small interface daily_replay's cursor persister (dailyrun.FilerCursorPersister) plugs into. Kept (still consumed by daily_replay): - scheduler/configload.{go,_test.go} (LoadCompileInputs, AllActivePriorStates) - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister, FilerSiblingLister) - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient) scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream, dirEntry, fileEntry — helpers the configload tests used to share with the deleted bootstrap_test.go. Updates the handler-package doc strings and one reader-package comment that still named the streaming pipeline. * fix(s3/lifecycle): hold lock through tree read in test filer client gemini caught an inconsistency in scheduler/testhelpers_test.go: LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was releasing the lock before reading c.tree. The map is effectively static during tests so there's no actual race today, but matching the convention keeps the helper safe if a future test mutates the tree mid-run.	2026-05-12 12:54:52 -07:00
Chris Lu	745e864bda	feat(s3/lifecycle): retire algorithm flag, daily_replay is the only path (Phase 5a) (#9465 ) feat(s3/lifecycle): remove algorithm flag, daily_replay is the only path (Phase 5a) With Phase 4b on master the daily_replay path covers every rule kind and the streaming algorithm has no remaining responsibilities. This PR retires the algorithm flag from the worker: - Drop the "Algorithm" enum field from AdminConfigForm and its DefaultValues entry. - Drop the if/else routing in Execute — every Execute call now routes straight into executeDailyReplay. - Drop the streaming-only worker fields (DispatchTick, CheckpointTick, RefreshInterval, BootstrapInterval) and their matching form fields. None of them are read by the daily_replay path; keeping them in the form would suggest tuning knobs that don't do anything. - Drop AlgorithmStreaming / AlgorithmDailyReplay constants and the Config.Algorithm field. The streaming-path packages (s3lifecycle/scheduler, s3lifecycle/dispatcher) remain on the tree; they're now reachable only by the weed shell s3.lifecycle.run-shard debug command and the few helpers (LoadCompileInputs, FilerStore, FilerSiblingLister) the daily_replay worker still uses. Phase 5b deletes the dead code. Tests prune the cadence-default assertions to the single remaining field (max_runtime_minutes).	2026-05-12 12:39:37 -07:00

1 2 3 4 5 ...

13866 Commits