Mirror of https://github.com/seaweedfs/seaweedfs.git, synced 2026-05-17 15:21:31 +00:00
Branch: 4.24 — 13846 commits

4bac9985b4 (4.24)
fix(build): pin apache/thrift to v0.22.0 for 32-bit GOARCH
thrift v0.23.0 uses math.MaxUint32 as an untyped int constant in lib/go/thrift/framed_transport.go:206, which overflows int on 32-bit targets (openbsd/arm, linux/arm, freebsd/arm, netbsd/arm) and breaks the release binary builds.

1dfea8a502 (4.24)

3a8389cd68
fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

  Before this change, both the worker EC task and the shell ec.encode command would delete the source .dat as soon as MountEcShards returned, even if distribute/mount failed partway, leaving fewer than 14 shards in the cluster. The deletion was logged at V(2), so by the time someone noticed missing data the only trace was a 0-byte .dat synthesized by disk_location at the next restart.

  - Worker path adds Step 6: poll VolumeEcShardsInfo on every destination, union the bitmaps, and refuse to call deleteOriginalVolume unless all TotalShardsCount distinct shard ids are observed. A failed gate leaves the source readonly so the next detection scan can retry.
  - Shell ec.encode adds the same gate after EcBalance, walking the master topology with collectEcNodeShardsInfo.
  - VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any source destruction is traceable in default-verbosity production logs.

  The EC-balance-vs-in-flight-encode race is intentionally left for a follow-up; balance should refuse to move shards for a volume whose encode job is not in the Completed state.

* fix(ec): trim doc comments on the new shard-verification path

  Drop WHAT-describing godoc on freshly added helpers; keep only the WHY notes (query-error policy in VerifyShardsAcrossServers, the #9490 reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

  Issue references age poorly; the why behind each comment already stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

  Take totalShards as an argument instead of reading the package-level TotalShardsCount constant. The OSS callers continue to pass 14, but the helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

  The new pre-delete verification gate calls VolumeEcShardsInfo on every destination after mount, and the fake server's UnimplementedVolumeServer returns Unimplemented; the verifier read that as zero shards on every node and aborted source deletion. Build the response from recorded mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

  Mirror the Go-side change in weed/storage/volume_write.go: stat each file before removing and emit an info-level log for .dat/.idx so a destructive call is always traceable. The OSS Rust crate previously unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

  After mountDecodedVolume succeeds, the previous code immediately unmounted and deleted every EC shard. A silent failure in generate or mount could leave the cluster with neither shards nor a valid normal volume. Probe ReadVolumeFileStatus on the target and refuse to proceed if dat or idx is 0 bytes. Also make the fake volume server's VolumeEcShardsInfo reflect whichever shard files exist on disk (seeded for tests as well as mounted via RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

  - Drop unused ServerShardInventory.Sizes field.
  - Skip shard ids >= MaxShardCount before the bitmap Set so the ShardBits bound is explicit (Set already no-ops on overflow; this is for clarity).
  - Nil-guard the fake server's VolumeEcShardsInfo so a malformed call doesn't panic the test process.
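The pre-delete gate boils down to a bitmap union across servers; a minimal sketch, with hypothetical types and signature loosely following the RequireFullShardSet(totalShards) helper named in the commit:

```go
package main

import "fmt"

// requireFullShardSet unions the shard ids reported by every destination
// server and refuses deletion unless all totalShards distinct ids are
// observed. Out-of-range ids are skipped, mirroring the commit's explicit
// MaxShardCount bound. Hypothetical sketch, not the repo's actual helper.
func requireFullShardSet(perServer [][]uint32, totalShards int) bool {
	var seen uint64 // bitmap of observed shard ids (totalShards <= 64)
	for _, ids := range perServer {
		for _, id := range ids {
			if int(id) < totalShards {
				seen |= 1 << id
			}
		}
	}
	return seen == (uint64(1)<<uint(totalShards))-1
}

func main() {
	// 14 shards split across two servers: complete set, safe to delete source.
	a := []uint32{0, 1, 2, 3, 4, 5, 6}
	b := []uint32{7, 8, 9, 10, 11, 12, 13}
	fmt.Println(requireFullShardSet([][]uint32{a, b}, 14)) // true
	fmt.Println(requireFullShardSet([][]uint32{a}, 14))    // false: shards 7-13 missing
}
```

A failed gate in the real change leaves the source volume readonly so a later scan can retry, rather than deleting on partial evidence.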

0dde6a8c84
refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout (#9494)
* refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout

  "Per-Run Time Limit (minutes)" duplicated the admin scheduler's "Execution Timeout (s)": both are wall-clock caps on the same Execute call, stacked via context.WithTimeout. Whichever was shorter won. Under defaults the scheduler's 90s timeout always clobbered the worker's 60-min cap, so the "Per-Run Time Limit" knob was effectively dead unless an operator also raised Execution Timeout, and operators had to keep two values in agreement.

  Remove the worker-side knob and declare a sane scheduler default on the handler descriptor:
  - WorkerConfigForm: nil (was: one section with one field)
  - Config.MaxRuntime removed; ParseConfig drops max_runtime_minutes
  - Handler no longer wraps ctx in context.WithTimeout(MaxRuntime); runCtx is just the ctx the scheduler passes
  - AdminRuntimeDefaults.ExecutionTimeoutSeconds = 3600 (1h) and JobTypeMaxRuntimeSeconds = 3600; the scheduler's global 90s default would otherwise kill every real run

  Tests:
  - TestParseConfigDefaults loses the MaxRuntime check; new TestParseConfigIgnoresWorkerValues documents the contract
  - TestDescriptor_WorkerConfigFormIsAbsent pins that the form is gone so a future re-add forces a conscious revisit
  - TestDescriptor_AdminRuntimeDefaultsBoundExecutionTimeout pins the 1h default with a comment about the 90s scheduler floor

* fix(s3/lifecycle): no per-pass timeout by default

  Lifecycle is a scheduled batch; its natural duration is "as long as today's events take." The 1h default ExecutionTimeoutSeconds from the previous commit was still a footgun: too low truncates legitimate large-bucket passes; too high makes the value meaningless. Set both ExecutionTimeoutSeconds and JobTypeMaxRuntimeSeconds to math.MaxInt32 (~68 years) to say "no timeout in practice" in a code-review-readable way.

  Operators who genuinely want a wall-clock cap can set one in the admin UI; the scheduler's context.WithTimeout machinery is unchanged (we just hand it an effectively-infinite duration).

  Note: the scheduler floors ExecutionTimeout at 90s (defaultScheduledExecutionTimeout in weed/admin/plugin/plugin_scheduler.go), so 0 doesn't mean "unlimited"; it clamps back to 90s. A literal math.MaxInt32 is the way to express the intent without touching the shared scheduler code.

  Test updated to pin math.MaxInt32 and document the rationale so a future tighter cap fails the test and forces a conscious revisit.

e9bcb8f4ad
docs(s3/lifecycle): refresh DESIGN.md as-built (#9491)
* docs(s3/lifecycle): refresh DESIGN.md as-built + add wiki pages
DESIGN.md was written as a phased implementation plan ("Phase 2 will
ship X, Phase 4 will ship Y"). All phases are now merged, and the
post-cutover changes from #9477/#9481/#9484/#9485/#9486 substantially
changed the worker model (single subscription, walker throttle,
observability gauges). Rewrite the doc in present tense describing
what's actually there.
Net changes vs the prior plan-style doc:
- Algorithm pseudo-code reflects the single-subscription fan-out plus
walkedThisPass within-pass guard.
- Walker invocation table replaces the implicit "two distinct calls"
prose with three call sites (recovery / steady-state / empty-replay)
and their throttle gates.
- New section on the subscription model (one Reader, ShardPredicate,
fan-out by ev.ShardID).
- New section on cursor.LastWalkedNs and the WalkerInterval throttle.
- Observability section: gauges, heartbeat tokens, what each means.
- "Implementation history" table maps phases to merged PRs.
- "Future work" lists the four optimizations we deferred (long-lived
subscription, bucket-coordinated walker, per-bucket lag metric,
filer meta-log retention).
Drop the "Phase N — ..." narrative from the bottom; the PR history
table is the durable artifact now.
Add wiki pages under docs/wiki/s3-lifecycle/ as source-of-truth for
the operator-facing docs. README explains the sync workflow with the
external seaweedfs.wiki.git repo. Five pages:
- Home.md — landing page, supported rule shapes, what the worker does
- Operator-Guide.md — config knobs, when to change each, walker
interval recommendations by cluster size
- Monitoring.md — Prometheus metric reference + heartbeat token table
+ suggested PromQL alerts
- Troubleshooting.md — stuck cursor, walker stuck, failure outcomes,
cursor schema for manual inspection
- Architecture.md — high-level overview for newcomers; sits between
Home.md (operator) and DESIGN.md (developer)
* docs(s3/lifecycle): address PR review feedback on docs
Coderabbit + gemini findings on #9491:
- Monitoring.md: clarify the "matches all dispatched" phrasing; note
that LIFECYCLE_DELETE_OUTCOME_UNSPECIFIED is the proto zero-value
(shouldn't appear in healthy systems); filter PromQL alerts to
ignore zero-valued gauges so fresh-install heartbeats don't trip.
- Operator-Guide.md, Troubleshooting.md: clarify weed shell -master
format as host:http_port.grpc_port (SeaweedFS ServerAddress).
- Troubleshooting.md: pause the s3_lifecycle job in the admin UI
before manually editing a cursor file, otherwise the worker's
save races with the operator's edit.
- Architecture.md, Home.md, Operator-Guide.md, Monitoring.md,
Troubleshooting.md, DESIGN.md: add language tags (`text`) to
fenced code blocks for markdownlint MD040 compliance.
- DESIGN.md: standardize on the S3 spec rule names
(`ExpiredObjectDeleteMarker`, `NewerNoncurrentVersions`,
`AbortIncompleteMultipartUpload`) and add a one-line note mapping
them to the engine's `ActionKind*` constants.
- README.md: prepend `cd "$(git rev-parse --show-toplevel)"` to the
sync workflow so the `cp` commands' repo-root-relative paths work
whether the operator's shell is at the repo root or at
docs/wiki/s3-lifecycle/.
- Home.md: was lagging the wiki-repo merged version (had the older
pre-merge content). Re-sync from the wiki repo so source matches.
* docs(s3/lifecycle): remove wiki pages from PR
The wiki pages belong in seaweedfs.wiki.git, not the main repo. The
source-of-truth concern that motivated adding them here is real but
the cost — every code-review touchpoint requires reviewers to load
operator-facing pages too — outweighs it. The wiki pages are already
pushed locally (~/dev/seaweedfs.wiki); they'll publish on the
operator-side workflow.
This PR remains scoped to DESIGN.md (the developer-facing reference
that does belong with the code).
* docs(s3/lifecycle): drop Implementation history section
git log is the durable record of what shipped when; the prose table
duplicates it and goes stale faster than commit metadata.
* docs(s3/lifecycle): soften 'exactly once per run' in Goal
The prior phrasing overstated the guarantee versus the failure model
documented later in the same file. Reword to: 'process due objects
each pass; retryable/blocked outcomes get retried from the cursor on
later runs.' Surfaces the head-of-line-blocking semantics up front so
the rest of the doc reads consistently.
Also: drop the stale 'see docs/wiki/s3-lifecycle/' pointer — those
pages live in the wiki repo, not the main repo.

813f1351f8
feat(s3/lifecycle): enable scheduler by default (#9492)
S3 lifecycle is a standard bucket feature: operators set PutBucketLifecycleConfiguration through the S3 API expecting the configured expirations to actually fire. With the prior default (scheduler enabled=false), buckets with lifecycle XML silently retained data past their declared expiration until an operator noticed and turned the scheduler on.

The failure mode of enabled-by-default is "worker runs every day and fast-exits on buckets with no lifecycle rules", which is cheap. The failure mode of disabled-by-default is "data lingers, looks like it expired, doesn't", which is bad. Enabled-by-default matches both the AWS S3 default behavior and the operator's natural mental model.

Operators who want the worker off can still disable it via the admin UI; once a persisted config exists, this descriptor default no longer applies (the persisted Enabled state wins). A test pins the choice so a future flip to false fails loud.

453c735d02
build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 in /test/kafka (#9489)
Bumps [github.com/go-git/go-billy/v5](https://github.com/go-git/go-billy) from 5.8.0 to 5.9.0.
- [Release notes](https://github.com/go-git/go-billy/releases)
- [Commits](https://github.com/go-git/go-billy/compare/v5.8.0...v5.9.0)

---
updated-dependencies:
- dependency-name: github.com/go-git/go-billy/v5
  dependency-version: 5.9.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

d5e54f217d
feat(s3/lifecycle): publish per-shard cursor + walker gauges and heartbeat (#9486)
Operator visibility was the last item on the daily-replay must-have
list. The `S3LifecycleCursorMinTsNs` gauge already existed but nothing
ever set it — leftover from the streaming worker that got deleted.
Wire it up and add a parallel one for the walker so a single PromQL
query answers "is this thing working?":
- `cursor_min_ts_ns{shard}` set after each cursor save. Operators read
`now - cursor_min_ts_ns` as the per-shard replay lag.
- `daily_run_last_walked_ns{shard}` new — set in parallel so operators
can confirm WalkerInterval is actually being honored. A stuck value
means the scheduler isn't invoking the worker, the throttle is too
long, or the walker is failing.
- saveCursorAndPublish wraps every Save call site in runShard so the
gauges and the persisted state stay aligned (gauges only advance on
successful saves).
- Enhance the `daily_run: status=... duration=...` heartbeat with
`cursor_lag_max=` and `walked_max_age=` summary tokens for ops grep.
Existing tokens stay positional-stable; new ones append at the end.
Marker `cold` distinguishes "not started" from "0s caught up."
Tests pin the summary line: cold-start state, max-across-shards
selection, and partial-fill (some shards drained, others walked).
Stacked on #9485.

bbc075b353
feat(s3/lifecycle): plumb WalkerInterval through worker admin config (#9485)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

  The steady-state and empty-replay walker fired on every dailyrun.Run invocation, which is fine when Run is called at the bucket-walk cadence the operator intends (e.g., once per hour or once per day), but catastrophic when a fast driver like the s3tests CI workflow or the admin worker scheduler invokes Run at multi-second cadence: each tick ran a full subtree scan per shard, crushing the filer.

  Decouple walker cadence from Run() invocation cadence: persist LastWalkedNs in the per-shard cursor and fire the steady-state / empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval. Cold-start and recovery walker fires (RecoveryView) stay unconditional since those are bounded events that must run when their trigger condition (no cursor, hash mismatch) is met. Recovery walker fires also update LastWalkedNs so the subsequent steady-state pass doesn't double-walk.

  cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior, so the in-repo integration tests and the s3tests fast driver continue to work unchanged. Production deployments should set this to the walk cost budget (typically 1h-24h depending on cluster size).

  Cursor file is back-compat: last_walked_ns is omitempty, so cursor files written before this change decode as LastWalkedNs=0, which walkerDue treats as "never walked steady-state", so the walker fires next pass to establish the anchor (same path a cold-start cursor takes). No version bump.

  Operator surface for WalkerInterval is the dailyrun.Config struct; plumbing through worker.tasks.s3_lifecycle.Config and the admin schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

  Two gemini-code-assist findings:

  1. walkerDue with interval=0 returned true even when lastWalkedNs == runNow.UnixNano(): the cold-start / recovery branch already fired the walker this pass, and the steady-state fall-through fired it again. RecoveryView is a superset of every per-shard partition, so the second walk added zero coverage and burned a full subtree scan. Add a within-pass guard at the front of walkerDue: if the cursor's LastWalkedNs equals runNow's UnixNano, the walker already ran this pass, so skip.

  2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue instead of the local lastWalkedNs variable the rest of runShard threads through. Trivially equal at this point in the function, but the inconsistency would mask a future bug if any code above the branch ever sets lastWalkedNs.

  Test updates: TestWalkerDue gains the within-pass guard case plus a companion "earlier same pass still fires" sanity check. TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration: cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

  Two coderabbit findings:

  1. validate() now rejects negative cfg.WalkerInterval. A typo like -1h previously fell through walkerDue's `interval <= 0` branch and silently re-enabled "walk every pass", the exact behavior the throttle was added to prevent. The admin-config parser already clamps negative input to zero, but callers using dailyrun.Config directly (tests, embedders) now get a loud error instead.

  2. Within-pass double-fire suppression moves out of walkerDue and into runShard's walkedThisPass local flag. walkerDue's equality check (lastWalkedNs == runNow.UnixNano) was correct in production (each pass freezes runNow at time.Now().UTC, no collisions) but fragile in tests that inject the same runNow across distinct passes: the test would see false suppression. Separating the concerns also makes walkerDue answer one question (persisted-state throttle) and runShard another (within-pass call-site dedup).

  walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval pins the new validation. TestWalkerDue's within-pass cases move out (the function is pure throttle now); TestRunShard_ColdStartDoesNotDoubleWalk still pins the integration behavior end-to-end.

* feat(s3/lifecycle): plumb WalkerInterval through worker admin config

  #9484 added cfg.WalkerInterval to dailyrun.Config but left the worker side wired to zero, so operators couldn't actually use the throttle without recompiling. Add the admin-schema knob:

  - New constant WalkerIntervalMinutesAdminKey = "walker_interval_minutes" follows the MetaLogRetentionDaysAdminKey pattern (Int64, minutes unit, 0 = unbounded / fire every pass).
  - New Config.WalkerInterval populated in ParseConfig from adminValues; negative / zero stay at zero so the prior "fire every pass" semantics keep the in-repo integration tests and the s3tests sub-minute driver working unchanged.
  - handler.go: admin form field with operator-facing label and description, default in DefaultValues, value forwarded to dailyrun.Run via cfg.WalkerInterval.

  Tests cover the default-zero, positive, and negative cases, the same shape as the MetaLogRetention tests so the parsing contract stays consistent.

  Stacked on #9484; rebase after that lands.

c6582228b8
feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval (#9484)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

  The steady-state and empty-replay walker fired on every dailyrun.Run invocation, which is fine when Run is called at the bucket-walk cadence the operator intends (e.g., once per hour or once per day), but catastrophic when a fast driver like the s3tests CI workflow or the admin worker scheduler invokes Run at multi-second cadence: each tick ran a full subtree scan per shard, crushing the filer.

  Decouple walker cadence from Run() invocation cadence: persist LastWalkedNs in the per-shard cursor and fire the steady-state / empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval. Cold-start and recovery walker fires (RecoveryView) stay unconditional since those are bounded events that must run when their trigger condition (no cursor, hash mismatch) is met. Recovery walker fires also update LastWalkedNs so the subsequent steady-state pass doesn't double-walk.

  cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior, so the in-repo integration tests and the s3tests fast driver continue to work unchanged. Production deployments should set this to the walk cost budget (typically 1h-24h depending on cluster size).

  Cursor file is back-compat: last_walked_ns is omitempty, so cursor files written before this change decode as LastWalkedNs=0, which walkerDue treats as "never walked steady-state", so the walker fires next pass to establish the anchor (same path a cold-start cursor takes). No version bump.

  Operator surface for WalkerInterval is the dailyrun.Config struct; plumbing through worker.tasks.s3_lifecycle.Config and the admin schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

  Two gemini-code-assist findings:

  1. walkerDue with interval=0 returned true even when lastWalkedNs == runNow.UnixNano(): the cold-start / recovery branch already fired the walker this pass, and the steady-state fall-through fired it again. RecoveryView is a superset of every per-shard partition, so the second walk added zero coverage and burned a full subtree scan. Add a within-pass guard at the front of walkerDue: if the cursor's LastWalkedNs equals runNow's UnixNano, the walker already ran this pass, so skip.

  2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue instead of the local lastWalkedNs variable the rest of runShard threads through. Trivially equal at this point in the function, but the inconsistency would mask a future bug if any code above the branch ever sets lastWalkedNs.

  Test updates: TestWalkerDue gains the within-pass guard case plus a companion "earlier same pass still fires" sanity check. TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration: cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

  Two coderabbit findings:

  1. validate() now rejects negative cfg.WalkerInterval. A typo like -1h previously fell through walkerDue's `interval <= 0` branch and silently re-enabled "walk every pass", the exact behavior the throttle was added to prevent. The admin-config parser already clamps negative input to zero, but callers using dailyrun.Config directly (tests, embedders) now get a loud error instead.

  2. Within-pass double-fire suppression moves out of walkerDue and into runShard's walkedThisPass local flag. walkerDue's equality check (lastWalkedNs == runNow.UnixNano) was correct in production (each pass freezes runNow at time.Now().UTC, no collisions) but fragile in tests that inject the same runNow across distinct passes: the test would see false suppression. Separating the concerns also makes walkerDue answer one question (persisted-state throttle) and runShard another (within-pass call-site dedup).

  walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval pins the new validation. TestWalkerDue's within-pass cases move out (the function is pure throttle now); TestRunShard_ColdStartDoesNotDoubleWalk still pins the integration behavior end-to-end.

75c807b586
chore(weed/mq/kafka/protocol): remove unused functions and variables (#9488)

d5c0a7b153
fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

  LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume builds Location with only Url / PublicUrl / DataCenter, so callers fall back to ServerToGrpcAddress (httpPort + 10000). On any deployment where that convention does not hold (multi-disk integration tests, custom port layouts), EC reads dial the wrong port and quietly degrade to parity recovery.

* fix(volume/ec): probe every DiskLocation when serving local shard reads

  reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the DiskLocation that physically owns it, so a multi-disk volume server can hold shards for the same vid in two separate ecVolumes, one per disk, with .ecx on whichever disk owned the original .dat. The read path only consulted the single EcVolume FindEcVolume picked, so requests for shards on the sibling disk fell through to errShardNotLocal and then to remote/loopback recovery. Walk all DiskLocations after the first probe in both readLocalEcShardInterval and the VolumeEcShardRead gRPC handler; the latter also covers the loopback that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

  Two integration tests against a real volume server with two data dirs:

  TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read -> drop .dat -> stop -> redistribute shards across disks -> restart -> verify reconcileEcShardsAcrossDisks attached the orphan shards and reads still work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild pulls input from both disks -> reads still work.

  TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478 reproducer at the cluster level: seed a healthy .dat on disk 0, plant the on-disk footprint of an interrupted EC encode on disk 1, restart, and assert pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

  The framework gets RestartVolumeServer / StopVolumeServer helpers; the previous run's volume.log is rotated to volume.log.previous so a startup regression on the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

  gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent mount / unmount could race with it. findEcShard already walks every DiskLocation under the right lock, so the fast-path adds nothing but the race. Collapse both call sites to a single locked call.

  Also note in RestartVolumeServer why the log-rotation error is swallowed: absence on first call is benign; anything else surfaces in the next os.Create in startVolume.
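The locked all-disks probe can be sketched with simplified stand-in types (hypothetical structures, not the volume server's real ones):

```go
package main

import (
	"fmt"
	"sync"
)

// Sketch of the read-path fix: instead of consulting only the first
// EcVolume a lookup returns, walk every DiskLocation under the lock and
// return the first one that actually holds the requested shard, so shards
// registered against a sibling disk are still served locally.
type shardKey struct {
	vid     uint32
	shardID uint8
}

type diskLocation struct {
	shards map[shardKey]string // shard -> file path on this disk
}

type store struct {
	mu        sync.RWMutex
	locations []*diskLocation
}

func (s *store) findEcShard(vid uint32, shardID uint8) (string, bool) {
	s.mu.RLock() // single locked walk: no unlocked fast-path to race with mount/unmount
	defer s.mu.RUnlock()
	for _, loc := range s.locations {
		if p, ok := loc.shards[shardKey{vid, shardID}]; ok {
			return p, true
		}
	}
	return "", false
}

func main() {
	s := &store{locations: []*diskLocation{
		{shards: map[shardKey]string{{7, 3}: "/disk0/7.ec03"}},
		{shards: map[shardKey]string{{7, 9}: "/disk1/7.ec09"}}, // sibling disk
	}}
	p, ok := s.findEcShard(7, 9)
	fmt.Println(p, ok) // /disk1/7.ec09 true
}
```

The single locked walk is exactly the shape the final review commit converges on: the unlocked fast-path added nothing but a race.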

79859fc21d
feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

  Three diagnostic additions on top of #9460, all aimed at making the next production incident faster to triage than the one we just spent hours on.

  1. [versioning-heal] grep prefix on every heal-related log line, with a small fixed event vocabulary (produced / surfaced / healed / enqueue / drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed / teardown_failed / queue_full). One grep gives operators a single event stream across the produce-to-drain lifecycle.

  2. Escalate the "scanned N>0 entries but no valid latest" case in updateLatestVersionAfterDeletion from V(1) Infof to a Warning that names the orphan entries it saw. This is the listing-after-rm inconsistency signature that pinned down 259064a8's failure; it should not be invisible at default log levels.

  3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]` that walks .versions/ directories under a prefix and reports the stranded population. With -heal it clears the latest-version pointer in place on stranded directories so subsequent reads return a clean NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

  Address PR review:

  1. s3.versions.audit walked only the first 1024-entry page of each .versions/ directory, false-positiving "stranded" on large dirs. Loop until the page returns < 1024 entries, advancing startName.

  2. The clean and orphan-only categories double-counted when a directory had no pointer and at least one orphan: both were incremented. Make them mutually exclusive so report totals sum to versionsDirs.

  3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep, non-interruptible by ctx. A server shutdown / client disconnect would wait out the budget per in-flight delete. Thread ctx through deleteSpecificObjectVersion -> repointLatestBeforeDeletion / updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context(); gRPC lifecycle handlers pass the stream ctx. A new test pins the behavior: cancelling ctx mid-backoff returns ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

  Two coderabbit follow-ups:

  1. A successful pointer clear should suppress `produced`. updateLatestVersionAfterDeletion's transient-rm fallback called clearStaleLatestVersionPointer best-effort, then unconditionally returned retryErr. The caller (deleteSpecificObjectVersion) saw the error and emitted `event=produced` + enqueued the reconciler, even though clearStaleLatestVersionPointer had just driven the pointer to consistency and the next reader would get NoSuchKey via the clean-miss path. Make clearStaleLatestVersionPointer return cleared bool; on success the caller returns nil so neither produced nor the reconciler enqueue fires. Concurrent-writer aborts, re-scan errors, and CAS mismatches still report false so genuinely stranded state keeps surfacing.

  2. Escape user-controlled fields in heal log lines. versioningHealInfof / Warningf / Errorf interpolated raw bucket / key / filename / err text into a single-space-separated line. An S3 key (or error string from gRPC) containing whitespace, newlines, or `event=...` could split one event into multiple tokens and spoof fake fields downstream. Sanitize each arg in the helper: safe values pass through; anything with whitespace, quotes, control chars, or backslashes is replaced with its strconv.Quote form. No caller changes; the format strings remain unchanged.

  Tests pin both behaviors: the sanitization table covers the field boundary cases; an end-to-end shape test confirms a key containing `event=spoof` stays inside a single quoted token.

e025ec2334
fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends (#9483)
* fix(volume): seed indexFileOffset in SortedFileNeedleMap so Delete appends

  NewSortedFileNeedleMap never initialised the inherited baseNeedleMapper.indexFileOffset, so it stayed at zero. The compact and leveldb constructors both seed it from stat().Size(); this one did not.

  When a read-only or noWriteCanDelete volume processed a Delete, the inherited appendToIndexFile wrote the tombstone via WriteAt at indexFileOffset=0 and advanced 16 bytes at a time. Every delete since the SortedFileNeedleMap path was first exercised for writes (#5633, which switched .sdx to O_RDWR) overwrote the front of .idx with tombstones for unrelated keys.

  Net effect on disk: the first N entries of .idx become sequentially written tombstones with .dat offsets at the tail, while the original Put records that lived in slots [0..N) are gone. fsck sees ~N phantom orphans whose keys' runtime needle-map entries are already tombstoned in .sdx, so the volume server replies 304 to every purge and the orphan count is stable across retries.

  Stat the .idx at open and seed indexFileOffset = size. The .idx now grows on delete instead of being clobbered. Add a regression test that populates .idx via the compact map, opens it as a SortedFileNeedleMap, deletes one needle, and verifies the file grew by one entry, the original Put records survived, and the tombstone landed at the tail.

  The fix only stops further corruption. .idx files already damaged by this bug have to be rebuilt from .dat before the next restart, or isSortedFileFresh will regenerate .sdx from the bad .idx and propagate the damage.

  Refs #9479

* test: harden SortedFileNeedleMap regression assertions

  Address PR review (gemini-code-assist, coderabbit):
  - Close the writer before t.Fatalf in the Put loop so a failing seed doesn't leak the .idx file descriptor.
  - Verify offset and size of preserved Put records, not just the key. Front-overwrite damage would clobber all three fields, but a key-only check would miss a different regression that corrupted offset/size while leaving the key intact.
f5a4bfb514 |
fix(s3/versioning): repair dangling latest-version pointer after partial delete (#9460)
* fix(s3/versioning): repair dangling latest-version pointer after partial delete
deleteSpecificObjectVersion did two non-atomic filer ops: rm the version blob, then update the .versions/ pointer. Step 2 failures were silently logged and the client got 204 OK, so any transient blip (filer timeout, process restart between RPCs, lock contention) left the .versions/ directory naming a missing file. Subsequent GETs paid the 10-retry self-heal cost and returned NoSuchKey — surfacing as "Storage not found" to Veeam, which is what triggered this investigation.
Three changes:
1. Pre-roll the pointer for the singleton / multi-version-deleting-latest cases. The pointer is repointed (multi) or cleared (singleton) before the blob rm. A failure in between leaves a recoverable orphan blob — pointer is consistent, GETs succeed or correctly miss without entering the stale-pointer self-heal path.
2. Wrap the load-bearing filer ops in updateLatestVersionAfterDeletion with bounded retries (~6.3s worst case). When retries are exhausted the function now returns a non-nil error instead of swallowing it. The caller logs at Error level and queues the path for the reconciler.
3. Background reconciler drains stranded .versions/ pointer-to-missing states off the hot path. Bounded in-memory queue with capped retries; read-path heal remains as a last-resort safety net.
* fix(s3/versioning): address review on #9460
Four fixes addressing review on PR #9460. All four are correctness; no behavioural change for the happy path.
1. repointLatestBeforeDeletion: discriminate NotFound from transient errors when re-fetching the .versions/ entry. Previously any error returned rolled=true,nil — a transient filer hiccup at that point would cause the caller to skip the post-delete reconciliation AND proceed with the blob rm, producing exactly the dangling pointer state the PR aims to prevent. NotFound stays "vacuously consistent" (directory already gone); other errors surface so the caller aborts before removing the blob.
2. Move the singleton .versions/ teardown out of repointLatestBeforeDeletion (where it ran BEFORE the blob rm and always failed with "non-empty folder") into deleteSpecificObjectVersion AFTER the blob rm. Adds a wasSingleton return value so the caller knows when to run the teardown. Without this, every singleton-version delete in a versioned bucket leaked an empty .versions/ directory.
3. Wrap the list, getEntry, and mkFile calls inside repointLatestBeforeDeletion with retryFilerOp so the pre-roll has the same transient-failure resilience as the post-roll path. Without retries, a single transient blip causes the caller to fall back to the legacy non-atomic flow even when the filer recovers immediately.
4. healVersionsPointer in the reconciler: same NotFound-vs-transient discrimination on both the .versions/ getEntry and the latest-file presence probe. Previously a transient filer error would silently evict the candidate from the queue as "healed", leaving the real stranded state until a client read happened to surface it.
Also fixes the gemini-flagged consistency nit: the queued-for-reconciler error log now uses normalizedObject instead of object so it matches the queue entry's key.
* fix(s3/versioning): short-circuit terminal errors in retryFilerOp
Add isRetryableFilerErr that returns false for filer_pb.ErrNotFound, gRPC NotFound, context.Canceled, and context.DeadlineExceeded. retryFilerOp now bails immediately on a terminal error and returns it unwrapped, so callers like repointLatestBeforeDeletion.getEntry and updateLatestVersionAfterDeletion.rm see the raw NotFound instead of paying the ~6.3 s retry-budget delay AND parsing it out of an "exhausted N retries" wrapper. errors.Is and status.Code already walk the %w chain so today's call sites still work, but the delay was real on the hot DELETE path whenever a key was genuinely absent.
Test added covering all five terminal-error shapes — each must run the wrapped fn exactly once and return in under 50 ms.
de28c4df61 |
fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478) (#9480)
* fix(storage): prune partial EC shards when sibling disk has healthy .dat (#9478)
handleFoundEcxFile only checks for .dat in the same disk location as the EC shards. In a multi-disk volume server an interrupted encode can leave .ec?? + .ecx on disk B while the source .dat still lives on disk A: the per-disk loader sees no .dat next to .ecx, mistakes the leftover for a distributed-EC layout, and mounts the partial shards. The volume server then heartbeats both a regular replica and an EC shard for the same vid and the master keeps both.
Sweep the store after per-disk loading and before the cross-disk reconcile to delete partial EC files when a healthy .dat for the same (collection, vid) exists on a sibling disk. Push DeletedEcShardsChan for every pruned shard so master forgets the new-shard message the per-disk pass already emitted, instead of waiting for the next periodic heartbeat.
* fix(seaweed-volume): mirror prune of partial EC with sibling .dat (#9478)
Rust port of the same Store-level prune added to weed/storage. The per-disk EC loader in disk_location.rs only checks for .dat in the same disk as the EC shards, so an interrupted encode that leaves .ec?? + .ecx on disk B while the source .dat sits on disk A is mounted as if it were a distributed-EC layout. The volume server then heartbeats both a regular replica and an EC shard for the same vid.
Sweep the store after per-disk loading and before the cross-disk reconcile, dropping in-memory EcVolumes with fewer than DATA_SHARDS_COUNT shards when a .dat for the same (collection, vid) exists on a sibling disk, and remove all on-disk EC artefacts for them. The Rust heartbeat path already diff-emits deletes from the next ec_volumes snapshot, so no explicit delete-channel push is needed here.
Tests cover both the issue 9478 layout and a distributed-EC layout with no .dat anywhere on the store, which must be left alone.
* fix(storage): validate sibling .dat size before deleting partial EC (#9478)
The earlier prune deleted partial EC files whenever any .dat for the same vid existed on a sibling disk — including a zero-byte shell. A shell is no more useful than the partial shard it would replace, and the partial shard might still combine with shards on other servers in a recoverable distributed-EC layout. Wiping it based on a corrupt sibling .dat is data loss masquerading as cleanup.
Tighten the check: when the EC's .vif recorded a non-zero source size in datFileSize, require the sibling .dat to be at least that many bytes; otherwise fall back to "at least a superblock". The .vif value is what the encoder wrote at the moment the source was sealed, so a sibling .dat smaller than that is provably truncated. Carry the size through indexDatOwners alongside the location.
The Rust port had the same gap and an additional bug behind it: EcVolume::new wasn't reading datFileSize from .vif, so the safety check always fell back to the superblock floor. Wire datFileSize through. The existing shard-size calculation in LocateEcShardNeedleInterval already uses dat_file_size when non-zero, so populating it also matches Go's behaviour there.
Tests cover the truncated-sibling case in both ports.
3f1eaf9724 |
fix(s3/audit): emit audit log for successful GET/HEAD (#9467)
* fix(s3/audit): emit audit log for successful GET/HEAD
Successful GET/HEAD object requests never produced a fluent audit entry because those handlers write the response directly (streaming for GET, WriteHeader for HEAD) and never reach a PostLog call site. The wiki advertises GET as an audited verb, so the asymmetry surprises operators who rely on the log for read-access auditing.
Move the safety net into the track() middleware: tag each request with an audit-tracking flag, let PostLog/PostAccessLog (delete path) mark it, and emit a single fallback entry after the handler returns when nothing fired. The recorder's status flows into the fallback so the audit row still reflects 200/206 vs 404 etc. No double logging for handlers that already emit (write helpers, error paths, bulk delete).
Refs #9463
* fix(s3/audit): defensive nil checks on audit-tracking helpers
Address PR review: guard against nil request and nil *atomic.Bool stored under the audit-tracking key. The conditions are unreachable today (the key is private and we only ever store new(atomic.Bool)), but the checks are free and keep the helpers safe if a future caller misbehaves.
* test(s3/audit): track() audit fallback coverage + stale comment cleanup (#9469)
test(s3/audit): cover track() fallback wiring + cleanup
Adds two unit tests in weed/s3api/stats_test.go that exercise the audit-tracking flag set up by track(): one verifies the fallback path fires when a handler writes the response directly (the GET/HEAD object regression in #9463), the other verifies the flag is set when a handler emits PostLog itself so the fallback is skipped.
To make the wiring observable without standing up fluent, PostLog now marks the audit flag before short-circuiting on a nil Logger; production behavior is unchanged (no logger, no posting) but the flag stays consistent.
Also drops two stale comments in s3api_object_handlers.go that still referenced proxyToFiler — that helper was removed when GET/HEAD started streaming from volume servers directly.
Stacks on #9467.
d5372f9eb7 |
feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster rate.Limiter that processMatches honors. A walker hitting a large bucket on the recovery branch could burst LifecycleDelete RPCs past the cluster_deletes_per_second cap that streaming-replay respects.
WalkerDispatcher now takes a *rate.Limiter and waits on it before each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds just like processMatches does. The handler passes the same limiter to both paths so replay + walk share one budget; nil disables throttling (unchanged default).
Tests pin: the limiter actually delays a dispatch when the burst token is drained, and a ctx cancellation in Limiter.Wait surfaces as an error without sending the RPC.
31c7996671 |
build(deps): bump github.com/go-git/go-billy/v5 from 5.8.0 to 5.9.0 (#9482)
37e505b8fd |
refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass (#9481)
* refactor(s3/lifecycle): one meta-log subscription per dailyrun.Run pass
Per-shard Reader subscriptions multiplied filer load by len(cfg.Shards)
even though the same gRPC stream could serve every shard in a worker
process. Replace with one SubscribeMetadata stream covering all shards
in cfg.Shards: the Reader's ShardPredicate accepts the shard set, and
a fan-out goroutine routes events to per-shard channels by ev.ShardID.
drainShardEvents now reads from a passed-in channel; shards whose
persisted cursor is fresher than the global floor (runNow - maxTTL)
filter ev.TsNs <= startTsNs locally. The fan-out cancels the reader
when the first ev.TsNs > runNow arrives — meta-log order means the
rest of the stream is past the pass boundary too.
cfg.Workers no longer gates shard concurrency: with the shared
subscription, every shard goroutine must be live to drain its channel,
or the fan-out stalls. The field is retained for back-compat and
ignored. Dispatch throttling still goes through cfg.Limiter.
Filer load: 16x -> 1x SubscribeMetadata streams per pass.
* fix(s3/lifecycle): shared subscription floor is min(per-shard cursor)
The shared subscription used runNow - maxTTL as its starting TsNs, but
that's the cold-start floor. For shards whose persisted cursor sits
below the floor — exactly the case a rule with TTL == maxTTL produces,
where a pending event's PUT TsNs ends up at runNow - maxTTL — events
that the per-shard drain still needs are filtered out before the
Reader even forwards them.
Same regression I fixed in
b1d59b04a8 |
fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load
Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466:
1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix".
The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output.
2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs.
* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)
runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because:
- ReplayContentHash only hashes replay-eligible kinds, so walker-only snapshots produce rsh == empty.
- The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function.
Move the walker invocation INTO the empty-replay branch so walker-only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests.
* fix(s3/lifecycle): cold-start walker covers pre-existing objects
runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule.
The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside the meta-log scan window.
Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain floor to runNow - maxTTL as before; the walker just handles whatever the sliding window can't reach.
TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).
* fix(s3/lifecycle): restore walker tag and null-version state
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(s3/lifecycle): parallelize shell shard sweeps
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard
Two CI bugs surfaced after PR #9466 deleted the streaming worker:
1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.
2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes.
* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout
drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events.
Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log.
* fix(s3/lifecycle): trust persisted cursor; never bump past pending events
The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at.
Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events.
Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7b8647e8bc |
fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476)
The s3tests workflow (.github/workflows/s3tests.yml) backgrounds
`weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'`
and then runs `kill -0 $pid` to confirm the worker stayed alive.
The PR-9475 restore ran dailyrun.Run once and exited cleanly — even
faster when no buckets had lifecycle rules yet ("nothing to run").
The aliveness check then failed and the s3tests job died with
"lifecycle worker died on startup". Caught on
https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401.
Fix:
- -refresh now drives an inter-pass loop. cadence=0 (default) is
one-shot, matching the test/s3/lifecycle/ integration-test
invocation that omits -refresh and expects synchronous return.
cadence>0 (the CI case) keeps the command alive until -runtime
expires, running a fresh dailyrun.Run on every tick.
- Each iteration re-loads bucket configs via
scheduler.LoadCompileInputs so rules created mid-run (the s3tests
flow creates rules AFTER the worker starts) get picked up.
- The "no rules; nothing to run" early return is gone — the
command stays alive even with an empty initial snapshot, waiting
for tests to add rules.
- -dispatch, -checkpoint, -bootstrap-interval stay accepted-but-
ignored (legacy streaming flags).
4ce027c2f3 |
fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475)
fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper
PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the premise that it was a debug-only tool. It wasn't: the s3tests CI workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/ integration tests invoke it via `weed shell` to drive lifecycle expirations on demand. Both started failing with "unknown command: s3.lifecycle.run-shard".
This PR restores the command with the same flag set so existing callers (CI scripts and integration tests) work unchanged. The implementation no longer drives the streaming dispatcher.Pipeline + scheduler.BucketBootstrapper (deleted) — instead it does one bounded dailyrun.Run pass through the same daily-replay code path the production worker exercises. The walker fires for walker-bound rules just like in the worker.
Obsolete streaming flags (-dispatch / -checkpoint / -refresh / -bootstrap-interval) are accepted-but-ignored so existing scripts don't need to drop them.
ce5768fab1 |
feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash (#9473)
* feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash
dailyrun.Config.RetentionWindow has been wired since Phase 4b but the handler never supplied a value, so runShard always fell back to maxTTL and engine.PromotedHash hashed nothing. The partition-flip recovery trigger was dormant by design "until the handler plumbs the real meta-log retention here."
This PR plumbs it via a new admin form field: Meta-Log Retention (days) — 0 = unbounded (current behavior). When set, ParseConfig converts days to a time.Duration on cfg.MetaLogRetention. The handler passes it as dailyrun.Config.RetentionWindow, which runShard then feeds to engine.PromotedHash.
Rules whose TTL exceeds the declared window land in the walk partition; the next time an operator shrinks retention so a previously replay-eligible rule slips past it, PromotedHash mismatches → recovery branch fires → walker re-evaluates the rule across the whole filer tree.
0 stays the default, so existing deployments see no behavior change.
* chore(s3/lifecycle): rephrase days->duration conversion
gemini-code-assist flagged the original form as a compile error, which it wasn't (time.Duration is a named int64 and supports * with other time.Durations — the test suite verified the value was correct). The suggested form is more idiomatic regardless: days*24 happens in int64 space before the lift to time.Duration, so the unit is unambiguous.
f51468cf73 |
Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"
This reverts commit
43a8c4fdca |
Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"
This reverts commit
f28c7ce6df |
master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:
- Reject heartbeats whose Ip does not match the gRPC peer's source
address. Loopback peers are still trusted; operators behind a proxy
can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
and drop foreign re-claims. Non-EC volume claims are bounded by the
replica copy count so legitimate replicas still register. EC
ownership is keyed by (vid, shard_id) so the same vid can legitimately
be split across many peers as long as their EcIndexBits are disjoint;
rejected bits are cleared from the bitmap and the parallel ShardSizes
array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
so disconnect cleanup is O(M) in what that peer held rather than O(N)
over the whole map.
Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.
The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
10cc06333b |
cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters.
Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command.
Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error.
Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master.
5004b4e542 |
feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)
Phase 5a (PR #9465) retired the algorithm flag and made daily_replay the only execution path. The streaming-side code (scheduler.Scheduler, scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher, dispatcher.FilerPersister, and their tests) has had no in-tree caller since then. This PR deletes it.
Net change: ~4800 lines removed, ~130 added (the scheduler/configload tests' helper file the deleted bootstrap_test.go used to host).
Removed:
- weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test, scheduler,scheduler_test,pipeline_fanout_test, refresh_default,refresh_s3tests}.go
- weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test, dispatcher_helpers_test,edge_cases_test,multi_shard_test, pipeline,pipeline_test,pipeline_helpers_test,toproto_test, dispatch_ticks_default,dispatch_ticks_s3tests}.go
- weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go (FilerPersister deleted; FilerStore tests don't need their own file)
- weed/shell/command_s3_lifecycle_run_shard{,_test}.go (debug-only shell command that only ever wrapped the streaming pipeline; the production worker now exercises the same path every daily run)
Trimmed:
- dispatcher/filer_persister.go down to FilerStore + NewFilerStoreClient — the small interface daily_replay's cursor persister (dailyrun.FilerCursorPersister) plugs into.
Kept (still consumed by daily_replay):
- scheduler/configload.{go,_test.go} (LoadCompileInputs, AllActivePriorStates)
- dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister, FilerSiblingLister)
- dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)
scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream, dirEntry, fileEntry — helpers the configload tests used to share with the deleted bootstrap_test.go.
Updates the handler-package doc strings and one reader-package comment that still named the streaming pipeline.
* fix(s3/lifecycle): hold lock through tree read in test filer client
gemini caught an inconsistency in scheduler/testhelpers_test.go: LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was releasing the lock before reading c.tree. The map is effectively static during tests so there's no actual race today, but matching the convention keeps the helper safe if a future test mutates the tree mid-run.
745e864bda |
feat(s3/lifecycle): retire algorithm flag, daily_replay is the only path (Phase 5a) (#9465)
feat(s3/lifecycle): remove algorithm flag, daily_replay is the only path (Phase 5a)
With Phase 4b on master the daily_replay path covers every rule kind
and the streaming algorithm has no remaining responsibilities. This
PR retires the algorithm flag from the worker:
- Drop the "Algorithm" enum field from AdminConfigForm and its
DefaultValues entry.
- Drop the if/else routing in Execute — every Execute call now
routes straight into executeDailyReplay.
- Drop the streaming-only worker fields (DispatchTick,
CheckpointTick, RefreshInterval, BootstrapInterval) and their
matching form fields. None of them are read by the daily_replay
path; keeping them in the form would suggest tuning knobs that
don't do anything.
- Drop AlgorithmStreaming / AlgorithmDailyReplay constants and the
Config.Algorithm field.
The streaming-path packages (s3lifecycle/scheduler, s3lifecycle/dispatcher)
remain on the tree; they're now reachable only by the
weed shell s3.lifecycle.run-shard debug command and the few helpers
(LoadCompileInputs, FilerStore, FilerSiblingLister) the daily_replay
worker still uses. Phase 5b deletes the dead code.
Tests prune the cadence-default assertions to the single remaining
field (max_runtime_minutes).
2f682303fb |
fix(s3/lifecycle): align walker dispatch error label to RPC_ERROR (#9464)
Follow-up to PR #9459 (merged before this fix landed). The walker dispatcher's RPC failure paths were labeled "TRANSPORT_ERROR" and "NIL_RESPONSE"; streaming (dispatcher/dispatcher.go) and the replay drain (processMatches in run.go via #9462) use "RPC_ERROR" for the same condition.
Aligning so a single Prometheus query covers all three delete paths. Folds nil-response under RPC_ERROR rather than a separate label — operationally it's the same class of failure (server returned no usable response).
31a579d12a |
build(deps): bump github.com/rclone/rclone from 1.73.5 to 1.74.1 (#9455)
Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.5 to 1.74.1.
- [Release notes](https://github.com/rclone/rclone/releases)
- [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md)
- [Commits](https://github.com/rclone/rclone/compare/v1.73.5...v1.74.1)
---
updated-dependencies:
- dependency-name: github.com/rclone/rclone
  dependency-version: 1.74.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
||
|
|
2212cc8a5f |
shell: expose retention flags on mq.topic.configure (#9416)
* shell: expose retention flags on mq.topic.configure
The ConfigureTopicRequest proto already carries a TopicRetention message
({retention_seconds, enabled}), and GetTopicConfigurationResponse / the
admin UI both surface it — but the `mq.topic.configure` shell command
sent Retention: nil unconditionally, leaving CLI and IaC users with no
way to set retention without going through the admin UI. Add three
flags so the shell matches the existing surface:
-retention <duration> Go duration string (e.g. 168h, 30m)
-retentionSeconds <int> raw seconds (mutually exclusive with -retention)
-retentionEnabled bool toggle for retention enforcement
Behavior:
- If none of the retention flags are set, the request omits Retention
(`nil`) — preserving the prior "leave server-side state alone"
semantics for callers that only care about partition count.
- Setting any retention flag populates TopicRetention with both fields
so existing operators keep working when only one is provided.
- -retention and -retentionSeconds together is a hard error (ambiguous);
negative values are rejected.
Explicitly-provided flags are detected via flag.FlagSet.Visit, so
default values of 0 / false are distinguishable from "not provided".
Help text updated with examples for both setting a TTL and disabling
retention on an existing topic. No proto / server changes needed; this
is a CLI-only patch.
* broker, shell: address review for mq.topic.configure retention
Four follow-up fixes for the review comments on the original commit:
1. Server-side: detect retention changes in the early-return path.
Previously ConfigureTopic returned without persisting when
partitionCount + schema were unchanged, even if the request supplied
a different Retention. Now the early-return branch checks both
schema and retention, persists either or both when they differ, and
logs which fields actually changed.
2. Server-side: preserve existing retention when request.Retention is
nil in the fresh-allocation path. Capture the prior resp.Retention
before overwriting `resp = &mq_pb.ConfigureTopicResponse{}`, and
only overwrite the new resp.Retention with `request.Retention` when
the request actually supplies one. Otherwise carry the previous
retention forward, so partition-count changes (or any path that
bypasses the early-return branch) don't accidentally clear retention.
3. CLI: switch the `-retention` / `-retentionSeconds` mutual-exclusion
check from value-based (`!= 0`) to flag.FlagSet.Visit-based, so an
explicit `-retention=0 -retentionSeconds=N` is also rejected.
4. CLI: when any retention flag is set, fetch the current
GetTopicConfiguration and fill in the fields the user didn't
explicitly provide. This means `-retentionEnabled` alone no longer
zeros the existing duration, and `-retention 24h` alone preserves
the existing enabled flag. When the topic doesn't exist yet, the
GetTopicConfiguration error is treated as "no current state" and
we proceed with just the user-supplied values.
Build / vet / gofmt / `go test ./weed/shell/... ./weed/mq/broker/...`
clean.
|
||
|
|
89608e5499 | chore(weed/util/log_buffer): remove unused functions (#9444) | ||
|
|
21054b6c18 |
volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use it to gate destructive volume admin RPCs. IsWhiteListed keeps its allow-all-when-empty semantics for HTTP compatibility. For TCP peers with an empty whitelist, off-host callers are rejected but loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly cohabits with the master/filer on a single host and in integration-test clusters; the loopback exception keeps cluster-internal admin traffic working without -whiteList while still locking out off-host attackers. Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check entirely. When `weed server` runs master+volume+filer in a single process the master dials the volume server in-process and the peer address surfaces as "@", which has no parseable IP. Such a caller shares our OS process and cannot be spoofed by a remote attacker, so we treat it as trusted by construction. The gate also tolerates a nil guard (developmental / embedded path) and only enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet. |
||
|
|
495632730c |
feat(s3/lifecycle): daily-replay observability — metrics + summary log (Phase 6) (#9462)
* feat(s3/lifecycle): daily-replay observability metrics + per-run summary log
Operators have no Prometheus signal today for the daily_replay path
beyond the cluster-rate-limiter wait histogram. Phase 6 adds the
three baseline questions: how long does a shard take, how many events
did it scan, and what did dispatch produce.
- S3LifecycleDailyRunShardDurationSeconds (histogram, label=shard):
wall-clock per shard. p95 climbing toward MaxRuntime means the
shard is brushing its budget.
- S3LifecycleDailyRunEventsScanned (counter, label=shard): meta-log
events drainShardEvents processed. Pairs with the duration so a
spike in events-per-shard correlates with a slow shard.
- S3LifecycleDispatchCounter (existing, reused): processMatches now
increments this with the outcome label, so streaming and
daily_replay paths share one outcome view. Transport errors are
counted under outcome="TRANSPORT_ERROR".
dailyrun.Run logs a per-run summary at V(0): status / shards /
errors / duration. The summary is the at-a-glance line operators read
in /var/log to confirm a run completed.
Test pins the dispatch-counter increment with a unique
bucket/kind/outcome triple so a refactor that drops the
instrumentation call surfaces as a test failure.
* fix(s3/lifecycle): align dispatch error label + clean test labels
Two PR-9462 review fixes from gemini:
1. processMatches' transport-failure label was "TRANSPORT_ERROR";
streaming's dispatcher uses "RPC_ERROR" for the same condition
(see dispatcher/dispatcher.go). Use "RPC_ERROR" here too so
the same Prometheus query covers both delete paths.
2. The dispatch-counter assertion test now deletes its label row
on exit so the in-process Prometheus registry doesn't accumulate
per-test state across the suite.
|
||
|
|
f954781169 |
feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config
Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.
Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.
* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker
Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.
Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.
Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.
* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch
Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).
Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.
Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.
* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard
Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.
Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.
Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.
* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker
Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.
listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.
Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).
* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc
Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:
- Path = logical key (not the .versions/<file> physical path), so
bootstrap.Walk's MatchPath uses the user's intended path.
- VersionID per sibling (version_id or "null").
- IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
to explicit-null-bare, falling back to newest-by-mtime.
- NoncurrentIndex rank computed against the latest's position.
- SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
previous-newer sibling's mtime (legacy derivation).
- IsDeleteMarker from ExtDeleteMarkerKey.
- NumVersions = len(siblings).
Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.
expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.
MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.
Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.
* feat(s3/lifecycle): emit MPU init records from FilerListFunc
Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.
isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.
Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.
* feat(s3/lifecycle): wire walker into executeDailyReplay
Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.
Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.
Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.
* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate
Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow that PromotedHash used, so the
partitioning is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.
With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.
* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment
Two PR-review fixes on 9459:
1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
add a defensive nil-response check so the walk halts cleanly
instead. Spotted by coderabbit.
2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
"stays empty" in fallback mode, which gemini correctly pointed out
is only true once rules are active. During bootstrap (rules
compiled but IsActive=false) MaxEffectiveTTL is 0 while
PromotedHash counts every non-disabled rule, so promoted becomes
non-empty and the next post-activation run hits the recovery
branch. That's the intended bootstrap walk — rewrite the comment
to explain it rather than misstate the invariant.
Test: pins nil-response → error path on WalkerDispatcher.
* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion
Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.
Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.
Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).
* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher
Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.
Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
|
||
|
|
69da20bdae |
volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the destructive volume-server RPCs, and reject cluster-internal remote S3 endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before dialing. Pin the validated address against DNS rebinding by routing the AWS SDK through an HTTP transport whose DialContext re-resolves the host and re-applies the deny list on every dial, so an endpoint that resolves to a public IP at validate-time and then flips to 127.0.0.1 at connect time is refused. Operators that legitimately fetch from private hosts can opt out with -volume.allowUntrustedRemoteEndpoints. |
||
|
|
5e8f99f40a |
filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires a Bearer token in the authorization metadata, signed with the filer write-signing key. The service refuses to register on a filer that has no jwt.filer_signing.key set, so the unauthenticated default is gone: operators who use these RPCs must configure the key and attach a token on every call. Bearer scheme matching is case-insensitive (RFC 6750), every handler nil-checks req before dereferencing it, and tests now cover the expired-token path. |
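The case-insensitive scheme matching can be sketched as follows; `bearerToken` is an illustrative helper, not the SeaweedFS symbol:

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an authorization metadata value,
// matching the scheme case-insensitively per RFC 6750 — "Bearer",
// "bearer", and "BEARER" are all accepted.
func bearerToken(authHeader string) (string, bool) {
	const prefix = "bearer "
	if len(authHeader) <= len(prefix) ||
		!strings.EqualFold(authHeader[:len(prefix)], prefix) {
		return "", false
	}
	return authHeader[len(prefix):], true
}

func main() {
	tok, ok := bearerToken("Bearer eyJhbGciOi...")
	fmt.Println(ok, tok != "") // true true
}
```

A plain `strings.HasPrefix(h, "Bearer ")` would reject lowercase schemes that RFC 6750 permits; `strings.EqualFold` on the fixed-length prefix avoids that without allocating.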
||
|
|
05ed5c9ae8 |
filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar sibling paths. Match on /-separated path components after path.Clean normalisation instead. |
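The component-wise match can be sketched as follows — an illustrative reimplementation of the idea, not the filer's actual code:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// componentPrefixMatch reports whether p falls under allowedPrefix when
// both are compared on /-separated components after path.Clean, so
// /tenant1 matches /tenant1 and /tenant1/obj but not /tenant1234.
func componentPrefixMatch(allowedPrefix, p string) bool {
	prefix := path.Clean("/" + allowedPrefix)
	target := path.Clean("/" + p)
	if prefix == "/" {
		return true
	}
	// Either an exact component match, or the target continues past a
	// "/" boundary — a literal byte prefix without the boundary check
	// is what let /tenant1 match /tenant1234.
	return target == prefix || strings.HasPrefix(target, prefix+"/")
}

func main() {
	fmt.Println(componentPrefixMatch("/tenant1", "/tenant1/data.bin")) // true
	fmt.Println(componentPrefixMatch("/tenant1", "/tenant1234/x"))     // false
}
```

path.Clean also neutralises `..` segments, so `/tenant1/../tenant2/x` normalises to `/tenant2/x` before the comparison runs.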
||
|
|
bd687a2d7a |
build(deps): bump google.golang.org/api from 0.274.0 to 0.278.0 (#9451)
Bumps [google.golang.org/api](https://github.com/googleapis/google-api-go-client) from 0.274.0 to 0.278.0. - [Release notes](https://github.com/googleapis/google-api-go-client/releases) - [Changelog](https://github.com/googleapis/google-api-go-client/blob/main/CHANGES.md) - [Commits](https://github.com/googleapis/google-api-go-client/compare/v0.274.0...v0.278.0) --- updated-dependencies: - dependency-name: google.golang.org/api dependency-version: 0.278.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
fe0d533b9d |
build(deps): bump github.com/klauspost/compress from 1.18.5 to 1.18.6 (#9452)
Bumps [github.com/klauspost/compress](https://github.com/klauspost/compress) from 1.18.5 to 1.18.6. - [Release notes](https://github.com/klauspost/compress/releases) - [Commits](https://github.com/klauspost/compress/compare/v1.18.5...v1.18.6) --- updated-dependencies: - dependency-name: github.com/klauspost/compress dependency-version: 1.18.6 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
91957d6919 |
build(deps): bump cloud.google.com/go/kms from 1.30.0 to 1.31.0 (#9453)
Bumps [cloud.google.com/go/kms](https://github.com/googleapis/google-cloud-go) from 1.30.0 to 1.31.0. - [Release notes](https://github.com/googleapis/google-cloud-go/releases) - [Changelog](https://github.com/googleapis/google-cloud-go/blob/main/documentai/CHANGES.md) - [Commits](https://github.com/googleapis/google-cloud-go/compare/kms/v1.30.0...dlp/v1.31.0) --- updated-dependencies: - dependency-name: cloud.google.com/go/kms dependency-version: 1.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
fade4ce77d |
build(deps): bump github.com/rabbitmq/amqp091-go from 1.10.0 to 1.11.0 (#9454)
Bumps [github.com/rabbitmq/amqp091-go](https://github.com/rabbitmq/amqp091-go) from 1.10.0 to 1.11.0. - [Release notes](https://github.com/rabbitmq/amqp091-go/releases) - [Changelog](https://github.com/rabbitmq/amqp091-go/blob/main/CHANGELOG.md) - [Commits](https://github.com/rabbitmq/amqp091-go/compare/v1.10.0...v1.11.0) --- updated-dependencies: - dependency-name: github.com/rabbitmq/amqp091-go dependency-version: 1.11.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
18677a8430 |
fix(storage): refuse to load .vif-only entry as regular volume when .ecx exists (#9448) (#9461)
fix(storage): refuse to load .vif-only entry as regular volume when .ecx exists Defensive root-cause fix for issue #9448. The detection-side guard in the EC plugin worker already breaks the infinite re-encode loop, but the underlying volume-server state — `.vif` preserved next to EC shards when the source replica is destroyed — can still re-arm a phantom regular volume via the MountVolume / LoadVolume path. In loadExistingVolume, the existing `.ecx`-present check is gated on the caller passing `skipIfEcVolumesExists=true`. The startup-scan path (concurrentLoadingVolumes) does pass that flag and correctly skips the `.vif`. The LoadVolume → loadExistingVolume(…, false, …) path used by VolumeMount does NOT, so it falls through to NewVolume, which calls v.load with createDatIfMissing=true and creates a phantom empty `.dat`. The master then reports the volume as regular and EC detection re-proposes it. Hoist the `.ecx`-present check so it runs unconditionally for `.vif` entries: if the EC index is on the disk, the `.vif` belongs to those EC shards, never to a regular volume that should be resurrected. Pure OSS clusters never reach this exact state today (OSS deletes `.vif` unconditionally during Destroy), but the guard hardens the load path against any future path that leaves the same state. Test: - TestLoadExistingVolumeSkipsVifWhenEcxPresent builds the exact post-#9448 disk layout (`.vif` + `.ecx`, no `.dat`) and asserts loadExistingVolume(skipIfEcVolumesExists=false) returns false, does not create a placeholder `.dat`, and does not register a phantom volume in l.volumes. |
||
|
|
d221a64262 |
fix(ec): skip re-encode when EC shards already exist for the volume (#9448) (#9458)
* fix(ec): skip re-encode when EC shards already exist for the volume (#9448) When an earlier EC encoding succeeded but the post-encode source-delete left a regular replica behind on one of the servers, the next detection cycle proposes the same volume again. The new encode tries to redistribute shards to targets that already have them mounted, the volume server returns `ec volume %d is mounted; refusing overwrite`, the task fails, and detection re-queues the volume. The cycle repeats forever — issue #9448. The existing `metric.IsECVolume` skip catches the case where the canonical metric is reported on the EC-shard side of the heartbeat, but when the master sees BOTH a regular replica AND its EC shards in the same volume list, the canonical metric we pick is the regular replica and IsECVolume is false. Add a second guard that checks the topology directly via `findExistingECShards` (already present and indexed) and skip the volume when any shards exist, logging a warning that points the admin at the stuck source. This breaks the loop. Auto-cleanup of the orphaned replica is left as follow-up work — deleting a source replica from inside the detector is only safe with a re-verification step right before the delete, plus a config opt-in, and is best done in its own change. * fix(ec): #9448 guard only fires when EC shard set is complete The first version of the #9448 guard tripped on `len(existingShards) > 0`, which is broader than necessary. The existing recovery branch in the encode arm (around the `existingECShards` block, ~line 216) is designed to fold partial leftover shards from a previously failed encode into the new task as cleanup sources. Skipping unconditionally on any existing shards made that branch dead code, regressing the recovery behavior Gemini flagged in the review of |
||
|
|
644664bbee |
feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) (#9457)
* feat(s3/lifecycle): swap daily_run to engine hash APIs (Phase 4a) Replace the local replay-content-hash / max-effective-TTL helpers in dailyrun with the engine package's canonical versions (ReplayContentHash, MaxEffectiveTTL, PromotedHash) that landed with the Phase 4 view surface. Adds PromotedHash to the cursor's recovery triggers: a partition flip (rule moving between replay and walk because retention shifted) now fires the rule-change branch alongside RuleSetHash mismatch. The retentionWindow is set to MaxEffectiveTTL today, which keeps the promoted set empty and the trigger dormant; Phase 4b will plumb the real meta-log retention boundary so true scan_only promotions are detected. Cursor schema is unchanged — PromotedHash was already persisted as the zero hash in Phase 2. * docs(s3/lifecycle): note the one-time cursor rewind on hash format change gemini-code-assist flagged that swapping localReplayContentHash for engine.ReplayContentHash changes the persisted RuleSetHash byte layout (sort order + tagged-field encoding). Phase-2 cursors mismatch on first post-upgrade run and drop into the rule-change branch. Going with option 3 (document the intentional one-time rewind). The rewind is bounded to runNow - maxTTL (not time-zero), self-healing on the next save, and daily_replay is off by default so the affected population is limited to early adopters of the algorithm flag. A migration shim or a hash-compat layer would carry the legacy encoder forever for one bounded re-scan; not worth it. Comment in runShard makes the trade explicit so a future reader doesn't hunt for the "why does my cursor rewind once after upgrade" mystery. * chore(s3/lifecycle): trim verbose comments in dailyrun Cut multi-paragraph headers and narration that just described what the code does. Kept the small WHY notes (per-match skip vs per-rule, the one-time post-upgrade cursor rewind, scan_only rejection rationale). Same behavior, ~150 fewer lines of comment. 
* fix(s3/lifecycle): persist PromotedHash on the successful runShard save The comment-trim pass dropped the field alongside a "stays empty in Phase 2" comment. Harmless today (promoted is always zero), but Phase 4b turns promoted into a real value — and a save that writes zero would make the next run falsely detect drift and rewind. Spotted by gemini-code-assist on PR 9457. Other save paths (recovery, drain-error) already persisted it; the success path is the only one that was missing it. Now consistent. |
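The canonical-hash idea behind the one-time cursor rewind can be sketched as follows. The field layout and type here are hypothetical, not the engine's actual encoding — the point is only that sort order and tagged-field encoding are part of the digest:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// rule is a minimal stand-in for a compiled lifecycle rule.
type rule struct {
	ID     string
	Prefix string
	Days   int
}

// ruleSetHash sorts by rule ID, then encodes each field with an
// explicit tag so the byte layout is unambiguous. Any change to the
// sort order or the tagging changes the digest — which is exactly why
// swapping the local hash helper for the engine's canonical one makes
// persisted Phase-2 cursors mismatch once and rewind.
func ruleSetHash(rules []rule) [32]byte {
	sorted := append([]rule(nil), rules...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].ID < sorted[j].ID })
	h := sha256.New()
	for _, r := range sorted {
		fmt.Fprintf(h, "id=%q;prefix=%q;days=%d;", r.ID, r.Prefix, r.Days)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	a := ruleSetHash([]rule{{"b", "/logs/", 30}, {"a", "/tmp/", 7}})
	b := ruleSetHash([]rule{{"a", "/tmp/", 7}, {"b", "/logs/", 30}})
	fmt.Println(a == b) // true: input order doesn't affect the digest
}
```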
||
|
|
532b088262 |
fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423) When EC shards land on a target whose disk type differs from the source volume's, master heartbeats wrongly reported under the target disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the target server applies it to the in-memory EcVolume via SetDiskType so the mount notification and steady-state heartbeat both carry the source's disk type. Empty value falls back to the location's disk type (used by disk-scan reload paths). The override is not persisted with the volume — disk type stays an environmental property and .vif remains portable. * fix(ec): plumb source disk type through plugin worker (#9423) Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved), populate it from the metric the detector already collects, thread it through ec_task into the MountEcShards helper, and forward it on the VolumeEcShardsMount RPC. * fix(ec): mirror source disk type plumbing in rust volume server (#9423) The volume_ec_shards_mount handler now forwards source_disk_type into mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it overrides ec_vol.disk_type (and each mounted shard's disk_type) via the new set_disk_type method; empty value keeps the location's disk type, so disk-scan reload and reconcile paths are unchanged. Also picks up two pre-existing proto drifts that 'make gen' synced from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds in remote.proto). * feat(ec): bias placement toward preferred disk type (#9423) Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType. When PreferredDiskType is non-empty, SelectDestinations partitions suitable disks into matching/fallback tiers and runs the rack/server/ disk-diversity passes on the matching tier first; the fallback tier is only consulted if the matching pool can't satisfy ShardsNeeded. PlacementResult.SpilledToOtherDiskType lets callers warn on spillover. 
Empty PreferredDiskType keeps the existing single-pool behavior. * fix(ec): plumb source disk type into placement planner (#9423) diskInfosToCandidates now copies DiskInfo.DiskType into the placement candidate, and ecPlacementPlanner.selectDestinations forwards metric.DiskType as PreferredDiskType so EC shards land on disks matching the source volume's disk type when possible. A glog warning fires when placement had to spill to other disk types. * test(ec): integration coverage for source-disk-type plumbing (#9423) store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a shard physically lives on an HDD location, MountEcShards is called with sourceDiskType="ssd", and the test asserts that the in-memory EcVolume, the mounted shard, the NewEcShardsChan notification, and the steady-state heartbeat all report under the source's disk type. A companion test pins the empty-source path so disk-scan reload keeps the location's disk type. detection_disk_type_test exercises the worker plumbing: with a cluster of nodes carrying both HDD and SSD disks, planECDestinations must place every shard on SSD when metric.DiskType="ssd"; with only one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout via spillover (and log a warning). * revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423) make gen pulled two pre-existing OSS changes into the rust proto tree (LockRingUpdate / by_plugin in master.proto, listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as scope creep — none of the rust EC fix references those fields. Restore both files to origin/master so this branch only touches EC-related symbols. * fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423) partitionByDiskType used raw string comparison, so a PreferredDiskType of "hdd" never matched candidates whose DiskType is "" (the HardDriveType sentinel that weed/storage/types uses). 
EC encoding of an HDD source would spill onto any HDD reporting "" even when the cluster has plenty of matching capacity. Normalize both sides through normalizeDiskType, which lowercases and folds "" → "hdd", mirroring types.ToDiskType without taking a dependency on it. selectFromTier's rack-diversity pass also kept revisiting racks the preferred tier had already used when running on the fallback tier, which negated PreferDifferentRacks on spillover. Skip racks already in usedRacks so fallback placements still spread onto new racks. * fix(ec): empty-source remount must not clobber existing disk type (#9423) mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount, disk-scan reload, orphan-shard reconcile). After an RPC sets the source-derived disk type, any later call passing source_disk_type="" was resetting ec_vol.disk_type back to the location's value, which reintroduces the heartbeat drift this PR is meant to fix. Only default to the location's disk type when the EC volume is fresh (no shards mounted yet); otherwise leave the recorded type alone so empty-source reloads preserve whatever the original mount RPC set. |
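The normalization described above can be sketched as follows; the helper names mirror the commit's description but are written here as a standalone illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDiskType lowercases the tag and folds the empty string —
// the HardDriveType sentinel — to "hdd", mirroring the behavior of
// types.ToDiskType without importing weed/storage/types.
func normalizeDiskType(t string) string {
	t = strings.ToLower(strings.TrimSpace(t))
	if t == "" {
		return "hdd"
	}
	return t
}

// sameDiskType is what a tier split should use instead of a raw string
// comparison: with raw comparison, PreferredDiskType "hdd" never
// matches candidates reporting "".
func sameDiskType(preferred, candidate string) bool {
	return normalizeDiskType(preferred) == normalizeDiskType(candidate)
}

func main() {
	fmt.Println(sameDiskType("hdd", ""))    // true: sentinel folds to hdd
	fmt.Println(sameDiskType("SSD", "ssd")) // true: case-insensitive
}
```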
884b0bcbfd
feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)
Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.
Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.
Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
the input ClusterContext and injects two metadata keys for
s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
config; other job types pass through unchanged.
Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
on (admin-config field names, metadata keys). Plain strings on the
admin side keep weed/admin/plugin free of a dependency on the
s3_lifecycle worker package; the two sets of constants are pinned
to identical values and a mismatch would silently disable rate
limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
Missing/empty/non-positive values → no limiter (legacy unlimited
behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
(rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).
Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
long each Limiter.Wait blocks before a LifecycleDelete RPC.
Operators tune the cap by reading p95 — near-zero means the cap
isn't binding, a long tail at 1/rate means it is.
Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
pass-through for non-allocator job types, rps=0 / no-executors
skip, even sharing, burst sharing, burst=0 omit (worker default
kicks in), burst floor of 1, no mutation of input metadata, nil
input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
covering nil/empty/missing metadata, non-positive/invalid rate,
positive rate builds correctly, burst missing defaults to 2× rate,
tiny rate clamps burst to ≥1.
Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.
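The worker-side parsing described above (missing/empty/non-positive rate → no limiter; missing or zero burst → 2 × rate, clamped to ≥1) could be sketched as below. The metadata key names here are placeholders, not the real constants from the s3_lifecycle package, and the real code feeds the result into golang.org/x/time/rate.NewLimiter rather than returning raw numbers:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseLimiterParams reads the per-worker share shipped in
// ClusterContext.Metadata. A missing, empty, or non-positive rate
// disables limiting entirely (legacy unlimited behavior). Burst
// defaults to 2 × rate and is clamped to >=1 so a tiny rate never
// produces a zero-burst bucket that admits no requests.
func parseLimiterParams(metadata map[string]string) (rps float64, burst int, ok bool) {
	rps, err := strconv.ParseFloat(metadata["deletes_per_second"], 64)
	if err != nil || rps <= 0 {
		return 0, 0, false // no cap configured
	}
	burst, berr := strconv.Atoi(metadata["deletes_burst"])
	if berr != nil || burst <= 0 {
		burst = int(2 * rps) // worker-side default
	}
	if burst < 1 {
		burst = 1 // floor: the bucket must admit at least one request
	}
	return rps, burst, true
}

func main() {
	r, b, ok := parseLimiterParams(map[string]string{"deletes_per_second": "0.2"})
	fmt.Println(r, b, ok) // tiny rate: 2*0.2 truncates to 0, clamps to burst 1
}
```

Keeping the parse failure path identical to the "key absent" path is what makes a constant mismatch between the admin and worker packages degrade to unlimited rather than crash — which is also why the commit message stresses that the two sets of constants must stay pinned to identical values.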
* fix(s3/lifecycle): divide cluster budget by active workers, not all capable
gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.
Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.
Tests swap the SharedEvenly case for two pinned scenarios:
- SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
- SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
- MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4
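The revised allocation math — divisor = min(executors, maxJobsPerDetection), clamped to ≥1 — could be sketched as follows. The function name and signature are illustrative, not the PR's actual helper:

```go
package main

import "fmt"

// perWorkerShare divides the cluster budget by the number of workers
// that can actually run the job at once: min(capable executors,
// MaxJobsPerDetection), clamped to >=1. A singleton job (maxJobs=1)
// therefore hands its single active worker the full budget instead
// of 1/N of it.
func perWorkerShare(clusterRPS float64, executors, maxJobsPerDetection int) float64 {
	divisor := executors
	if maxJobsPerDetection < divisor {
		divisor = maxJobsPerDetection
	}
	if divisor < 1 {
		divisor = 1
	}
	return clusterRPS / float64(divisor)
}

func main() {
	fmt.Println(perWorkerShare(100, 4, 1))  // singleton: full 100
	fmt.Println(perWorkerShare(100, 4, 4))  // parallel-limited: 25 each
	fmt.Println(perWorkerShare(100, 4, 10)) // maxJobs > executors: divisor 4
}
```

The three calls mirror the pinned test scenarios listed above (100/1, 25/worker, divisor 4).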
* feat(s3/lifecycle): drop Worker Count knob from admin config form
The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.
Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).
handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
91bcc910eb
build(deps): bump actions/dependency-review-action from 4.9.0 to 5.0.0 (#9450)
Bumps [actions/dependency-review-action](https://github.com/actions/dependency-review-action) from 4.9.0 to 5.0.0.
- [Release notes](https://github.com/actions/dependency-review-action/releases)
- [Commits](