95 Commits

Chris Lu
3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.
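The verification gate described above can be sketched as follows. This is a minimal illustration, not the actual SeaweedFS code: shardBits and requireFullShardSet are hypothetical names, assuming a 32-bit per-server shard bitmap in the spirit of erasure_coding.ShardBits.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardBits mirrors the idea of a per-server shard bitmap: one bit per shard id.
type shardBits uint32

// requireFullShardSet unions the shard-id bitmaps reported by every
// destination server and reports whether all totalShards distinct ids
// were observed — the condition the gate demands before the source
// volume may be deleted.
func requireFullShardSet(perServer []shardBits, totalShards int) bool {
	var union shardBits
	for _, b := range perServer {
		union |= b
	}
	return bits.OnesCount32(uint32(union)) >= totalShards
}

func main() {
	// Shards 0-6 on one server, 7-13 on another: the full set of 14.
	full := []shardBits{0x007F, 0x3F80}
	// One destination never mounted its shards: only 7 of 14 observed.
	partial := []shardBits{0x007F}
	fmt.Println(requireFullShardSet(full, 14), requireFullShardSet(partial, 14))
}
```

The union-then-popcount shape matters: two servers reporting the same shard id must not double-count toward the total.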

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.
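The pre-delete probe can be sketched like this — a hypothetical illustration only, with volumeFileStatus standing in for the ReadVolumeFileStatus RPC response and verifyDecodedVolume an invented helper name:

```go
package main

import (
	"errors"
	"fmt"
)

// volumeFileStatus stands in for the ReadVolumeFileStatus response fields
// the gate cares about.
type volumeFileStatus struct {
	DatFileSize uint64
	IdxFileSize uint64
}

// verifyDecodedVolume refuses EC-shard deletion when the regenerated
// normal volume looks empty: a 0-byte .dat or .idx means generate or
// mount silently failed, and deleting the shards would strand the data.
func verifyDecodedVolume(st *volumeFileStatus) error {
	if st == nil {
		return errors.New("nil ReadVolumeFileStatus response")
	}
	if st.DatFileSize == 0 || st.IdxFileSize == 0 {
		return fmt.Errorf("regenerated volume incomplete: dat=%d idx=%d bytes",
			st.DatFileSize, st.IdxFileSize)
	}
	return nil
}

func main() {
	fmt.Println(verifyDecodedVolume(&volumeFileStatus{DatFileSize: 4096, IdxFileSize: 16}))
	fmt.Println(verifyDecodedVolume(&volumeFileStatus{DatFileSize: 0, IdxFileSize: 16}))
}
```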

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu
0dde6a8c84 refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout (#9494)
* refactor(s3/lifecycle): drop Per-Run Time Limit knob; use scheduler's Execution Timeout

"Per-Run Time Limit (minutes)" duplicated the admin scheduler's
"Execution Timeout (s)" — both are wall-clock caps on the same
Execute call, stacked via context.WithTimeout. Whichever was shorter
won. Under defaults the scheduler's 90s timeout always clobbered the
worker's 60-min cap, so the "Per-Run Time Limit" knob was effectively
dead unless an operator also raised Execution Timeout, and operators
had to keep two values in agreement.

Remove the worker-side knob and declare a sane scheduler default on
the handler descriptor:

- WorkerConfigForm: nil (was: one section with one field)
- Config.MaxRuntime removed; ParseConfig drops max_runtime_minutes
- Handler no longer wraps ctx in context.WithTimeout(MaxRuntime);
  runCtx is just the ctx the scheduler passes
- AdminRuntimeDefaults.ExecutionTimeoutSeconds = 3600 (1h) and
  JobTypeMaxRuntimeSeconds = 3600 — the scheduler's global 90s
  default would otherwise kill every real run

Tests:
- TestParseConfigDefaults loses the MaxRuntime check; new
  TestParseConfigIgnoresWorkerValues documents the contract
- TestDescriptor_WorkerConfigFormIsAbsent pins that the form is gone
  so a future re-add forces a conscious revisit
- TestDescriptor_AdminRuntimeDefaultsBoundExecutionTimeout pins the
  1h default with a comment about the 90s scheduler floor

* fix(s3/lifecycle): no per-pass timeout by default

Lifecycle is a scheduled batch — its natural duration is "as long as
today's events take." The 1h default ExecutionTimeoutSeconds from the
previous commit was still a footgun: too low truncates legitimate
large-bucket passes; too high makes the value meaningless.

Set both ExecutionTimeoutSeconds and JobTypeMaxRuntimeSeconds to
math.MaxInt32 (~68 years) to say "no timeout in practice" in a
code-review-readable way. Operators who genuinely want a wall-clock
cap can set one in the admin UI; the scheduler's context.WithTimeout
machinery is unchanged (we just hand it an effectively-infinite
duration).

Note: the scheduler floors ExecutionTimeout at 90s
(defaultScheduledExecutionTimeout in weed/admin/plugin/plugin_scheduler.go),
so 0 doesn't mean "unlimited" — it clamps back to 90s. A literal
math.MaxInt32 is the way to express the intent without touching the
shared scheduler code.

Test updated to pin math.MaxInt32 and document the rationale so a
future tighter cap fails the test and forces conscious revisit.
2026-05-13 19:29:06 -07:00
Chris Lu
813f1351f8 feat(s3/lifecycle): enable scheduler by default (#9492)
S3 lifecycle is a standard bucket feature — operators set
PutBucketLifecycleConfiguration through the S3 API expecting the
configured expirations to actually fire. With the prior default
(scheduler enabled=false), buckets with lifecycle XML silently
retained data past their declared expiration until an operator
noticed and turned the scheduler on.

The failure mode of enabled-by-default is "worker runs every day
and fast-exits on buckets with no lifecycle rules" — cheap.
The failure mode of disabled-by-default is "data lingers, looks
like it expired, doesn't" — bad. Enabled-by-default matches both
the AWS S3 default behavior and the operator's natural mental
model.

Operators who want the worker off can still disable it via the
admin UI; once a persisted config exists, this descriptor default
no longer applies (the persisted Enabled state wins).

Test pins the choice so a future flip to false fails loud.
2026-05-13 16:57:10 -07:00
Chris Lu
bbc075b353 feat(s3/lifecycle): plumb WalkerInterval through worker admin config (#9485)
* feat(s3/lifecycle): throttle steady-state walker by cfg.WalkerInterval

The steady-state and empty-replay walker fired on every dailyrun.Run
invocation, which is fine when Run is called at the bucket-walk cadence
the operator intends (e.g., once per hour or once per day), but
catastrophic when a fast driver like the s3tests CI workflow or the
admin worker scheduler invokes Run at multi-second cadence — each tick
ran a full subtree scan per shard, crushing the filer.

Decouple walker cadence from Run() invocation cadence: persist
LastWalkedNs in the per-shard cursor and fire the steady-state /
empty-replay walker only when (runNow - LastWalkedNs) >= cfg.WalkerInterval.
Cold-start and recovery walker fires (RecoveryView) stay unconditional
since those are bounded events that must run when their trigger
condition (no cursor, hash mismatch) is met. A recovery-walker fire also
updates LastWalkedNs so the subsequent steady-state pass doesn't
double-walk.

cfg.WalkerInterval=0 keeps the prior "fire every pass" behavior — the
in-repo integration tests and s3tests fast driver continue to work
unchanged. Production deployments should set this to the walk cost
budget (typically 1h-24h depending on cluster size).

Cursor file is back-compat: last_walked_ns is omitempty, so cursor
files written before this change decode as LastWalkedNs=0, which
walkerDue treats as "never walked steady-state" → walker fires next
pass to establish the anchor (same path a cold-start cursor takes).
No version bump.

Operator surface for WalkerInterval is the dailyrun.Config struct;
plumbing through worker.tasks.s3_lifecycle.Config and the admin
schema is a follow-up.

* fix(s3/lifecycle): suppress walker double-fire within a single pass

Two gemini-code-assist findings:

1. walkerDue with interval=0 returned true even when lastWalkedNs ==
   runNow.UnixNano() — the cold-start / recovery branch already fired
   the walker this pass, and the steady-state fall-through fired it
   again. RecoveryView is a superset of every per-shard partition, so
   the second walk added zero coverage and burned a full subtree scan.
   Add a within-pass guard at the front of walkerDue: if the cursor's
   LastWalkedNs equals runNow's UnixNano, the walker already ran this
   pass — skip.

2. The empty-replay branch passed persisted.LastWalkedNs to walkerDue
   instead of the local lastWalkedNs variable the rest of runShard
   threads through. Trivially equal at this point in the function, but
   the inconsistency would mask a future bug if any code above the
   branch ever sets lastWalkedNs.

Test updates: TestWalkerDue gains the within-pass guard case plus a
companion "earlier same pass still fires" sanity check.
TestRunShard_ColdStartDoesNotDoubleWalk is new and pins the integration:
cold-start runShard with WalkerInterval=0 must call cfg.Walker exactly
once, not twice.

* fix(s3/lifecycle): reject negative WalkerInterval + lift within-pass guard

Two coderabbit findings:

1. validate() now rejects negative cfg.WalkerInterval. A typo like
   -1h previously fell through walkerDue's `interval <= 0` branch and
   silently re-enabled "walk every pass" — the exact behavior the
   throttle was added to prevent. The admin-config parser already
   clamps negative input to zero, but callers using dailyrun.Config
   directly (tests, embedders) now get a loud error instead.

2. Within-pass double-fire suppression moves out of walkerDue and
   into runShard's walkedThisPass local flag. walkerDue's equality
   check (lastWalkedNs == runNow.UnixNano) was correct in production
   (each pass freezes runNow at time.Now().UTC, no collisions) but
   fragile in tests that inject the same runNow across distinct
   passes — the test would see false suppression. Separating the
   concerns also makes walkerDue answer one question (persisted-state
   throttle) and runShard another (within-pass call-site dedup).
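The separation of concerns described in point 2 can be illustrated with a stripped-down sketch — walkCounter and runShardSketch are invented names, and the real runShard carries much more state:

```go
package main

import "fmt"

// walkCounter records walker invocations so the dedup is observable.
type walkCounter struct{ walks int }

func (w *walkCounter) walk() { w.walks++ }

// runShardSketch shows the call-site dedup: a local walkedThisPass flag
// ensures the recovery branch and the steady-state branch never both
// walk within one pass, without walkerDue needing timestamp-equality
// tricks that break under injected test clocks.
func runShardSketch(recoveryNeeded, steadyStateDue bool, w *walkCounter) {
	walkedThisPass := false
	if recoveryNeeded {
		w.walk()
		walkedThisPass = true
	}
	if steadyStateDue && !walkedThisPass {
		w.walk() // steady-state walk only if recovery didn't already cover it
	}
}

func main() {
	var w walkCounter
	// Cold start with WalkerInterval=0: both branches are due,
	// but the flag keeps it to a single walk.
	runShardSketch(true, true, &w)
	fmt.Println(w.walks)
}
```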

walker_interval_test.go: TestValidate_RejectsNegativeWalkerInterval
pins the new validation. TestWalkerDue's within-pass cases move out
(the function is pure throttle now); TestRunShard_ColdStartDoesNotDoubleWalk
still pins the integration behavior end-to-end.

* feat(s3/lifecycle): plumb WalkerInterval through worker admin config

#9484 added cfg.WalkerInterval to dailyrun.Config but left the worker
side wired to zero — operators couldn't actually use the throttle
without recompiling. Add the admin-schema knob:

- New constant WalkerIntervalMinutesAdminKey = "walker_interval_minutes"
  follows the MetaLogRetentionDaysAdminKey pattern (Int64, minutes
  unit, 0 = unbounded / fire every pass).
- New Config.WalkerInterval populated in ParseConfig from
  adminValues; negative / zero stay at zero so the prior "fire every
  pass" semantics keep the in-repo integration tests and the s3tests
  sub-minute driver working unchanged.
- handler.go: admin form field with operator-facing label and
  description, default in DefaultValues, value forwarded to
  dailyrun.Run via cfg.WalkerInterval.

Tests cover the default-zero, positive, and negative cases — same
shape as the MetaLogRetention tests so the parsing contract stays
consistent.

Stacked on #9484; rebase after that lands.
2026-05-13 14:09:31 -07:00
Chris Lu
d5372f9eb7 feat(s3/lifecycle): apply cluster rate limit to walker dispatch (#9471)
Phase 4b shipped the walker without plugging it into the cluster
rate.Limiter that processMatches honors. A walker hitting a large
bucket on the recovery branch could burst LifecycleDelete RPCs past
the cluster_deletes_per_second cap that streaming-replay respects.

WalkerDispatcher now takes a *rate.Limiter and waits on it before
each RPC, observing the wait time on S3LifecycleDispatchLimiterWaitSeconds
just like processMatches does. The handler passes the same limiter
to both paths so replay + walk share one budget; nil disables
throttling (unchanged default).

Tests pin: the limiter actually delays a dispatch when the burst
token is drained, and a ctx cancellation in Limiter.Wait surfaces
as an error without sending the RPC.
2026-05-13 09:24:50 -07:00
Chris Lu
ce5768fab1 feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash (#9473)
* feat(s3/lifecycle): operator-declared meta-log retention activates PromotedHash

dailyrun.Config.RetentionWindow has been wired since Phase 4b but the
handler never supplied a value, so runShard always fell back to
maxTTL and engine.PromotedHash hashed nothing. The partition-flip
recovery trigger was dormant by design "until the handler plumbs the
real meta-log retention here."

This PR plumbs it via a new admin form field:
  Meta-Log Retention (days) — 0 = unbounded (current behavior).

When set, ParseConfig converts days to a time.Duration on
cfg.MetaLogRetention. The handler passes it as
dailyrun.Config.RetentionWindow, which runShard then feeds to
engine.PromotedHash. Rules whose TTL exceeds the declared window
land in the walk partition; the next time an operator shrinks
retention so a previously replay-eligible rule slips past it,
PromotedHash mismatches → recovery branch fires → walker re-evaluates
the rule across the whole filer tree.

0 stays the default, so existing deployments see no behavior change.

* chore(s3/lifecycle): rephrase days->duration conversion

gemini-code-assist flagged the original form as a compile error,
which it wasn't (time.Duration is a named int64 and supports * with
other time.Durations — the test suite verified the value was
correct). The suggested form is more idiomatic regardless:
days*24 happens in int64 space before the lift to time.Duration,
so the unit is unambiguous.
2026-05-12 18:26:52 -07:00
Chris Lu
5004b4e542 feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay
the only execution path. The streaming-side code (scheduler.Scheduler,
scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher,
dispatcher.FilerPersister, and their tests) has had no in-tree caller
since then. This PR deletes it.

Net change: ~4800 lines removed, ~130 added (the helper file for the
scheduler/configload tests that the deleted bootstrap_test.go used to host).

Removed:
  - weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,
    scheduler,scheduler_test,pipeline_fanout_test,
    refresh_default,refresh_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,
    dispatcher_helpers_test,edge_cases_test,multi_shard_test,
    pipeline,pipeline_test,pipeline_helpers_test,toproto_test,
    dispatch_ticks_default,dispatch_ticks_s3tests}.go
  - weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go
    (FilerPersister deleted; FilerStore tests don't need their own
    file)
  - weed/shell/command_s3_lifecycle_run_shard{,_test}.go
    (debug-only shell command that only ever wrapped the streaming
    pipeline; the production worker now exercises the same path
    every daily run)

Trimmed:
  - dispatcher/filer_persister.go down to FilerStore +
    NewFilerStoreClient — the small interface daily_replay's cursor
    persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
  - scheduler/configload.{go,_test.go} (LoadCompileInputs,
    AllActivePriorStates)
  - dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister,
    FilerSiblingLister)
  - dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream,
dirEntry, fileEntry — helpers the configload tests used to share with
the deleted bootstrap_test.go.

Updates the handler-package doc strings and one reader-package
comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go:
LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was
releasing the lock before reading c.tree. The map is effectively
static during tests so there's no actual race today, but matching
the convention keeps the helper safe if a future test mutates the
tree mid-run.
2026-05-12 12:54:52 -07:00
Chris Lu
745e864bda feat(s3/lifecycle): retire algorithm flag, daily_replay is the only path (Phase 5a) (#9465)
feat(s3/lifecycle): remove algorithm flag, daily_replay is the only path (Phase 5a)

With Phase 4b on master the daily_replay path covers every rule kind
and the streaming algorithm has no remaining responsibilities. This
PR retires the algorithm flag from the worker:

  - Drop the "Algorithm" enum field from AdminConfigForm and its
    DefaultValues entry.
  - Drop the if/else routing in Execute — every Execute call now
    routes straight into executeDailyReplay.
  - Drop the streaming-only worker fields (DispatchTick,
    CheckpointTick, RefreshInterval, BootstrapInterval) and their
    matching form fields. None of them are read by the daily_replay
    path; keeping them in the form would suggest tuning knobs that
    don't do anything.
  - Drop AlgorithmStreaming / AlgorithmDailyReplay constants and the
    Config.Algorithm field.

The streaming-path packages (s3lifecycle/scheduler, s3lifecycle/dispatcher)
remain on the tree; they're now reachable only by the
weed shell s3.lifecycle.run-shard debug command and the few helpers
(LoadCompileInputs, FilerStore, FilerSiblingLister) the daily_replay
worker still uses. Phase 5b deletes the dead code.

Tests prune the cadence-default assertions to the single remaining
field (max_runtime_minutes).
2026-05-12 12:39:37 -07:00
Chris Lu
f954781169 feat(s3/lifecycle): Phase 4b — daily walker for recovery and steady state (#9459)
* feat(s3/lifecycle): plumb RetentionWindow into dailyrun.Config

Adds a Config.RetentionWindow field that runShard threads into
engine.PromotedHash. Zero (the default) falls back to maxTTL, which
matches Phase 4a behavior — PromotedHash stays empty and the
partition-flip recovery trigger stays dormant.

Pure plumbing. The handler still passes zero so nothing changes at
runtime. The walker work (Phase 4b proper) sets a real retention from
the meta-log boundary and the partition-flip trigger starts firing.

* feat(s3/lifecycle): WalkerDispatcher adapter for the daily-run walker

Phase 4b prep. Implements bootstrap.Dispatcher on top of LifecycleClient
so the same LifecycleDelete RPC drives both the meta-log replay path
and the walker. No CAS witness — the server's identityMatches treats
nil ExpectedIdentity as a bootstrap call and rebuilds the witness from
the live entry, which is the right contract for a full-tree walk.

Adds VersionID to bootstrap.Entry so versioned-bucket walks address
the right version. MPU init uses DestKey for ObjectPath (matching the
prefix-match contract); rejecting empty DestKey keeps malformed init
records out of the dispatch path.

Not wired yet — runShard still doesn't invoke the walker. Follow-up
commits add the ListFunc adapter and the recovery-branch wiring.

* feat(s3/lifecycle): wire Walker hook into runShard's recovery branch

Adds a Config.Walker callback that fires on rule-content edit /
partition flip BEFORE the cursor rewinds, so already-due objects across
the rewritten rule set get caught instead of waiting on meta-log
replay alone. The callback receives engine.RecoveryView(snap) and the
per-shard ID; nil disables it (Phase 4a behavior preserved).

Decoupling the wiring from the implementation: the handler-side
WalkerFunc that drives bootstrap.Walk via the filer is the follow-up
commit, and tests can stub the callback without standing up the full
filer/client/lister harness.

Tests pin: walker fires exactly once on hash mismatch, walker error
propagates and leaves the cursor unchanged, nil Walker is a no-op.

* feat(s3/lifecycle): WalkBuckets composes ListFunc + Dispatcher per shard

Adds dailyrun.WalkBuckets — the composable driver the handler-side
WalkerFunc will call. Iterates a bucket list, wraps the supplied
bootstrap.ListFunc with a per-shard filter (Path for non-MPU, DestKey
for MPU init), and runs bootstrap.Walk per bucket using the supplied
Dispatcher. First bucket error wins; remaining buckets log and run to
completion so one filer flake doesn't kill the shard.

Composable rather than monolithic so callers and tests can swap parts:
production uses a filer-backed ListFunc + WalkerDispatcher; tests use
bootstrap.EntryCallback + a stub. The filer-backed ListFunc is the
next commit.

Tests pin: shard filter routes only matching entries, MPU shard uses
DestKey not the .uploads/<id> path, single-bucket error propagates
while other buckets still run, ctx cancellation short-circuits between
buckets, nil guards on view/list/dispatch.

* feat(s3/lifecycle): filer-backed ListFunc for the daily-run walker

Phase 4b: dailyrun.FilerListFunc returns a bootstrap.ListFunc that
streams entries under <bucketsPath>/<bucket> by paginated SeaweedList.
Recurses into regular directories; .versions/ and .uploads/ are
skipped at this stage so they don't surface as raw children — the
sibling expansion (versioned NoncurrentDays state, MPU init dispatch)
lands in the next commit.

listAll and isVersionsDir are ported from scheduler/bootstrap.go's
same-named helpers. Phase 5 deletes the scheduler copies along with
the streaming path.

Tests pin: flat listing, recursion through nested directories,
.versions/ and .uploads/ skipped, kill-resume via the start path
contract, nil-client error, attribute propagation (mtime / size /
IsLatest default).

* feat(s3/lifecycle): versioned-sibling expansion in FilerListFunc

Adds the .versions/<key>/ expansion to the daily-run's filer-backed
ListFunc. Each call emits one bootstrap.Entry per sibling (real
version files + the bare null version, when found) with the same
sibling state the streaming bootstrap injects via reader.Event:

  - Path = logical key (not the .versions/<file> physical path), so
    bootstrap.Walk's MatchPath uses the user's intended path.
  - VersionID per sibling (version_id or "null").
  - IsLatest resolved via parent's ExtLatestVersionIdKey, falling back
    to explicit-null-bare, falling back to newest-by-mtime.
  - NoncurrentIndex rank computed against the latest's position.
  - SuccessorModTime: SuccessorFromEntryStamp if stamped, else the
    previous-newer sibling's mtime (legacy derivation).
  - IsDeleteMarker from ExtDeleteMarkerKey.
  - NumVersions = len(siblings).

Two-pass walk so .versions/ dirs run before regular files; the bare
null-version path is recorded in skipBare so pass 2 doesn't emit it
twice.

expandVersionsDir and lookupNullVersion are ported from
scheduler/bootstrap.go. Sort order, latest resolution, and successor
derivation must agree with that path verbatim so streaming and walker
reach the same verdict on the same objects. Phase 5 deletes the
scheduler copy.

MPU init (.uploads/<id>) remains skipped — the dedicated commit emits
it with IsMPUInit and DestKey.

Tests pin: pointer-wins latest resolution, no-pointer newest-sibling
fallback, explicit-null-is-latest with skipBare suppression of the
bare emission, coincidentally-named .versions folder recursing as a
regular subdir, delete-marker propagation.

* feat(s3/lifecycle): emit MPU init records from FilerListFunc

Last gap in the filer-backed ListFunc. A directory at .uploads/<id>
carrying ExtMultipartObjectKey is the MPU init record; emit one
bootstrap.Entry with IsMPUInit=true and DestKey set to the user's
intended path. The walker's MatchPath uses DestKey for prefix
matching; the WalkerDispatcher uses it for the LifecycleDelete RPC's
ObjectPath. .uploads/<id> directories without the extended key are
mid-write before metadata landed and stay skipped.

isMPUInitDir is upgraded from the path-shape-only stub to the full
shape + extended-attr check that mirrors router.mpuInitInfo and
scheduler/bootstrap.go's same-named helper.

Tests pin: valid init record emits with the right DestKey, missing
ExtMultipartObjectKey skips the directory.

* feat(s3/lifecycle): wire walker into executeDailyReplay

Activates the recovery-branch walker. The handler composes the three
Phase 4b building blocks — FilerListFunc + WalkerDispatcher + WalkBuckets
— into a dailyrun.WalkerFunc and passes it via Config.Walker. The
bucket list is derived from the compiled inputs so it matches the
engine snapshot exactly.

Effect on master behavior: when a worker observes a RuleSetHash or
PromotedHash mismatch on its persisted cursor (rule content edited /
partition flip), runShard now walks the live filer tree under the
RecoveryView before rewinding the cursor. Already-due objects across
the rewritten rule set fire immediately instead of waiting on the
sliding meta-log replay.

Still scoped to replay-eligible action kinds because
checkSnapshotForUnsupported continues to reject walker-bound rules
(ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) and
scan_only-promoted rules at the top of Run. The follow-up commit
relaxes the gate once the steady-state walker over RulesForShard's
walk view is wired so those rules fire every day, not just on rule
edits.

* feat(s3/lifecycle): steady-state walker + drop unsupported-rule gate

Adds the second walker invocation in runShard. After the recovery
check passes, runShard derives the walk view via snap.RulesForShard
(using the same retentionWindow that PromotedHash used, so the partition
is consistent) and runs the walker over it. The view holds
walker-bound action kinds (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) plus any replay-eligible rules promoted to walk by
retention shortage; an empty view skips the call so non-versioned,
replay-only deployments don't pay an O(N) bucket walk per run.

With the walker now servicing every rule kind, checkSnapshotForUnsupported
and its UnsupportedRuleError type are obsolete. router.Route gates
replay on Mode == ModeEventDriven, so walker-bound and scan_only
rules are silently dropped by replay and picked up by the walker
instead — no double-dispatch. Drop the gate, delete replayability.go
+ replayability_test.go, and remove the handler's redundant
IsUnsupportedRule branch.

* fix(s3/lifecycle): walker dispatcher nil-response guard + retention-comment

Two PR-review fixes on 9459:

1. WalkerDispatcher.Delete used to panic on a (nil, nil) RPC return —
   add a defensive nil-response check so the walk halts cleanly
   instead. Spotted by coderabbit.

2. The retentionWindow=maxTTL comment in runShard claimed PromotedHash
   "stays empty" in fallback mode, which gemini correctly pointed out
   is only true once rules are active. During bootstrap (rules
   compiled but IsActive=false) MaxEffectiveTTL is 0 while
   PromotedHash counts every non-disabled rule, so promoted becomes
   non-empty and the next post-activation run hits the recovery
   branch. That's the intended bootstrap walk — rewrite the comment
   to explain it rather than misstate the invariant.

Test: pins nil-response → error path on WalkerDispatcher.

* fix(s3/lifecycle): explicit stale-pointer fallback in versioned expansion

Reviewer caught a structural bug in expandVersionsDir's latest
resolution: when ExtLatestVersionIdKey was set but no scanned sibling
carried that id (stale pointer), the code left latestPos at the
default 0 without ever entering the no-pointer fallback. Today the
two paths yield the same value (newest sibling wins), but the
implicit fall-through makes the intent unclear and would break
silently if the no-pointer branch ever did anything more than
latestPos=0.

Track a pointerResolved flag explicitly so the no-pointer branch
(including the explicit-null-bare check) re-runs on a stale pointer.
Behavior unchanged today.

Test pins: stale pointer + two real versions falls back to
newest-sibling (vnew, not vold).
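The pointerResolved pattern can be shown in isolation — a hypothetical reduction of the latest-resolution logic, assuming siblings sorted newest-first and omitting the explicit-null-bare branch:

```go
package main

import "fmt"

// sibling is a reduced version record: id plus mtime (unused here but
// kept to suggest the newest-first sort order the real code relies on).
type sibling struct {
	versionID string
	mtimeNs   int64
}

// resolveLatest tracks pointer resolution explicitly. If the pointed-at
// id matches no scanned sibling (stale pointer), the no-pointer fallback
// re-runs instead of silently inheriting latestPos's zero value.
func resolveLatest(siblings []sibling, pointedID string) int {
	pointerResolved := false
	latestPos := 0
	if pointedID != "" {
		for i, s := range siblings {
			if s.versionID == pointedID {
				latestPos = i
				pointerResolved = true
				break
			}
		}
	}
	if !pointerResolved {
		latestPos = 0 // no pointer, or stale pointer: newest sibling wins
	}
	return latestPos
}

func main() {
	sibs := []sibling{{"vnew", 200}, {"vold", 100}}
	// Stale pointer "vgone" falls back to vnew (index 0), not vold.
	fmt.Println(resolveLatest(sibs, "vgone"), resolveLatest(sibs, "vold"))
}
```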

* feat(s3/lifecycle): walker-side dispatch metrics in WalkerDispatcher

Mirrors the Phase 6 instrumentation already on the replay side
(processMatches) onto the walker's Delete dispatch. Every walker
dispatch now bumps S3LifecycleDispatchCounter with the resolved
outcome (or TRANSPORT_ERROR / NIL_RESPONSE for the failure paths) so
streaming, daily_replay's replay drain, and daily_replay's walker
share a single per-(bucket, kind, outcome) counter view.

Lands together with the rest of Phase 4b — no new metric, just an
extra observation site for the existing one.
2026-05-12 11:39:15 -07:00
Chris Lu
d221a64262 fix(ec): skip re-encode when EC shards already exist for the volume (#9448) (#9458)
* fix(ec): skip re-encode when EC shards already exist for the volume (#9448)

When an earlier EC encoding succeeded but the post-encode source-delete
left a regular replica behind on one of the servers, the next detection
cycle proposes the same volume again. The new encode tries to redistribute
shards to targets that already have them mounted, the volume server
returns `ec volume %d is mounted; refusing overwrite`, the task fails,
and detection re-queues the volume. The cycle repeats forever — issue
#9448.

The existing `metric.IsECVolume` skip catches the case where the canonical
metric is reported on the EC-shard side of the heartbeat, but when the
master sees BOTH a regular replica AND its EC shards in the same volume
list, the canonical metric we pick is the regular replica and
IsECVolume is false. Add a second guard that checks the topology
directly via `findExistingECShards` (already present and indexed) and
skip the volume when any shards exist, logging a warning that points
the admin at the stuck source.

This breaks the loop. Auto-cleanup of the orphaned replica is left as
follow-up work — deleting a source replica from inside the detector is
only safe with a re-verification step right before the delete, plus a
config opt-in, and is best done in its own change.

* fix(ec): #9448 guard only fires when EC shard set is complete

The first version of the #9448 guard tripped on `len(existingShards) > 0`,
which is broader than necessary. The existing recovery branch in the
encode arm (around the `existingECShards` block, ~line 216) is designed
to fold partial leftover shards from a previously failed encode into
the new task as cleanup sources. Skipping unconditionally on any
existing shards made that branch dead code, regressing the recovery
behavior Gemini flagged in the review of af09e1ec7.

Two corrections:

  1. New helper `countExistingEcShardsForVolume` walks each disk's
     `EcIndexBits` bitmap and ORs the results into a `ShardBits`,
     returning the distinct-shard popcount. This is the right unit:
     a single `VolumeEcShardInformationMessage` can carry several
     shards, so `len(EcShardInfos)` is not the same as the number
     of present shards. Per Gemini's "use helper functions that walk
     the actual shard bitmap" note.
  2. The guard now fires only when `shardCount >= totalShards`.
     Partial shard sets fall through to the existing recovery branch,
     unchanged.

Tests:
  - TestDetectionSkipsWhenECShardsAlreadyExist: complete shards →
    no proposal (the regression test for #9448 itself, unchanged
    intent, rewritten on top of new helpers).
  - TestDetectionAllowsRegularReplicaWhenShardsPartial: partial
    shards → guard does NOT swallow the volume; the encode arm
    still gets a chance.
  - TestCountExistingEcShardsForVolume: the helper walks the
    bitmap correctly even when one info entry packs multiple
    shards on one disk.

The dangerous `volume.delete` hint in the warning is unchanged for
now — it gets fixed in the next commit.

* fix(ec): drop dangerous shell-command hint from #9448 warning

The previous warning told operators to run `volume.delete -volumeId=%d`
in the SeaweedFS shell to clean up the orphaned source replica. That
command is cluster-wide — it deletes every replica of the volume,
including the EC shards, which share the same volume id. Running it
in the state the message describes would cause the data loss the
guard exists to prevent.

Replace it with explicit guidance that the cleanup must be a targeted
VolumeDelete RPC against the source server only, and that the
shell command is the exact wrong thing to use here. The next two
commits add the plumbing and the auto-execution of that targeted
delete so most operators never see this hint at all.

Per Gemini comment on af09e1ec7.

* feat(worker): plumb grpc dial option through ClusterInfo

Add ClusterInfo.GrpcDialOption (optional) and set it in the
erasure_coding plugin handler. Lets the detector make targeted
gRPC calls during detection — used by the follow-up commit to
auto-clean orphan source replicas via VolumeDelete RPCs.

Zero-value safe: existing detectors that don't need RPC access
get a nil DialOption and ignore the field.

* feat(ec): auto-clean orphan source replica via targeted VolumeDelete

Builds on the previous commits: the guard now identifies the
#9448 stuck-source state and a gRPC dial option is available on
ClusterInfo. When both are true, detection auto-cleans the
orphaned regular replica instead of just warning the operator.

New helper `cleanupOrphanSourceReplicas`:

  1. Re-verifies the EC shard set is still complete via
     `countExistingEcShardsForVolume` against the live topology
     snapshot. If the count dropped between detection start and
     the cleanup decision (a volume server going down mid-cycle),
     it aborts — the source replica is the only complete copy and
     deleting it without a healthy shard set would be data loss.
  2. Issues targeted VolumeDelete RPCs to each regular-replica
     server via `operation.WithVolumeServerClient`. That RPC only
     touches the regular volume on the targeted server; EC shards
     live in a separate store path and are not affected. This is
     the safe alternative to the cluster-wide `volume.delete`
     shell command we previously warned against.

If the cleanup partially fails (one replica delete errors, others
succeed), detection logs the failure and continues to skip the
volume. The next detection cycle will try again. We deliberately
don't fall back to a re-encode because that would just collide
with the mounted shards on the targets again.

When no dial option is available the existing warning still
points operators at the safe manual procedure.
2026-05-11 23:12:57 -07:00
Chris Lu
532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, master heartbeats wrongly reported under the target
disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
2026-05-11 20:21:50 -07:00
Chris Lu
884b0bcbfd feat(s3/lifecycle): cluster rate-limit allocation (Phase 3) (#9456)
* feat(s3/lifecycle): cluster rate-limit allocation (Phase 3)

Admin computes a per-worker share of cluster_deletes_per_second at
ExecuteJob time and ships it to the worker via
ClusterContext.Metadata. The worker reads the share, constructs a
golang.org/x/time/rate.Limiter, and passes it to dailyrun.Run via
cfg.Limiter (Phase 2 already plumbed the field). Phase 5 deletes the
streaming path; until then streaming ignores the cap.

Why allocate at admin: the cluster cap is a single knob operators
care about. Dividing it locally per worker would either need
out-of-band coordination or accept N× the configured budget. Admin
is the only party that knows how many execute-capable workers there
are, so it owns the math.

Admin side (weed/admin/plugin):
- Registry.CountCapableExecutors(jobType) returns the number of
  non-stale workers with CanExecute=true.
- New file cluster_rate_limit.go: decorateClusterContextForJob clones
  the input ClusterContext and injects two metadata keys for
  s3_lifecycle. cloneClusterContext duplicates Metadata so per-job
  decoration doesn't race shared base state.
- executeJobWithExecutor calls the decorator after loading the admin
  config; other job types pass through unchanged.

Worker side (weed/worker/tasks/s3_lifecycle):
- New cluster_rate_limit.go declares the constants both sides agree
  on (admin-config field names, metadata keys). Plain strings on the
  admin side keep weed/admin/plugin free of a dependency on the
  s3_lifecycle worker package; the two sets of constants are pinned
  to identical values and a mismatch would silently disable rate
  limiting.
- handler.go executeDailyReplay reads ClusterContext.Metadata,
  builds a rate.Limiter, and passes it into dailyrun.Config{Limiter}.
  Missing/empty/non-positive values → no limiter (legacy unlimited
  behavior). burst defaults to 2 × rate, clamped to ≥1 to avoid a
  bucket that never refills.
- Admin form gains two fields under "Scope": cluster_deletes_per_second
  (rate, 0 = unlimited) and cluster_deletes_burst (0 = 2 × rate).

Metric:
- New S3LifecycleDispatchLimiterWaitSeconds histogram observes how
  long each Limiter.Wait blocks before a LifecycleDelete RPC.
  Operators tune the cap by reading p95 — near-zero means the cap
  isn't binding, a long tail at 1/rate means it is.

Tests:
- weed/admin/plugin/cluster_rate_limit_test.go: 9 cases covering
  pass-through for non-allocator job types, rps=0 / no-executors
  skip, even sharing, burst sharing, burst=0 omit (worker default
  kicks in), burst floor of 1, no mutation of input metadata, nil
  input.
- weed/worker/tasks/s3_lifecycle/cluster_rate_limit_test.go: 7 cases
  covering nil/empty/missing metadata, non-positive/invalid rate,
  positive rate builds correctly, burst missing defaults to 2× rate,
  tiny rate clamps burst to ≥1.

Build clean. Phase 2 (#9446) and Phase 4 engine (#9447) are the
parents; this branch stacks on Phase 2 since it consumes
dailyrun.Config{Limiter} which lands there.

* fix(s3/lifecycle): divide cluster budget by active workers, not all capable

gemini pointed out that s3_lifecycle has MaxJobsPerDetection=1
(handler.go:189) — it's a singleton job, only one worker is ever active.
Dividing the cluster_deletes_per_second budget by the count of capable
executors gave the single active worker just 1/N of the configured cap.

Pass adminRuntime.MaxJobsPerDetection through to the decorator. Divisor
is now min(executors, maxJobsPerDetection), clamped to >=1. For
s3_lifecycle (maxJobs=1) the active worker gets the full budget; for a
hypothetical parallel-dispatch job (maxJobs>1) the budget divides
across the running-set.

Tests swap the SharedEvenly case for two pinned scenarios:
  - SingletonJobGetsFullBudget: maxJobs=1 across 4 executors => 100/1
  - SharedEvenlyWhenParallelLimited: maxJobs=4 across 4 executors => 25/worker
  - MaxJobsExceedsExecutors: maxJobs=10 across 4 executors => divisor 4

* feat(s3/lifecycle): drop Worker Count knob from admin config form

The "Worker Count" admin field controlled in-process pipeline goroutines
across the 16-shard space — per-worker tuning, not a cluster-wide scope
concern. Operators looking at the form alongside Cluster Delete Rate
reasonably misread it as the number of workers in the cluster.

Drop the form field and DefaultValues entry. cfg.Workers is now hardcoded
to shardPipelineGoroutines (=1) inside ParseConfig; the rest of the
plumbing through dailyrun.Config.Workers stays so a future need can
re-introduce it as a worker-local knob (or just bump the constant).

handler_test.go pins that "workers" must NOT appear in the form so the
removal doesn't silently regress.
2026-05-11 19:17:06 -07:00
Chris Lu
122ca7c020 feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2) (#9446)
* docs(s3lifecycle): design for daily-replay worker

Captures the algorithm and dev plan iterated on in PR #9431 and the
discussion leading up to it: per-shard daily meta-log replay, walker
as a per-day pass for ExpirationDate/ExpiredDeleteMarker/NewerNoncurrent
plus a recovery branch over engine.RecoveryView(snap), explicit
retention-window input to RulesForShard, two cursor hashes
(ReplayContentHash + PromotedHash) that together detect every
invalidation case. Implementation phases are sequenced so each can
ship independently — Phase 1 (noncurrent_since stamp) just landed.

* feat(s3/lifecycle): daily-replay worker behind algorithm flag (Phase 2)

New weed/s3api/s3lifecycle/dailyrun package implementing the bounded
daily meta-log scan from the design doc. One pass per Execute per
shard: load cursor, scan events forward, route each through router.Route,
dispatch any due Match, advance the cursor on success. Halt-on-failure
keeps the cursor at the last fully-processed event so tomorrow resumes
from the same point — head-of-line blocking is the deliberate failure
signal.

Replay-only in this phase. Phase 4 wires the walker for ExpirationDate,
ExpiredDeleteMarker, NewerNoncurrent, and scan_only-promoted rules.
Until then a typed UnsupportedRuleError refuses runs on those buckets:
operators see the rejection in the activity log rather than silently
losing rules.

Behavior:
- Per-shard cursor {TsNs, RuleSetHash, PromotedHash} JSON-persisted
  under /etc/s3/lifecycle/daily-cursors/. PromotedHash always-empty in
  Phase 2; Phase 4 turns it on.
- Rule-change branch rewinds cursor to now - max_ttl when the
  replay-content hash mismatches. Cold start uses the same floor.
- Transport errors retry 3x with exponential backoff capped at 5s;
  server outcomes (RETRY_LATER / BLOCKED) halt the run without retry.
- Empty-replay sentinel: cursor TsNs=0 when no replay-eligible rules
  exist, only the hash gates a future addition.

Worker shape:
- New admin config field "algorithm" with enum streaming|daily_replay,
  default streaming. Existing deployments are unaffected.
- handler.Execute branches on the flag: streaming routes through the
  current scheduler.Scheduler, daily_replay routes through
  dailyrun.Run.
- dispatcher.NewFilerSiblingLister exported so both paths share the
  same .versions/ + null-bare lookup.

Engine integration:
- Local replayContentHash + maxEffectiveTTL helpers in dailyrun. Phase
  4's engine surface (ReplayContentHash, MaxEffectiveTTL) will replace
  them with one-line redirects; the local versions hash the same
  fields so the cursor stays valid across the swap.

Tests cover cursor persistence, unsupported-rule rejection,
hash stability under rule reordering, hash sensitivity to TTL edits,
max-TTL aggregation, dispatch retry budget, and request shape
including the identity-CAS witness.

Includes the design doc at weed/s3api/s3lifecycle/DESIGN.md so reviewers
and future phases share the same spec.

* feat(s3/lifecycle): default to daily_replay; streaming becomes the fallback knob

The streaming dispatcher hasn't shipped to users yet, so there's no
backward-compat surface to preserve. Flip the algorithm default from
streaming to daily_replay so the new path is the standard from day
one. Streaming stays as an explicit opt-in escape hatch during the
Phase 4 walker rollout; Phase 5 deletes both the flag and the
streaming code.

Buckets whose lifecycle rules require walker-bound dispatch
(ExpirationDate, ExpiredDeleteMarker, NewerNoncurrent, scan_only)
will fail the daily_replay run with the existing
UnsupportedRuleError until Phase 4 walker integration ships. Operators
hitting that case can set algorithm=streaming until the follow-up
lands.

Updates the test for the default value and renames the
unknown-value-fallback case to reflect the new default.

* fix(s3/lifecycle/dailyrun): drop per-rule done flag — it suppressed due matches

The done map was keyed by ActionKey = {Bucket, RuleHash, ActionKind}.
That's only safe when each event produces at most one match per
ActionKey with a single deterministic due-time formula —
ExpirationDays and AbortMPU fit that shape because due_time
= ev.TsNs + r.days is monotonic in event TsNs.

But NoncurrentDays paired with NewerNoncurrentVersions > 0 (allowed
in Phase 2 since it compiles to ActionKindNoncurrentDays) routes
through routePointerTransitionExpand, which emits matches for every
noncurrent sibling — each with its own SuccessorModTime taken from
the demoting event for that specific sibling. A single event can
therefore produce two matches for the same ActionKey on different
objects with wildly different DueTimes.

With the old code, a not-yet-due sibling encountered first would set
done[ActionKey] = true and then the next sibling — even though its
DueTime had already passed — would be skipped. Future events for the
same rule would also be suppressed for the rest of the run. Objects
that should have been deleted weren't.

Fix: drop the early-stop optimization. Process every match
independently. A future-DueTime match is now silently skipped without
affecting any later match. The performance hit is small (Phase 2 is a
single bounded daily pass, and the rate limiter is the real
throughput governor); the correctness gain is non-negotiable.

Also fixes the inverted comment in processMatches that described the
old check as "due_time is past now" when it actually checked
DueTime.After(now) (i.e., NOT yet due).

Adds four targeted tests:
- not-yet-due match first in slice does not suppress two later
  due matches for the same rule;
- reversed slice ordering produces identical dispatch;
- BLOCKED outcome halts the loop before later due matches are sent;
- empty match slice is a no-op.

Phase 4's walker-and-recovery integration can revisit a
per-(rule, object) memoization if profiling argues for it.

* fix(s3/lifecycle/dailyrun): address PR review — cursor advance, mode gate, ctx cancel, snapshot consistency

Addresses PR #9446 review feedback. Eight distinct fixes:

1. CURSOR ADVANCEMENT (gemini, critical). The old code advanced the
   persisted cursor to lastOK = TsNs of the last event processed,
   including events whose matches were skipped as not-yet-due. Those
   skipped matches would never be re-scanned, so objects under
   long-TTL rules would never expire.

   Track a "stuck" flag in drainShardEvents: the first event with a
   skipped (future-DueTime) match stops cursorAdvanceTo from rising,
   but the loop keeps processing later events to dispatch any that ARE
   due. The persisted cursor sits at the last fully-processed event so
   tomorrow's run re-scans from the skipped event onward and the
   future-due matches get re-evaluated when they age in.

   processMatches now returns (skippedAny, halted, err) so the drain
   loop can tell apart "event fully drained" from "event had pending
   future-due matches."

2. MODE GATE (gemini). checkSnapshotForUnsupported only checked the
   ActionKind. A replay-eligible kind with Mode != ModeEventDriven
   (e.g. ModeScanOnly via retention promotion) passed the check but
   then got silently ignored by router.Route, which gates dispatch
   on Mode == ModeEventDriven. Reject loudly with the typed error
   so admin sees the rejection in the activity log.

3. WORKERS CONFIG (gemini). The handler hardcoded 16 concurrent shard
   goroutines regardless of cfg.Workers. Add a Workers field to
   dailyrun.Config and gate the goroutine fan-out on a semaphore of
   that size; the handler now passes cfg.Workers through.

4. SINGLE SNAPSHOT PER RUN (coderabbit). Run() validated against one
   snapshot but runShard() pulled a fresh cfg.Engine.Snapshot() per
   shard. Mid-run Compile would let shards process different rule
   sets. Capture snap at the top of Run, pass it down to every shard.

5. FROZEN runNow (coderabbit). drainShardEvents and processMatches
   accepted a `now func() time.Time` and called it multiple times.
   DueTime comparisons would slip as the run wore on. Capture runNow
   once at the top of Run and thread it through as a time.Time value.

6. CTX CANCELLATION (coderabbit). The drain loop's <-ctx.Done() case
   broke out of the loop and returned nil, marking interrupted runs as
   successful. Return ctx.Err() instead so the caller propagates the
   interrupt; cursorAdvanceTo carries whatever progress was made.

7. CURSOR LOAD VALIDATION (coderabbit + gemini). The persister silently
   accepted empty files, mismatched shard_ids, and hash slices shorter
   than 32 bytes (copy() would zero-pad). Each now returns a typed
   error so the run halts and an operator investigates rather than
   silently re-scanning from time zero or persisting a zero-padded
   hash that masks corruption forever.

8. DEAD BRANCH (coderabbit). The "lastOK < startTsNs → keep persisted"
   guard in runShard was unreachable because drainShardEvents
   initialized lastOK := startTsNs and only ever raised it. Removed;
   the new cursor-advancement semantics handle the "no events
   processed" case implicitly.

Plus markdown lint: DESIGN.md fenced code blocks now carry a `text`
language identifier to satisfy MD040.

Skipped from the review:
- gemini's "maxTTL == 0 incorrectly skips immediate expirations":
  actions with Days <= 0 don't compile to a CompiledAction (see
  weed/s3api/s3lifecycle/action_kind.go: `if rule.X > 0`). The new
  empty-replay sentinel uses `rsh == [32]byte{}` for clarity per
  gemini's suggested form, but the behavior is equivalent.

Tests added/updated:
- TestProcessMatches_AllDueNoSkippedFlag pins skippedAny=false when
  all matches are past their DueTime.
- TestCheckSnapshotForUnsupported_NonEventDrivenModeRejected pins
  the new Mode check.
- TestFilerCursorPersister_EmptyFileReturnsError,
  _ShardIDMismatchReturnsError, _HashLengthMismatchReturnsError pin
  the new validation rules.
- Existing process-matches tests reshaped for the
  (skippedAny, halted, err) return tuple.

Full build clean. Dailyrun + worker test packages green.
2026-05-11 18:07:17 -07:00
Chris Lu
b456628a7a ui(s3_lifecycle): plain-English labels for cadence fields
Dispatch Tick / Cursor Checkpoint Tick / Engine Refresh / Bootstrap Re-walk
are internal terms — operators tuning the form had to read the descriptions
to guess what each field meant. Renames the visible labels and the section
blurb; underlying field names are unchanged so stored configs still load.
2026-05-10 23:46:12 -07:00
Chris Lu
9d20e71883 test(s3/lifecycle): cover worker handler lookupBucketsPath (#9407)
Three branches: gRPC error from GetFilerConfiguration must propagate
(else Execute would proceed to dial S3 with an empty buckets path
and never dispatch); a non-empty DirBuckets is honored verbatim so
operators with a non-default layout aren't force-routed to /buckets;
an empty DirBuckets falls back to the documented "/buckets" default
rather than returning empty (which would route to root). stubFilerConfigClient
embeds filer_pb.SeaweedFilerClient so methods other than the one
under test panic if called — keeps the surface narrow.
2026-05-09 20:41:09 -07:00
Chris Lu
cb6e498e0b test(s3/lifecycle): pin Descriptor structural invariants (#9401)
* test(s3/lifecycle): pin Descriptor structural invariants

Pre-existing handler tests covered Capability and Detect; Descriptor
was previously untested. A drift between the form fields it advertises
and the defaults config.go reads silently breaks the admin UI in two
ways: the form renders blank (admin can't tune) or the worker clamps
to a hardcoded fallback ignoring the admin's edits. The new tests
catch both directions.

Pinned: jobType / DisplayName / Description / DescriptorVersion;
AdminConfigForm exposes a workers field whose default matches
defaultWorkers; WorkerConfigForm has a default and a field for every
cadence knob ParseConfig reads (dispatch_tick / checkpoint_tick /
refresh_interval / bootstrap_interval / max_runtime); AdminRuntime-
Defaults hits a daily cadence with bounded detection timeout and
single job per detection.

* test(s3/lifecycle): tighten Descriptor invariant assertions

Per gemini review on #9401: pin DetectionTimeoutSeconds to its exact
value (60) instead of ">0" so an accidental tweak is caught, and
assert WorkerConfigForm fields are INT64 (matching ParseConfig's
readInt64) so a STRING-type drift can't silently make the worker
ignore admin edits.
2026-05-09 20:34:17 -07:00
Chris Lu
c0cf1417f1 test(s3/lifecycle): cover worker handler Execute validation paths (#9398)
7 tests pin the Execute early-return surface that runs without a
filer or S3 dial: nil request / nil Job / nil sender all error;
foreign JobType errors with the offending name in the message; no
S3 endpoints in cluster context errors (Execute is stricter than
Detect — the admin shouldn't have routed the job there); missing
filer_grpc_address parameter errors (proposal must have been tampered
with or dropped); empty JobType is accepted as broadcast routing and
flows through to the next validation step. The dial path itself is
intentionally not covered here — those tests would need an in-process
gRPC server and belong with the integration suite.
2026-05-09 19:51:31 -07:00
Chris Lu
62e04623ce test(s3/lifecycle): cover worker handler Detect + helpers (#9396)
* test(s3/lifecycle): cover worker handler Detect + helpers

13 tests pin the worker-handler surface that runs without a live filer
or S3 server. Pure helpers: clusterS3Endpoints (nil context, empty
list, filter empty entries while preserving order, all-valid
passthrough); readString (missing key, nil ConfigValue, wrong kind
falls back, string returned). Capability advertises jobType with
single-job concurrency caps. Detect: nil request / nil sender / wrong
JobType all error; no S3 endpoints emits a 'skipped' activity and
completes with success; no filer addresses behaves the same; the
happy path proposes one job parameterized with the first filer
address; empty JobType is accepted (broadcast detect); a
SendProposals failure propagates without firing complete.

* test(s3/lifecycle): cover SendComplete error propagation in worker Detect

The recordingSender already supported forcing an err on SendComplete
via errOn, but no case exercised it. A SendComplete failure must
propagate so the admin learns the completion signal never landed;
proposals went out before the failure so they remain recorded.
2026-05-09 19:46:57 -07:00
Chris Lu
1854101125 feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386)
* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8)

scan_only actions only fire from the bootstrap walk: the engine
classifies a rule as scan_only when its retention horizon exceeds the
meta-log retention, so event-driven routing can't be trusted. Today
each bucket walks once per process, so a long-running worker never
revisits — scan_only retention only catches up when the worker
restarts.

Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk
(name -> completion time). KickOffNew now re-walks a bucket whose
last walk completed more than BootstrapInterval ago. Zero interval
preserves the legacy walk-once-per-process behavior so existing
deployments don't change cadence by default. walkBucket re-stamps
on success and clears the stamp on failure (via MarkDirty), so the
next KickOffNew picks failed walks back up.

Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and
a Now func() for testable time travel.

weed shell run-shard grows --bootstrap-interval (cadence knob) and
--force-bootstrap (drop in-memory state at startup so every bucket
walks again immediately, useful when a config change should take
effect without a restart).

Tests: cadence respected (skip inside interval, re-walk past it);
zero interval keeps once-per-process; MarkDirty forces re-walk
under a 24h interval; MarkAllDirty resets every record. The
fakeClock helper guards the test clock with a mutex so race-detector
runs are clean.

* fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag

Three issues with the Phase 8 cadence work as it landed:

1. lastWalk did double duty as both completed-walk timestamp and
   in-flight debounce. A walk that took longer than BootstrapInterval
   would have a fresh KickOffNew start a duplicate goroutine on the
   next refresh tick because the stamp from KickOffNew looked stale
   against the interval. Split into lastCompleted (set on success)
   and inFlight (set on dispatch, cleared after the walk goroutine
   returns success or failure). KickOffNew skips inFlight buckets
   regardless of cadence.

2. The cadence knob existed on `weed shell` but not on the production
   path: scheduler.Scheduler constructed BucketBootstrapper without
   BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had
   no field for it. Add Scheduler.BootstrapInterval, parse
   `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk-
   once-per-process; negative clamps to zero), and forward it from
   the handler. Tests cover default, override, clamp, and explicit-zero.

3. --force-bootstrap was a no-op: BucketBootstrapper is freshly
   allocated at command start, so MarkAllDirty on empty state does
   nothing, and the flag couldn't influence an already-running
   process anyway. Remove it; a real runtime trigger (SIGHUP, control
   RPC) is a separate change.

In-flight regression: a blockingInjector pins the first walk in
progress while the test advances the clock past the interval. The
second KickOffNew is a no-op (inFlight check). After release, the
post-completion KickOffNew within the interval is also a no-op.

* test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock

The cadence test polled listedN to know "the walk happened" — but
that fires once both list passes are issued, while the success-stamp
lands later, after walkBucketDir returns. A clock.Advance(30m)
between those two events would record the stamp at clock+30m
instead of T0; the next assertion would then see now.Sub(last) < 1h
and skip the expected re-walk. Tight in practice but exposed under
-race / load.

Add a waitForCompleted helper that polls b.lastCompleted directly,
and use it before each clock advance in both the cadence and zero-
interval tests.

* fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks

Two follow-ups on Phase 8.

The worker config descriptor had no bootstrap_interval_minutes field,
so the production operator UI couldn't enable the cadence — only the
internal ParseConfig + Scheduler wiring knew about it. Add the field
to the cadence section (MinValue=0 since 0 is the legacy default) and
include the default in DefaultValues so existing deployments see the
knob with the right preset.

MarkDirty / MarkAllDirty silently lost their effect when a walk was
in flight: the methods cleared lastCompleted, but the walk's success
path then wrote a fresh timestamp, hiding the operator's invalidation.
Track a pendingDirty set; the walk goroutine consumes the flag on
exit and skips the success stamp, so the next KickOffNew picks the
bucket up immediately.

Regression: pin a walk in progress with a blockingInjector, MarkDirty
the bucket, release the walk, and assert lastCompleted stayed empty
plus the next KickOffNew triggers a new walk inside the
BootstrapInterval window.

* refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty

These methods were the operator-hook half of Phase 8, but the only
caller (--force-bootstrap on the shell command) was removed when it
turned out to be a no-op against a freshly-allocated bootstrapper.
Nothing in production calls them anymore.

Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty
set, the dirty-suppression branch in walkBucket, and the three tests
that only exercised those methods. BootstrapInterval-driven
re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP,
control RPC) is a separate change with a real call site.
2026-05-09 13:42:31 -07:00
Chris Lu
fd463155e4 fix(ec): planner treats each (server, disk_id) as a distinct target (#9369) (#9371)
* fix(ec): planner treats each (server, disk_id) as a distinct target (#9369)

master_pb.DataNodeInfo.DiskInfos is keyed by disk type, so a volume
server with multiple physical disks of the same type collapses into a
single DiskInfo. Per-disk attribution survives only inside the
VolumeInfos[].DiskId / EcShardInfos[].DiskId records, and the active
topology never put it back together. The EC planner saw N candidates
instead of N×disks, returned a short plan, and createECTargets
round-robined extra shards onto the same (server, disk_id) — colliding
with the #9185 disk_id-aware ReceiveFile.

Reconstruct per-physical-disk view in UpdateTopology by splitting each
DiskInfo into one entry per observed disk_id, and index volumes / EC
shards by their own DiskId so lookups stay aligned. Refuse to plan an
EC task when fewer than totalShards distinct disks are available rather
than packing shards onto the same disk.
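
A minimal sketch of the per-disk split, using simplified stand-ins for the topology types (volumeInfo / diskInfo below are illustrative, not the master_pb messages):

```go
package main

import "fmt"

// Illustrative stand-ins for the master topology types: DiskInfos is
// keyed by disk type, so two physical disks of the same type collapse
// into one DiskInfo; only the per-volume DiskId survives.
type volumeInfo struct {
	ID     uint32
	DiskID uint32
}

type diskInfo struct {
	DiskType string
	Volumes  []volumeInfo
}

// splitByDiskID rebuilds a per-physical-disk view: one entry per
// observed disk_id, with volumes indexed back onto their own disk.
func splitByDiskID(d diskInfo) map[uint32][]volumeInfo {
	perDisk := map[uint32][]volumeInfo{}
	for _, v := range d.Volumes {
		perDisk[v.DiskID] = append(perDisk[v.DiskID], v)
	}
	return perDisk
}

func main() {
	d := diskInfo{
		DiskType: "hdd",
		Volumes: []volumeInfo{
			{ID: 1, DiskID: 0}, {ID: 2, DiskID: 1}, {ID: 3, DiskID: 1},
		},
	}
	perDisk := splitByDiskID(d)
	// The planner now sees 2 candidate disks instead of 1.
	fmt.Println(len(perDisk), len(perDisk[1]))
}
```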

Threads dataShards/parityShards through planECDestinations,
createECTargets and createECTaskParams so the helpers don't depend on
the OSS 10+4 constants — keeps enterprise merges clean.

* trim verbose comments

* align EC param signatures with enterprise

- dataShards/parityShards: uint32 → int (matches enterprise's ratio API)
- drop unused multiPlan from createECTaskParams
- minTotalDisks: total/parity+1 → ceil(total/parity), correct for non-default ratios

Reduces merge surface when this PR lands in seaweed-enterprise.
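
The minTotalDisks correction is plain integer ceiling division. A sketch showing why the old formula only happened to agree on the default ratio (ceilDiv is an illustrative helper):

```go
package main

import "fmt"

// ceilDiv computes ceil(a/b) in integer arithmetic.
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

func main() {
	// Default 10+4 ratio (total 14, parity 4): both formulas agree.
	fmt.Println(14/4+1, ceilDiv(14, 4))
	// Non-default 4+2 ratio (total 6, parity 2): total/parity+1 over-counts.
	fmt.Println(6/2+1, ceilDiv(6, 2))
}
```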
2026-05-08 12:59:02 -07:00
Chris Lu
5d43f84df7 refactor(plugin): rename detection_interval_seconds → detection_interval_minutes (#9366)
Minutes is the natural granularity for detection cadence — every
production handler already set the seconds field to a 60-multiple
(17*60, 30*60, 3600, 24*60*60). Switching to minutes drops the *60
arithmetic and matches the unit conventions used elsewhere in the
plugin worker forms.

- Proto: AdminRuntimeDefaults + AdminRuntimeConfig.detection_interval_*
  field renamed.
- Helpers: durationFromMinutes / minutesFromDuration alongside the
  existing seconds variants in plugin_scheduler.go.
- Handlers: vacuum, ec_balance, balance, erasure_coding, iceberg,
  admin_script, s3_lifecycle now declare DetectionIntervalMinutes.
- Admin: scheduler_status + types + UI templ + plugin_api.go pass
  through the new field; UI label and table cells switch to "min".
2026-05-08 10:33:02 -07:00
Chris Lu
7f254e158e feat(worker/s3_lifecycle): plugin handler with admin UI config (#9362)
* feat(s3/lifecycle): scheduler — N pipelines over an even shard split

Scheduler.Run spawns `Workers` Pipeline goroutines (one per configured
worker) plus one engine-refresh
ticker. Each worker owns a contiguous AssignShards(idx, total) slice of
[0, ShardCount) and runs Pipeline.Run with EventBudget bounding each
iteration; brief RetryBackoff between iterations avoids hot-loop on
errors. The refresh ticker rebuilds the engine snapshot from the filer's
bucket configs every RefreshInterval.
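
The contiguous even split can be sketched as below (assignShards is an illustrative reimplementation of the AssignShards contract described above, not the actual function):

```go
package main

import "fmt"

// assignShards returns worker idx's contiguous slice of [0, total),
// splitting as evenly as possible; the first total%workers workers
// each get one extra shard.
func assignShards(idx, workers, total int) (start, end int) {
	base := total / workers
	extra := total % workers
	if idx < extra {
		start = idx * (base + 1)
		end = start + base + 1
	} else {
		start = extra*(base+1) + (idx-extra)*base
		end = start + base
	}
	return start, end
}

func main() {
	// 10 shards over 3 workers: [0,4) [4,7) [7,10)
	for idx := 0; idx < 3; idx++ {
		s, e := assignShards(idx, 3, 10)
		fmt.Println(s, e)
	}
}
```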

LoadCompileInputs / IsBucketVersioned / AllActivePriorStates are
exported from a configload.go sibling so the shell command can move to
this shared implementation in a follow-up.

* refactor(shell): reuse scheduler.LoadCompileInputs in run-shard

Drop the local copies of loadLifecycleCompileInputs / isBucketVersioned
/ allActivePriorStates / lifecycleParseError that the new
scheduler package now exports. Same behavior, one source of truth.

* feat(worker/s3_lifecycle): plugin handler with admin UI config

Registers a JobHandler for s3_lifecycle via pluginworker.RegisterHandler.
Admin pulls the descriptor over the worker plugin gRPC and renders the
AdminConfigForm + WorkerConfigForm in the existing UI:

  Admin form (cluster shape):
    - workers (1..16, default 1)
    - s3_grpc_endpoints (comma list)

  Worker form (operational tuning):
    - dispatch_tick_ms (default 5000)
    - checkpoint_tick_ms (default 30000)
    - refresh_interval_ms (default 300000)
    - event_budget (default 0 = unbounded)

Detect emits a single proposal whenever S3 endpoints + filer addresses
are configured. MaxExecutionConcurrency=1 so admin only ever runs one
lifecycle daemon per worker; a fresh proposal next cycle restarts it
if the prior Execute exits.

Execute dials the configured S3 endpoint + filer, builds a
scheduler.Scheduler with the parsed config, and runs it until
ctx cancellation. Reuses the existing scheduler / dispatcher /
reader / engine packages — the handler is the thin glue that
parses descriptor values and wires the long-running daemon.

* proto(plugin): add s3_grpc_addresses to ClusterContext

So workers can dial s3 servers discovered by the master rather than a
hand-typed list in the admin form.

* feat(admin): populate ClusterContext.s3_grpc_addresses from master

ListClusterNodes(S3Type) returns the live S3 servers; the plugin
scheduler now hands these to job handlers alongside filer/volume
addresses.

* feat(worker/s3_lifecycle): discover s3 endpoints from cluster context

Drop the s3_grpc_endpoints admin form field and read the master-supplied
ClusterContext.S3GrpcAddresses instead. Operators no longer maintain a
hand-typed list, and a stale entry self-heals when the master's view
updates.

* feat(worker/s3_lifecycle): time-based runtime cap, friendlier cadence units

- dispatch_tick_minutes (was *_ms): minutes is the natural granularity
  for a daily batch; default 1 minute.
- checkpoint_tick_seconds: seconds for the durable cursor write; default
  30 seconds.
- refresh_interval_minutes: minutes for the engine snapshot rebuild.
- max_runtime_minutes replaces event_budget. Each daily run is bounded
  by wall clock — typical run wraps in well under an hour because the
  cursor persists and the meta-log streams fast. Default 60 minutes.
- AdminRuntimeDefaults.DetectionIntervalSeconds = 86400 so the admin
  schedules one job per day.
2026-05-08 10:30:02 -07:00
Chris Lu
4f79d8e358 feat(s3/lifecycle): bucket-level bootstrap walker (#9350)
* feat(worker): add TaskTypeS3Lifecycle constant

Single job type for the lifecycle worker; the S3LifecycleParams.Subtype
field (READ / BOOTSTRAP / DRAIN) dispatches inside the handler. The
"s3_lifecycle" string is already wired to LaneLifecycle in
admin/plugin/scheduler_lane.go so adding the constant doesn't change
runtime behavior — it lets future commits reference the type name
without sprinkling string literals.

* feat(s3/lifecycle): bucket-level bootstrap walker

Iterates entries in a bucket, evaluates every active ActionKey in the
engine snapshot against each entry, and dispatches inline-delete for
currently-due actions. Date-kind actions and pending_bootstrap actions
are skipped — the former are handled by their own SCAN_AT_DATE
bootstrap, the latter aren't IsActive() yet.

Walker is callback-driven so callers supply the listing source
(real filer_pb.SeaweedList or test fake) and the dispatcher (real
LifecycleDelete client or test fake). This keeps the walker free of
filer_pb dependencies and makes the per-action evaluation flow
unit-testable in isolation.

Checkpoint state (LastScannedPath, Completed) is returned to the
caller, who is responsible for persisting it under
/etc/s3/lifecycle/<bucket>/_bootstrap. Walk() honours opts.Resume so
a kill-resumed task picks up where the previous walker stopped.

Tests cover: prefix-mismatched skip, not-yet-due skip (reader's job),
date-kind skip, pending_bootstrap skip, multi-action rule (one rule
with three actions dispatches three times — the regression that
per-action keying fixes), dispatch error halts at last-successful
checkpoint, Resume skips entries up to and including the resume
path.

* test(s3/lifecycle): walker test uses bucket-scoped ActionKey

Mechanical follow-up to the bucket-scoped ActionKey on
lifecycle-engine: the bootstrap walker tests construct ActionKeys to
seed PriorStates and need the Bucket field to match what
engine.Compile keys against.

* fix(s3/lifecycle): walker quick wins

Two minor cleanups noted on review:

- Drop the redundant Resume re-filter inside the Walk callback.
  ListFunc's contract already promises "skip entries with Path <=
  start"; trusting that contract avoids divergence if the filter
  logic ever changes on one side and not the other.

- Hoist the ObjectInfo allocation out of the per-action loop in
  walkEntry. Multi-action rules previously allocated one ObjectInfo
  per (entry, kind) pair; now it's one per entry, reused across all
  matching kinds.

* fix(s3/lifecycle): walker Entry.NoncurrentIndex tracks ObjectInfo's *int

ObjectInfo.NoncurrentIndex is now *int so unset is unambiguous;
mirror that on bootstrap.Entry so the per-entry construction stays
type-clean. Phase 5 (versioned-bucket walks) is the first caller
that will populate the field.

* refactor(s3/lifecycle): trim narration from bootstrap walker

Drop the inline step-by-step on Walk and the multi-paragraph package
preamble; the function names already say it. Keep one-liner WHYs at
the SCAN_AT_DATE skip and the once-per-entry ObjectInfo build.

* fix(s3/lifecycle): walker skips directories and ModeDisabled actions

Two safety findings from review:

1. SeaweedFS directory entries can appear in the listing alongside
   objects; without an IsDirectory check the walker would treat a dir
   like any other entry and could dispatch a delete against it. Add
   IsDirectory to bootstrap.Entry and short-circuit it before
   walkEntry.

2. ModeDisabled is set by the operator (e.g. shell pause) independent
   of the XML rule's Status field. EvaluateAction gates on Status and
   would still fire for an operator-disabled action whose XML status
   is "Enabled". Skip ModeDisabled explicitly in walkEntry alongside
   the existing SCAN_AT_DATE skip.

Two regression tests pin both cases.

* perf(s3/lifecycle): reuse ObjectInfo across walker entries

Walker allocated one ObjectInfo struct per entry. For buckets with
millions of objects that's measurable GC pressure. Hoist the
allocation out of the per-entry callback (one per Walk) and reuse
via field assignment in walkEntry.

EvaluateAction reads ObjectInfo synchronously and doesn't retain a
reference, so the reuse is safe — the next iteration's overwrite
can't corrupt an in-flight evaluation.

* refactor(s3/lifecycle): trim narration on walker

Drop the multi-line Entry / ObjectInfo-reuse / SCAN_AT_DATE+DISABLED
explanations. The walker's structure is small enough that the
condition itself reads as the documentation.
2026-05-07 15:04:51 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.
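
The keep_remote_data plumbing reduces to one guard at the bottom of the delete path; a minimal sketch with illustrative types (not the actual Volume / BackendStorage API):

```go
package main

import "fmt"

// Illustrative backend: records whether the remote object was deleted.
type backend struct{ remoteDeleted bool }

func (b *backend) DeleteFile() { b.remoteDeleted = true }

type volume struct {
	hasRemoteFile bool
	backend       *backend
}

// destroy sketches the keep_remote_data flag: a move-driven delete
// passes keepRemoteData=true because the destination's .vif still
// references the cloud object; a real delete passes false.
func (v *volume) destroy(keepRemoteData bool) {
	if v.hasRemoteFile && !keepRemoteData {
		v.backend.DeleteFile()
	}
	// local .dat/.idx/.vif unlinks would happen here in either case
}

func main() {
	moved := &volume{hasRemoteFile: true, backend: &backend{}}
	moved.destroy(true) // balance / volume.move case
	fmt.Println(moved.backend.remoteDeleted)

	deleted := &volume{hasRemoteFile: true, backend: &backend{}}
	deleted.destroy(false) // real-delete case
	fmt.Println(deleted.backend.remoteDeleted)
}
```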

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
1f6f473995 refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages

Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.

Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.

  - activity.go: BuildExecutorActivity, BuildDetectorActivity
  - config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
  - interval.go: ShouldSkipDetectionByInterval
  - volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
  - collection_filter.go: CollectionFilterMode + consts
  - volume_metrics.go: export CollectVolumeMetricsFromMasters,
    MasterAddressCandidates, FetchVolumeList
  - testing_senders_test.go: shared test stubs

Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.

  weed/plugin/worker/vacuum_handler.go         -> weed/worker/tasks/vacuum/plugin_handler.go
  weed/plugin/worker/ec_balance_handler.go     -> weed/worker/tasks/ec_balance/plugin_handler.go
  weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
  weed/plugin/worker/iceberg/                   -> weed/worker/tasks/iceberg/

  weed/plugin/worker/handlers/handlers.go now blank-imports all five
  task subpackages so their init() registrations fire.

  weed/command/mini.go and the worker tests construct the handler with
  vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
  vacuum handler).

admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.

* refactor(worker): update test/plugin_workers imports for moved handlers

Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:

  pluginworker.NewVacuumHandler        -> vacuum.NewVacuumHandler
  pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
  pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler

The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.

* refactor(worker): update test/s3tables iceberg import path

The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.

Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.

* chore: gofmt PR-touched files

The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.

Touched files:
  test/plugin_workers/erasure_coding/{detection,execution}_test.go
  test/s3tables/maintenance/maintenance_integration_test.go
  weed/plugin/worker/handlers/handlers.go
  weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go

* refactor(worker): bounds-checked int conversions for plugin config values

CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.

Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:

  ReadInt32Config (math.MinInt32 .. math.MaxInt32)
  ReadUint32Config (0 .. math.MaxUint32)
  ReadIntConfig    (math.MinInt32 .. math.MaxInt32, platform-portable)

Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.

Touched files:
  weed/plugin/worker/config.go
  weed/worker/tasks/balance/plugin_handler.go
  weed/worker/tasks/erasure_coding/plugin_handler.go
  weed/worker/tasks/vacuum/plugin_handler.go

* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers

CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.

Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.

Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
2026-05-02 18:03:13 -07:00
Chris Lu
628363c4a6 fix(erasure_coding): surface replica delete failures from EC task (#9184) (#9187)
* test(erasure_coding): reproduce #9184 deleteOriginalVolume swallowing errors

ErasureCodingTask.deleteOriginalVolume logs a warning when any replica
VolumeDelete fails and then returns nil, so the EC task reports
success to the admin even when a source replica survives. That stale
replica lets a later detection scan re-propose the same volume and,
once retried, drives the mounted-shard-truncation corruption that
issue 9184 also describes.

Reproducer: wire one reachable replica (succeeds) and one unreachable
replica (fails) and assert the function currently returns nil. After
the fix the function must surface the replica failure so the task is
retried rather than marked done, and this test needs to be inverted.

* fix(erasure_coding): surface replica delete failures from EC task

ErasureCodingTask.deleteOriginalVolume previously logged a warning
and returned nil when any VolumeDelete against a source replica
failed. The EC task therefore reported overall success to the admin
even when a source replica stayed on disk, which let a later
detection scan propose a duplicate EC encoding of the same volume.
The retry then walked the ReceiveFile path against servers that
already had mounted EC shards for the volume, truncating the live
shard files in place (the other half of #9184).

This change returns an error describing the per-replica failures
after the best-effort delete pass, so the task is marked failed
instead of silently moving on. Successful deletes are still applied
(per-replica progress is preserved); only the final return changes.

When combined with the ReceiveFile mount-safety check, a stuck
original replica now produces loud, actionable failures instead of
silent corruption.

Tests:
- TestDeleteOriginalVolumeSurfacesReplicaFailures: asserts an error
  is returned and names the unreachable replica, while the reachable
  replica still gets deleted.
- TestDeleteOriginalVolumeSucceedsWhenAllReplicasReachable: pins the
  happy path.
2026-04-22 16:02:51 -07:00
Chris Lu
9d15705c16 fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C (#9112)
* fix(mini): shut down admin/s3/webdav/filer before volume/master on Ctrl+C

Interrupts fired grace hooks in registration order, so master (started
first) shut down before its clients, producing heartbeat-canceled errors
and masterClient reconnection noise during weed mini shutdown. Admin/s3/
webdav had no interrupt hooks at all and were killed at os.Exit.

- grace: execute interrupt hooks in LIFO (defer-style) order so later-
  started services tear down first.
- filer: consolidate the three separate interrupt hooks (gRPC / HTTP /
  DB) into one that runs in order, so filer shutdown stays correct
  independent of FIFO/LIFO semantics.
- mini: add MiniClientsShutdownCtx (separate from test-facing
  MiniClusterCtx) plus an OnMiniClientsShutdown helper. Admin, S3,
  WebDAV and the maintenance worker observe it; runMini registers a
  cancel hook after startup so under LIFO it fires first and waits up to
  10s on a WaitGroup for those services to drain before filer, volume,
  and master shut down.

Resulting order on Ctrl+C: admin/s3/webdav/worker -> filer (gRPC -> HTTP
-> DB) -> volume -> master.
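
The LIFO hook ordering is the core of the grace change; a minimal sketch (hooks/onInterrupt/fire are illustrative names, not the grace package API):

```go
package main

import "fmt"

// Sketch of LIFO (defer-style) interrupt-hook ordering: hooks
// registered later run first, so clients started after the master
// tear down before it.
type hooks struct{ fns []func() }

func (h *hooks) onInterrupt(fn func()) { h.fns = append(h.fns, fn) }

func (h *hooks) fire() {
	for i := len(h.fns) - 1; i >= 0; i-- {
		h.fns[i]()
	}
}

func main() {
	var h hooks
	for _, name := range []string{"master", "volume", "filer", "s3"} {
		n := name // capture per-iteration value for the closure
		h.onInterrupt(func() { fmt.Println("shutdown", n) })
	}
	h.fire() // prints s3, filer, volume, master in that order
}
```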

* refactor(mini): group mini-client shutdown into one state struct

The first pass spread the shutdown plumbing across three globals
(MiniClientsShutdownCtx, miniClientsWg, cancelMiniClients) and two
ctx-derivation sites (OnMiniClientsShutdown and startMiniAdminWithWorker).

Group into a private miniClientsState (ctx/cancel/wg) rebuilt per runMini
invocation, and chain its ctx from MiniClusterCtx so clients only observe
one signal. Tests that cancel MiniClusterCtx still trigger client
shutdown via parent-child propagation.

- resetMiniClients() installs fresh state at the top of runMini, so
  in-process test reruns don't inherit stale ctx/wg.
- onMiniClientsShutdown(fn) replaces the exported OnMiniClientsShutdown
  and only observes one ctx.
- trackMiniClient() replaces the manual wg.Add/Done dance for the admin
  goroutine.
- miniClientsCtx() gives the admin startup a ctx without re-deriving.
- triggerMiniClientsShutdown(timeout) is the interrupt hook body.

No behaviour change; existing tests pass.

* refactor: generalize shutdown ctx as an option, not a mini-specific helper

Several service files (s3, webdav, filer, master, volume) observed the
mini-specific MiniClusterCtx or called onMiniClientsShutdown directly.
That leaked mini orchestration into code that also runs under weed s3,
weed webdav, weed filer, weed master, and weed volume standalone.

Replace with a generic `shutdownCtx context.Context` field on each
service's Options struct. When non-nil, the server watches it and shuts
down gracefully; when nil (standalone), the shutdown path is a no-op.

Mini wires the contexts up from a single place (runMini):
 - miniMasterOptions/miniOptions.v/miniFilerOptions.shutdownCtx =
   MiniClusterCtx (drives test-triggered teardown)
 - miniS3Options/miniWebDavOptions.shutdownCtx = miniClientsCtx() (drives
   Ctrl+C teardown before filer/volume/master)

All knowledge of MiniClusterCtx now lives in mini.go.

* fix(mini): stop worker before clients ctx so admin shutdown isn't blocked

Symptom on Ctrl+C of a clean weed mini: mini's Shutting down admin/s3/
webdav hook sat for 10s then logged "timed out". Admin had started its
shutdown but was blocked inside StopWorkerGrpcServer's GracefulStop,
waiting for the still-connected worker stream. That in turn left filer
clients connected and cascaded into filer's own 10s gRPC graceful-stop
timeout.

Two causes, both fixed:

1. worker.Stop() deadlocked on clean shutdown. It sent ActionStop (which
   makes managerLoop `break out` and exit), then called getTaskLoad()
   which sends to the same unbuffered cmd channel — no receiver, hangs
   forever. Reorder Stop() to snapshot the admin client and drain tasks
   BEFORE sending ActionStop, and call Disconnect() via the local
   snapshot afterwards.

2. Worker's taskRequestLoop raced with Disconnect(): RequestTask reads
   from c.incoming, which Disconnect closes, yielding a nil response and
   a panic on response.Message. Handle the closed channel explicitly.

3. Mini now has a preCancel phase (beforeMiniClientsShutdown) that runs
   synchronously BEFORE the clients ctx is cancelled. Register worker
   shutdown there so admin's worker-gRPC GracefulStop finds the worker
   already disconnected and returns immediately, instead of waiting on
   a stream that is about to close anyway.
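
The deadlock in cause 1 has a classic shape: a send on an unbuffered command channel after its only receiver has exited. A minimal sketch of the fixed ordering (query state before sending stop), with illustrative names:

```go
package main

import "fmt"

type action int

const (
	actionLoad action = iota
	actionStop
)

type request struct {
	action action
	reply  chan int
}

// managerLoop is the single receiver on cmd. Once it sees actionStop
// it returns, and any later send on cmd blocks forever.
func managerLoop(cmd chan request, load int) {
	for req := range cmd {
		if req.action == actionStop {
			return
		}
		req.reply <- load
	}
}

func main() {
	cmd := make(chan request) // unbuffered, as in the worker
	go managerLoop(cmd, 3)

	// Fixed ordering: drain state first, then send the stop command.
	reply := make(chan int)
	cmd <- request{action: actionLoad, reply: reply}
	fmt.Println(<-reply)
	cmd <- request{action: actionStop}
	// Sending actionLoad here instead would hang: no receiver remains.
}
```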

Observed shutdown of a clean mini: admin/s3/webdav down in <10ms; full
process exit in ~11s (the remaining 10s is a pre-existing filer gRPC
graceful-stop timeout, not cascaded from the clients tier).

* feat(mini): cap filer gRPC graceful stop at 1s under weed mini

Full weed mini shutdown was ~11s on a clean exit, dominated by the
filer's default 10s gRPC GracefulStop timeout while background
SubscribeLocalMetadata streams drained.

Expose the timeout as a FilerOptions.gracefulStopTimeout field (default
10s for standalone weed filer) and set it to 1s in mini. Clean weed mini
shutdown now takes ~2s.
2026-04-16 16:11:01 -07:00
Chris Lu
2fd60cfbc3 fix(balance): guard against destination overshoot and oscillation (#9090)
* fix(balance): guard against destination overshoot and oscillation

Plugin-worker volume_balance detection re-selects maxServer/minServer
each iteration based on utilization ratio. With heterogeneous
MaxVolumeCount values, a single greedy move can flip which server is
most-utilized, causing A->B, B->A oscillation within one detection
cycle and pushing destinations past the cluster ideal.

Mirror the shell balancer's per-move guard
(weed/shell/command_volume_balance.go:440): before scheduling a move,
verify that the destination's post-move utilization would not strictly
exceed the source's post-move utilization. If it would, no single move
can improve balance, so stop.
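
The guard itself is one comparison of post-move utilizations; a sketch with illustrative types (not the actual detection code):

```go
package main

import "fmt"

type server struct {
	volumes int
	max     int
}

// moveImprovesBalance mirrors the per-move guard described above:
// allow a move only if the destination's post-move utilization does
// not strictly exceed the source's post-move utilization.
func moveImprovesBalance(src, dst server) bool {
	srcAfter := float64(src.volumes-1) / float64(src.max)
	dstAfter := float64(dst.volumes+1) / float64(dst.max)
	return dstAfter <= srcAfter
}

func main() {
	// Heterogeneous caps: moving from a big half-full server onto a
	// small nearly-full one would overshoot, so the guard rejects it.
	fmt.Println(moveImprovesBalance(server{volumes: 50, max: 100}, server{volumes: 4, max: 5}))
	// A move between same-cap servers toward balance is allowed.
	fmt.Println(moveImprovesBalance(server{volumes: 80, max: 100}, server{volumes: 20, max: 100}))
}
```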

Add regression tests that cover:
- TestDetection_HeterogeneousMax_NoOvershootNoOscillation: 2 servers
  with different caps just above threshold; detection must not
  oscillate or make the imbalance worse.
- TestDetection_RespectsClusterIdealUtilization: 3-server heterogeneous
  layout; destinations must not overshoot cluster ideal.

* fix(balance): use effective capacity when resolving destination disk

resolveBalanceDestination read VolumeCount directly from the topology
snapshot, which is not updated when AddPendingTask registers a move
within the current detection cycle. This meant multiple moves planned
in a single cycle all saw the same static count and could target the
same disk past its effective capacity.

Switch to ActiveTopology.GetNodeDisks + GetEffectiveAvailableCapacity
so that destination planning accounts for all pending and assigned
tasks affecting the disk — consistent with how the detection loop
already tracks effectiveCounts at the server level.

Add a unit test that seeds two pending balance tasks against a
destination disk with 2 free slots and asserts resolveBalanceDestination
rejects a third planned move.

* fix(ec_balance): capacity-weighted guard in Phase 4 global rebalance

detectGlobalImbalance picked min/max nodes by raw shard count and
compared them against a simple (unweighted) rack-wide average. With
heterogeneous MaxVolumeCount across nodes in the same rack, this lets
the greedy algorithm move shards from a large, barely-used node to a
small, nearly-full node just because the small node has fewer shards
in absolute terms — strictly worsening imbalance by utilization and
potentially overfilling the small node.

Snapshot each node's total shard capacity (current shards plus free
slots) at loop start and add a per-move convergence guard: reject any
move where the destination's post-move utilization would strictly
exceed the source's post-move utilization. Mirrors the fix in
weed/worker/tasks/balance/detection.go.

Regression test TestDetectGlobalImbalance_HeterogeneousCapacity covers
a rack with node1 (cap 100, 10 shards → 10% util) and node2 (cap 5,
3 shards → 60% util). Before the fix, Phase 4 moves 2 shards from
node1 to node2, filling node2 to 100% util. After the fix, the guard
blocks both moves.

* fix(ec_balance): utilization-based max/min in Phase 4 rebalance

Phase 4's global rebalancer picked source and destination nodes by raw
shard count, and compared against a simple raw-count average. With
heterogeneous MaxVolumeCount across nodes in a rack, this got the
direction wrong: a large-capacity node holding many shards in absolute
terms but only a small fraction of its capacity would be picked as the
"overloaded" source, while a small-capacity node nearly at its slot
limit (but holding fewer absolute shards) would be picked as the
"underloaded" destination. The previous fix added a strict-improvement
guard that prevented the bad move but left balance untouched — the
rack stayed in an uneven state.

Switch to utilization-based selection and a utilization-based pre-check:
- Pick max/min by (count / capacity), where capacity is the node's
  current allowed shards plus remaining free slots (snapshotted once
  per rack and held constant for the duration of the loop).
- Replace the raw-count imbalance gate (exceedsImbalanceThreshold) with
  a new exceedsUtilImbalanceThreshold helper that compares fractional
  fullness. The raw-count gate is still used by Phase 2 and Phase 3,
  where the per-rack / per-volume semantics differ.
- Drop the raw-count guards (maxCount <= avgShards || minCount+1 >
  avgShards and maxCount-minCount <= 1) now that the per-move
  strict-improvement check handles termination correctly for both
  homogeneous and heterogeneous capacity.

Also fix a latent bug in the inner shard-selection loop: it was not
updating shardBits between iterations, so every iteration picked the
same lowest-set bit and emitted duplicate move requests for the same
physical shard. Update maxNode and minNode's shardBits immediately
after appending a move, mirroring what applyMovesToTopology does
between phases.

Update TestDetectGlobalImbalance_HeterogeneousCapacity to assert:
- Moves flow from the higher-util node2 to the lower-util node1
  (direction check), and
- Each (volumeID, shardID) pair appears at most once in the move list
  (duplicate-shard guard).
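The shardBits bug above can be illustrated with a plain uint32 bitmap (the real code uses the erasure_coding ShardBits type): each iteration must clear the bit it just consumed, otherwise the lowest-set-bit scan keeps returning the same shard id.

```go
package main

import (
	"fmt"
	"math/bits"
)

// pickShards selects up to n shard ids from a bitmap. Without the
// shardBits update inside the loop, every iteration would pick the
// same lowest-set bit and emit duplicate moves for one physical shard.
func pickShards(shardBits uint32, n int) []int {
	var picked []int
	for i := 0; i < n && shardBits != 0; i++ {
		id := bits.TrailingZeros32(shardBits) // lowest-set shard id
		picked = append(picked, id)
		shardBits &= shardBits - 1 // clear that bit — the missing update
	}
	return picked
}

func main() {
	// Shards 2, 5, and 7 present; plan moves for two of them.
	fmt.Println(pickShards(0b10100100, 2)) // [2 5]
}
```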

* fix(ec_balance): keep source freeSlots in sync after planned shard moves

All three phase loops that plan EC shard moves (detectCrossRackImbalance,
detectWithinRackImbalance, detectGlobalImbalance) decrement the
destination node's freeSlots but leave the source node's freeSlots
stale. Over the course of a detection run that processes many volumes
or iterates within a rack, the source's reported freeSlots drifts
below its actual value.

In Phase 4 specifically, the per-move strict-improvement guard prevents
the source from becoming a destination candidate, so the stale value
never affects decisions. In Phases 2 and 3 it can: a node that sheds
shards for one volume's rebalance is eligible as a destination for
another volume in the same run, and the destination selection uses
node.freeSlots <= 0 as a hard skip (findDestNodeInUnderloadedRack /
findLeastLoadedNodeInRack). A tightly-provisioned node could be
skipped as a destination even after it has freed slots.

Increment maxNode.freeSlots / node.freeSlots symmetrically at each
scheduled move so freeSlots remains an accurate running view of
available slot capacity throughout a detection run.
2026-04-15 12:47:59 -07:00
Lars Lehtonen
e0c361ec77 fix(weed/worker/tasks): log dropped errors (#9057) 2026-04-13 16:03:02 -07:00
Lars Lehtonen
259e365104 Prune weed/worker/tasks (#9011)
* chore(weed/worker/tasks): prune CommonConfigGetter type

* chore(weed/worker/tasks): prune BaseTask type
2026-04-09 19:00:06 -07:00
Lars Lehtonen
ab8c982cec Prune weed/worker/types (#8988)
* chore(weed/worker/types): prune unused BaseWorker type

* chore(weed/worker/types): prune unused UnifiedBaseTask type
2026-04-08 12:43:18 -07:00
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.
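The consistency walk can be sketched over already-parsed index entries (types here are illustrative; the real check parses the on-disk .idx format, where offsets are stored in 8-byte units, and sizes come from GetActualSize):

```go
package main

import "fmt"

type idxEntry struct {
	offset int64 // byte offset of the needle within .dat
	size   int64 // on-disk needle size including header and checksum
}

// verifyDatIdxConsistency fails if any index entry references bytes
// past the end of the .dat file — the signature of a write landing
// between the .dat and .idx copies.
func verifyDatIdxConsistency(datSize int64, entries []idxEntry) error {
	for i, e := range entries {
		if e.offset+e.size > datSize {
			return fmt.Errorf("idx entry %d: needle [%d,%d) extends past .dat size %d",
				i, e.offset, e.offset+e.size, datSize)
		}
	}
	return nil
}

func main() {
	entries := []idxEntry{{0, 100}, {100, 50}}
	fmt.Println(verifyDatIdxConsistency(140, entries)) // second entry overruns
	fmt.Println(verifyDatIdxConsistency(150, entries)) // consistent
}
```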

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this as the needle extends past the .dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
Chris Lu
b0a4647d87 fix: prevent stack overflow in ECBalanceTask.reportProgress (#8949)
* fix: prevent stack overflow in ECBalanceTask.reportProgress

Add re-entry guard to reportProgress() to prevent infinite recursion.
The progressCallback invoked by ReportProgressWithStage can re-enter
reportProgress, causing a stack overflow that crashes the worker process
(goroutine stack exceeds 1GB limit after ~22M frames).

* fix: use atomics for progress and re-entry guard to avoid data races

Address review feedback: GetProgress() can be called from a different
goroutine while reportProgress is updating the value. Use atomic
operations for both the progress field (via Float64bits/Float64frombits)
and the reporting re-entry guard (via CompareAndSwap).
2026-04-06 12:26:38 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
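A minimal sketch of the pattern (types are illustrative, not the actual streamSession): done is closed before outgoing, and senders check done before touching the outgoing channel, so a send to a closed session returns false instead of panicking.

```go
package main

import "fmt"

type streamSession struct {
	outgoing chan string
	done     chan struct{}
}

func (s *streamSession) close() {
	close(s.done) // signal senders first, so they stop using outgoing
	close(s.outgoing)
}

// send reports whether the message was accepted by the session.
func (s *streamSession) send(msg string) bool {
	// Fast-path check: a closed session must never reach the channel send.
	select {
	case <-s.done:
		return false
	default:
	}
	select {
	case s.outgoing <- msg:
		return true
	case <-s.done:
		// close(done) precedes close(outgoing), so a sender blocked
		// here wakes on done rather than writing to a closed channel.
		return false
	}
}

func main() {
	s := &streamSession{outgoing: make(chan string, 1), done: make(chan struct{})}
	fmt.Println(s.send("hello")) // true
	s.close()
	fmt.Println(s.send("late")) // false — no panic
}
```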
2026-04-03 16:04:27 -07:00
Lars Lehtonen
3a5016bcd7 fix(weed/worker/tasks/ec_balance): non-recursive reportProgress (#8892)
* fix(weed/worker/tasks/ec_balance): non-recursive reportProgress

* fix(ec_balance): call ReportProgressWithStage and include volumeID in log

The original fix replaced infinite recursion with a glog.Infof, but
skipped the framework progress callback. This adds the missing
ReportProgressWithStage call so the admin server receives EC balance
progress, and includes volumeID in the log for disambiguation.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-02 15:32:57 -07:00
Chris Lu
d074830016 fix(worker): pass compaction revision and file sizes in EC volume copy (#8835)
* fix(worker): pass compaction revision and file sizes in EC volume copy

The worker EC task was sending CopyFile requests without the current
compaction revision (defaulting to 0) and with StopOffset set to
math.MaxInt64.  After a vacuum compaction this caused the volume server
to reject the copy or return stale data.

Read the volume file status first and forward the compaction revision
and actual file sizes so the copy is consistent with the compacted
volume.

* propagate erasure coding task context

* fix(worker): validate volume file status and detect short copies

Reject zero dat file size from ReadVolumeFileStatus — a zero-sized
snapshot would produce 0-byte copies and broken EC shards.

After streaming, verify totalBytes matches the expected stopOffset
and return an error on short copies instead of logging success.

* fix(worker): reject zero idx file size in volume status validation

A non-empty dat with zero idx indicates an empty or corrupt volume.
Without this guard, copyFileFromSource gets stopOffset=0, produces a
0-byte .idx, passes the short-copy check, and generateEcShardsLocally
runs against a volume with no index.

* fix fake plugin volume file status

* fix plugin volume balance test fixtures
2026-03-29 18:47:15 -07:00
Chris Lu
9dd43ca006 fix balance fallback replica placement (#8824) 2026-03-29 00:05:42 -07:00
Lars Lehtonen
41aac90a9c chore(weed/worker): prune unused registerWorker() (#8799) 2026-03-27 07:36:55 -07:00
Chris Lu
2604ec7deb Remove min_interval_seconds from plugin workers; vacuum default to 17m (#8790)
remove min_interval_seconds from plugin workers and default vacuum interval to 17m

The worker-level min_interval_seconds was redundant with the admin-side
DetectionIntervalSeconds, complicating scheduling logic. Remove it from
vacuum, volume_balance, erasure_coding, and ec_balance handlers.

Also change the vacuum default DetectionIntervalSeconds from 2 hours to
17 minutes to match the previous default behavior.
2026-03-26 23:04:36 -07:00
Lars Lehtonen
9cc26d09e8 chore(weed/worker/tasks/erasure_coding): Prune Unused and Untested Functions (#8761)
* chore(weed/worker/tasks/erasure_coding): prune unused findVolumeReplicas()

* chore(weed/worker/tasks/erasure_coding): prune unused isDiskSuitableForEC()

* chore(weed/worker/tasks/erasure_coding): prune unused selectBestECDestinations()

* chore(weed/worker/tasks/erasure_coding): prune unused candidatesToDiskInfos()
2026-03-24 10:10:28 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.
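The split-then-merge strategy can be sketched as follows (illustrative, not the actual splitOversizedBin code — the real version merges into the smallest neighbor and tracks full file entries rather than bare sizes): cut a bin whenever it would exceed targetSize, then fold any bin with fewer than minFiles entries into a neighbor so no file is dropped.

```go
package main

import (
	"fmt"
	"sort"
)

func splitBin(sizes []int64, targetSize int64, minFiles int) [][]int64 {
	// Largest-first improves packing and makes map-order input deterministic.
	sort.Slice(sizes, func(i, j int) bool { return sizes[i] > sizes[j] })

	var bins [][]int64
	var cur []int64
	var curSize int64
	for _, s := range sizes {
		if curSize+s > targetSize && len(cur) > 0 {
			bins = append(bins, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, s)
		curSize += s
	}
	if len(cur) > 0 {
		bins = append(bins, cur)
	}

	// Fold any sub-minFiles bin into a neighbor instead of filtering it
	// out, so every eligible file survives the split.
	for i := 0; i < len(bins); {
		if len(bins[i]) >= minFiles || len(bins) == 1 {
			i++
			continue
		}
		j := i - 1
		if j < 0 {
			j = i + 1
		}
		bins[j] = append(bins[j], bins[i]...)
		bins = append(bins[:i], bins[i+1:]...)
		i = 0 // restart: merging changes neighbor sizes
	}
	return bins
}

func main() {
	// The [80,80,10,10] case from the commit message, targetSize=100,
	// minFiles=2: all four files are preserved.
	fmt.Println(splitBin([]int64{80, 80, 10, 10}, 100, 2))
}
```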

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
a838661b83 feat(plugin): EC shard balance handler for plugin worker (#8629)
* feat(ec_balance): add TaskTypeECBalance constant and protobuf definitions

Add the ec_balance task type constant to both topology and worker type
systems. Define EcBalanceTaskParams, EcShardMoveSpec, and
EcBalanceTaskConfig protobuf messages for EC shard balance operations.

* feat(ec_balance): add configuration for EC shard balance task

Config includes imbalance threshold, min server count, collection
filter, disk type, and preferred tags for tag-aware placement.

* feat(ec_balance): add multi-phase EC shard balance detection algorithm

Implements four detection phases adapted from the ec.balance shell
command:
1. Duplicate shard detection and removal proposals
2. Cross-rack shard distribution balancing
3. Within-rack node-level shard balancing
4. Global shard count equalization across nodes

Detection is side-effect-free: it builds an EC topology view from
ActiveTopology and generates move proposals without executing them.

* feat(ec_balance): add EC shard move task execution

Implements the shard move sequence using the same VolumeEcShardsCopy,
VolumeEcShardsMount, VolumeEcShardsUnmount, and VolumeEcShardsDelete
RPCs as the shell ec.balance command. Supports both regular shard
moves and dedup-phase deletions (unmount+delete without copy).

* feat(ec_balance): add task registration and scheduling

Register EC balance task definition with auto-config update support.
Scheduling respects max concurrent limits and worker capabilities.

* feat(ec_balance): add plugin handler for EC shard balance

Implements the full plugin handler with detection, execution, admin
and worker config forms, proposal building, and decision trace
reporting. Supports collection/DC/disk type filtering, preferred tag
placement, and configurable detection intervals. Auto-registered via
init() with the handler registry.

* test(ec_balance): add tests for detection algorithm and plugin handler

Detection tests cover: duplicate shard detection, cross-rack imbalance,
within-rack imbalance, global rebalancing, topology building, collection
filtering, and edge cases. Handler tests cover: config derivation with
clamping, proposal building, protobuf encode/decode round-trip, fallback
parameter decoding, capability, and config policy round-trip.

* fix(ec_balance): address PR review feedback and fix CI test failure

- Update TestWorkerDefaultJobTypes to expect 6 handlers (was 5)
- Extract threshold constants (ecBalanceMinImbalanceThreshold, etc.)
  to eliminate magic numbers in Descriptor and config derivation
- Remove duplicate ShardIdsToUint32 helper (use erasure_coding package)
- Add bounds checks for int64→int/uint32 conversions to fix CodeQL
  integer conversion warnings

* fix(ec_balance): address code review findings

storage_impact.go:
- Add TaskTypeECBalance case returning shard-level reservation
  (ShardSlots: -1/+1) instead of falling through to default which
  incorrectly reserves a full volume slot on target.

detection.go:
- Use dc:rack composite key to avoid cross-DC rack name collisions.
  Only create rack entries after confirming node has matching disks.
- Add exceedsImbalanceThreshold check to cross-rack, within-rack,
  and global phases so trivial skews below the configured threshold
  are ignored. Dedup phase always runs since duplicates are errors.
- Reserve destination capacity after each planned move (decrement
  destNode.freeSlots, update rackShardCount/nodeShardCount) to
  prevent overbooking the same destination.
- Skip nodes with freeSlots <= 0 when selecting minNode in global
  balance to avoid proposing moves to full nodes.
- Include loop index and source/target node IDs in TaskID to
  guarantee uniqueness across moves with the same volumeID/shardID.

ec_balance_handler.go:
- Fail fast with error when shard_id is absent in fallback parameter
  decoding instead of silently defaulting to shard 0.

ec_balance_task.go:
- Delegate GetProgress() to BaseTask.GetProgress() so progress
  updates from ReportProgressWithStage are visible to callers.
- Add fail-fast guard rejecting multiple sources/targets until
  batch execution is implemented.

Findings verified but not changed (matches existing codebase pattern
in vacuum/balance/erasure_coding handlers):
- register.go globalTaskDef.Config race: same unsynchronized pattern
  in all 4 task packages.
- CreateTask using generated ID: same fmt.Sprintf pattern in all 4
  task packages.

* fix(ec_balance): harden parameter decoding, progress tracking, and validation

ec_balance_handler.go (decodeECBalanceTaskParams):
- Validate execution-critical fields (Sources[0].Node, ShardIds,
  Targets[0].Node, ShardIds) after protobuf deserialization.
- Require source_disk_id and target_disk_id in legacy fallback path
  so Targets[0].DiskId is populated for VolumeEcShardsCopyRequest.
- All error messages reference decodeECBalanceTaskParams and the
  specific missing field (TaskParams, shard_id, Targets[0].DiskId,
  EcBalanceTaskParams) for debuggability.

ec_balance_task.go:
- Track progress in ECBalanceTask.progress field, updated via
  reportProgress() helper called before ReportProgressWithStage(),
  so GetProgress() returns real stage progress instead of stale 0.
- Validate: require exactly 1 source and 1 target (mirrors Execute
  guard), require ShardIds on both, with error messages referencing
  ECBalanceTask.Validate and the specific field.

* fix(ec_balance): fix dedup execution path, stale topology, collection filter, timeout, and dedupeKey

detection.go:
- Dedup moves now set target=source so isDedupPhase() triggers the
  unmount+delete-only execution path instead of attempting a copy.
- Apply moves to in-memory topology between phases via
  applyMovesToTopology() so subsequent phases see updated shard
  placement and don't conflict with already-planned moves.
- detectGlobalImbalance now accepts allowedVids and filters both
  shard counting and shard selection to respect CollectionFilter.

ec_balance_task.go:
- Apply EcBalanceTaskParams.TimeoutSeconds to the context via
  context.WithTimeout so all RPC operations respect the configured
  timeout instead of hanging indefinitely.

ec_balance_handler.go:
- Include source node ID in dedupeKey so dedup deletions from
  different source nodes for the same shard aren't collapsed.
- Clamp minServerCountRaw and minIntervalRaw lower bounds on int64
  before narrowing to int, preventing undefined overflow on 32-bit.

* fix(ec_balance): log warning before cancelling on progress send failure

Log the error, job ID, job type, progress percentage, and stage
before calling execCancel() in the progress callback so failed
progress sends are diagnosable instead of silently cancelling.
2026-03-14 21:34:53 -07:00
Chris Lu
2f51a94416 feat(vacuum): add volume state and location filters to vacuum handler (#8625)
* feat(vacuum): add volume state, location, and enhanced collection filters

Align the vacuum handler's admin config with the balance handler by adding:
- volume_state filter (ALL/ACTIVE/FULL) to scope vacuum to writable or
  read-only volumes
- data_center_filter, rack_filter, node_filter to scope vacuum to
  specific infrastructure locations
- Enhanced collection_filter description matching the balance handler's
  ALL_COLLECTIONS/EACH_COLLECTION/regex modes

The new filters reuse filterMetricsByVolumeState() and
filterMetricsByLocation() already defined in the same package.

* use wildcard matchers for DC/rack/node filters

Replace exact-match and CSV set lookups with wildcard matching
from util/wildcard package. Patterns like "dc*", "rack-1?", or
"node-a*" are now supported in all location filter fields for
both balance and vacuum handlers.

* add nil guard in filterMetricsByLocation
2026-03-13 23:41:58 -07:00
Chris Lu
89ccb6d825 use constants 2026-03-13 18:11:08 -07:00
Chris Lu
f48725a31d add more tests 2026-03-13 17:44:44 -07:00
Chris Lu
8056b702ba feat(balance): replica placement validation for volume moves (#8622)
* feat(balance): add replica placement validation for volume moves

When the volume balance detection proposes moving a volume, validate
that the move does not violate the volume's replication policy (e.g.,
ReplicaPlacement=010 requires replicas on different racks). If the
preferred destination violates the policy, fall back to score-based
planning; if that also violates, skip the volume entirely.

- Add ReplicaLocation type and VolumeReplicaMap to ClusterInfo
- Build replica map from all volumes before collection filtering
- Port placement validation logic from command_volume_fix_replication.go
- Thread replica map through collectVolumeMetrics call chain
- Add IsGoodMove check in createBalanceTask before destination use

* address PR review: extract validation closure, add defensive checks

- Extract validateMove closure to eliminate duplicated ReplicaLocation
  construction and IsGoodMove calls
- Add defensive check for empty replica map entries (len(replicas) == 0)
- Add bounds check for int-to-byte cast on ExpectedReplicas (0-255)

* address nitpick: rp test helper accepts *testing.T and fails on error

Prevents silent failures from typos in replica placement codes.

* address review: add composite replica placement tests (011, 110)

Test multi-constraint placement policies where both rack and DC
rules must be satisfied simultaneously.

* address review: use struct keys instead of string concatenation

Replace string-concatenated map keys with typed rackKey/nodeKey
structs to eliminate allocations and avoid ambiguity if IDs
contain spaces.

* address review: simplify bounds check, log fallback error, guard source

- Remove unreachable ExpectedReplicas < 0 branch (outer condition
  already guarantees > 0), fold bounds check into single condition
- Log error from planBalanceDestination in replica validation fallback
- Return false from IsGoodMove when sourceNodeID not found in
  existing replicas (inconsistent cluster state)

* address review: use slices.Contains instead of hand-rolled helpers

Replace isAmongDC and isAmongRack with slices.Contains from the
standard library, reducing boilerplate.
2026-03-13 17:39:25 -07:00
Chris Lu
47ddf05d95 feat(plugin): DC/rack/node filtering for volume balance (#8621)
* feat(plugin): add DC/rack/node filtering for volume balance detection

Add scoping filters so balance detection can be limited to specific data
centers, racks, or nodes. Filters are applied both at the metrics level
(in the handler) and at the topology seeding level (in detection) to
ensure only the targeted infrastructure participates in balancing.

* address PR review: use set lookups, deduplicate test helpers, add target checks

* address review: assert non-empty tasks in filter tests

Prevent vacuous test passes by requiring len(tasks) > 0
before checking source/target exclusions.

* address review: enforce filter scope in fallback, clarify DC filter

- Thread allowedServers into createBalanceTask so the fallback
  planner cannot produce out-of-scope targets when DC/rack/node
  filters are active
- Update data_center_filter description to clarify single-DC usage

* address review: centralize parseCSVSet, fix filter scope leak, iterate all targets

- Extract ParseCSVSet to shared weed/worker/tasks/util package,
  remove duplicates from detection.go and volume_balance_handler.go
- Fix metric accumulation re-introducing filtered-out servers by
  only counting metrics for servers that passed DC/rack/node filters
- Trim DataCenterFilter before matching to handle trailing spaces
- Iterate all task.TypedParams.Targets in filter tests, not just [0]

* remove useless descriptor string test
2026-03-13 17:03:37 -07:00
Chris Lu
2ff4a07544 Reduce task logger glog noise and remove per-write fsync (#8603)
* Reduce task logger noise: stop duplicating every log entry to glog and stderr

Every task log entry was being tripled: written to the task log file,
forwarded to glog (which writes to /tmp by default with no rotation),
and echoed to stderr. This caused glog files to fill /tmp on long-running
workers.

- Remove INFO/DEBUG forwarding to glog (only ERROR/WARNING remain)
- Remove stderr echo of every log line
- Remove fsync on every single log write (unnecessary for log files)

* Fix glog call depth for correct source file attribution

The call stack is: caller → Error() → log() → writeLogEntry() →
glog.ErrorDepth(), so depth=4 is needed for glog to report the
original caller's file and line number.
2026-03-11 12:42:18 -07:00
Chris Lu
b17e2b411a Add dynamic timeouts to plugin worker vacuum gRPC calls (#8593)
* add dynamic timeouts to plugin worker vacuum gRPC calls

All vacuum gRPC calls used context.Background() with no deadline,
so the plugin scheduler's execution timeout could kill a job while
a large volume compact was still in progress. Use volume-size-scaled
timeouts matching the topology vacuum approach: 3 min/GB for compact,
1 min/GB for check, commit, and cleanup.

Fixes #8591

* scale scheduler execution timeout by volume size

The scheduler's per-job execution timeout (default 240s) would kill
vacuum jobs on large volumes before they finish. Three changes:

1. Vacuum detection now includes estimated_runtime_seconds in job
   proposals, computed as 5 min/GB of volume size.

2. The scheduler checks for estimated_runtime_seconds in job
   parameters and uses it as the execution timeout when larger than
   the default — a generic mechanism any handler can use.

3. Vacuum task gRPC calls now use the passed-in ctx as parent
   instead of context.Background(), so scheduler cancellation
   propagates to in-flight RPCs.

* extend job type runtime when proposals need more time

The JobTypeMaxRuntime (default 30 min) wraps both detection and
execution. Its context is the parent of all per-job execution
contexts, so even with per-job estimated_runtime_seconds, jobCtx
would cancel everything when it expires.

After detection, scan proposals for the maximum
estimated_runtime_seconds. If any proposal needs more time than
the remaining JobTypeMaxRuntime, create a new execution context
with enough headroom. This lets large vacuum jobs complete without
being killed by the job type deadline while still respecting the
configured limit for normal-sized jobs.

* log missing volume size metric, remove dead minimum runtime guard

Add a debug log in vacuumTimeout when t.volumeSize is 0 so
operators can investigate why metrics are missing for a volume.

Remove the unreachable estimatedRuntimeSeconds < 180 check in
buildVacuumProposal — volumeSizeGB always >= 1 (due to +1 floor),
so estimatedRuntimeSeconds is always >= 300.

* cap estimated runtime and fix status check context

- Cap maxEstimatedRuntime and per-job timeout overrides to 8 hours
  to prevent unbounded timeouts from bad metrics.
- Check execCtx.Err() instead of jobCtx.Err() for status reporting,
  since dispatch runs under execCtx which may have a longer deadline.
  A successful dispatch under execCtx was misreported as "timeout"
  when jobCtx had expired.
2026-03-10 13:48:42 -07:00
Chris Lu
d89a78d9e3 reduce logs 2026-03-09 22:42:03 -07:00