Mirror of https://github.com/seaweedfs/seaweedfs.git, synced 2026-05-13 21:31:32 +00:00.
Branch: master, 1057 commits.

79859fc21d
feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next production incident faster to triage than the one we just spent hours on.
1. [versioning-heal] grep prefix on every heal-related log line, with a small fixed event vocabulary (produced / surfaced / healed / enqueue / drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed / teardown_failed / queue_full). One grep gives operators a single event stream across the produce-to-drain lifecycle.
2. Escalate the "scanned N>0 entries but no valid latest" case in updateLatestVersionAfterDeletion from V(1) Infof to a Warning that names the orphan entries it saw. This is the listing-after-rm inconsistency signature that pinned down 259064a8's failure — it should not be invisible at default log levels.
3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]` that walks .versions/ directories under a prefix and reports the stranded population. With -heal it clears the latest-version pointer in place on stranded directories so subsequent reads return a clean NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:
1. s3.versions.audit walked only the first 1024-entry page of each .versions/ directory, false-positiving "stranded" on large dirs. Loop until the page returns < 1024 entries, advancing startName.
2. clean and orphan-only categories double-counted when a directory had no pointer and at least one orphan: incremented both. Make them mutually exclusive so report totals sum to versionsDirs.
3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep, non-interruptible by ctx. A server shutdown / client disconnect would wait out the budget per in-flight delete. Thread ctx through deleteSpecificObjectVersion -> repointLatestBeforeDeletion / updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context(); gRPC lifecycle handlers pass the stream ctx. New test pins the behavior: cancelling ctx mid-backoff returns ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:
1. Successful pointer clear should suppress `produced`. updateLatestVersionAfterDeletion's transient-rm fallback called clearStaleLatestVersionPointer best-effort, then unconditionally returned retryErr. The caller (deleteSpecificObjectVersion) saw the error and emitted `event=produced` + enqueued the reconciler, even though clearStaleLatestVersionPointer had just driven the pointer to consistency and the next reader would get NoSuchKey via the clean-miss path. Make clearStaleLatestVersionPointer return cleared bool; on success the caller returns nil so neither produced nor the reconciler enqueue fires. Concurrent-writer aborts, re-scan errors, and CAS mismatches still report false so genuinely stranded state keeps surfacing.
2. Escape user-controlled fields in heal log lines. versioningHealInfof / Warningf / Errorf interpolated raw bucket / key / filename / err text into a single-space-separated line. An S3 key (or error string from gRPC) containing whitespace, newlines, or `event=...` could split one event into multiple tokens and spoof fake fields downstream. Sanitize each arg in the helper: safe values pass through; anything with whitespace, quotes, control chars, or backslashes is replaced with its strconv.Quote form. No caller changes — the format strings remain unchanged. Tests pin both behaviors: sanitization table covers the field boundary cases; an end-to-end shape test confirms a key containing `event=spoof` stays inside a single quoted token.
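For illustration, a minimal Go sketch of the field-sanitizing rule described above; the helper name and exact character set here are illustrative, not the shipped code:

```go
package main

import (
	"fmt"
	"strconv"
)

// sanitizeLogField keeps grep-able key=value log fields intact: any value that
// could split the space-separated event line, or smuggle a fake "event=" token,
// is replaced with its strconv.Quote form so it stays a single quoted token.
func sanitizeLogField(s string) string {
	for _, r := range s {
		if r == ' ' || r == '"' || r == '\\' || r < 0x20 {
			return strconv.Quote(s)
		}
	}
	return s
}

func main() {
	key := "photos/a b\nevent=spoof"
	fmt.Printf("[versioning-heal] event=produced bucket=%s key=%s\n",
		sanitizeLogField("media"), sanitizeLogField(key))
	// key=... is emitted as one quoted token, so the spoofed "event=" cannot
	// become a separate field downstream.
}
```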

b1d59b04a8
fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466:
1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output.
2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because:
- ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty.
- The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function.
Move the walker invocation INTO the empty-replay branch so walker-only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside the meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:
1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.
2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim.

---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
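A minimal Go sketch of the pass-teardown rule from the last two commits above, with hypothetical names: passCtx expiry is treated as a clean end of pass, and the cursor save gets its own short-lived context so a canceled pass cannot drop the advance.

```go
package dailyrun

import (
	"context"
	"errors"
	"time"
)

// finishPass is a sketch, not the shipped code: drainErr coming back as a
// context Deadline/Cancel is the normal end of a pass, and the cursor save
// runs under a fresh 5s context instead of the already-expired passCtx.
func finishPass(drainErr error, saveCursor func(context.Context) error) error {
	if drainErr != nil &&
		!errors.Is(drainErr, context.DeadlineExceeded) &&
		!errors.Is(drainErr, context.Canceled) {
		return drainErr // a real shard-level error, worth logging
	}
	saveCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return saveCursor(saveCtx) // persists even though the pass ctx is done
}
```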

7b8647e8bc
fix(shell): loop s3.lifecycle.run-shard so CI workflow stays alive (#9476)
The s3tests workflow (.github/workflows/s3tests.yml) backgrounds
`weed shell -c 's3.lifecycle.run-shard -shards 0-15 -s3 ... -refresh 2s'`
and then runs `kill -0 $pid` to confirm the worker stayed alive.
The PR-9475 restore ran dailyrun.Run once and exited cleanly — even
faster when no buckets had lifecycle rules yet ("nothing to run").
The aliveness check then failed and the s3tests job died with
"lifecycle worker died on startup". Caught on
https://github.com/seaweedfs/seaweedfs/actions/runs/25772523143/job/75698413401.
Fix:
- -refresh now drives an inter-pass loop. cadence=0 (default) is
one-shot, matching the test/s3/lifecycle/ integration-test
invocation that omits -refresh and expects synchronous return.
cadence>0 (the CI case) keeps the command alive until -runtime
expires, running a fresh dailyrun.Run on every tick.
- Each iteration re-loads bucket configs via
scheduler.LoadCompileInputs so rules created mid-run (the s3tests
flow creates rules AFTER the worker starts) get picked up.
- The "no rules; nothing to run" early return is gone — the
command stays alive even with an empty initial snapshot, waiting
for tests to add rules.
- -dispatch, -checkpoint, -bootstrap-interval stay accepted-but-
ignored (legacy streaming flags).
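A sketch of that inter-pass loop in Go, with hypothetical function arguments standing in for dailyrun.Run and the config reload; cadence maps to -refresh and runtime to -runtime:

```go
package shell

import (
	"context"
	"time"
)

// runShardLoop is illustrative only: cadence == 0 runs a single synchronous
// pass; cadence > 0 keeps the command alive until runtime expires, re-loading
// bucket configs and running a fresh pass on every tick, even when the initial
// snapshot has no lifecycle rules yet.
func runShardLoop(runtime, cadence time.Duration,
	loadConfigs func(context.Context) error,
	runPass func(context.Context) error) error {

	ctx, cancel := context.WithTimeout(context.Background(), runtime)
	defer cancel()

	for {
		if err := loadConfigs(ctx); err != nil {
			return err
		}
		if err := runPass(ctx); err != nil {
			return err
		}
		if cadence == 0 {
			return nil // one-shot: matches callers that omit -refresh
		}
		select {
		case <-ctx.Done():
			return nil // -runtime expired; exit cleanly
		case <-time.After(cadence):
		}
	}
}
```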

4ce027c2f3
fix(shell): restore s3.lifecycle.run-shard for CI/integration-test compatibility (#9475)
fix(shell): restore s3.lifecycle.run-shard as a dailyrun.Run wrapper

PR #9466 deleted weed/shell/command_s3_lifecycle_run_shard.go on the premise that it was a debug-only tool. It wasn't: the s3tests CI workflow (.github/workflows/s3tests.yml) and the test/s3/lifecycle/ integration tests invoke it via `weed shell` to drive lifecycle expirations on demand. Both started failing with "unknown command: s3.lifecycle.run-shard". This PR restores the command with the same flag set so existing callers (CI scripts and integration tests) work unchanged. The implementation no longer drives the streaming dispatcher.Pipeline + scheduler.BucketBootstrapper (deleted) — instead it does one bounded dailyrun.Run pass through the same daily-replay code path the production worker exercises. The walker fires for walker-bound rules just like in the worker. Obsolete streaming flags (-dispatch / -checkpoint / -refresh / -bootstrap-interval) are accepted-but-ignored so existing scripts don't need to drop them.

10cc06333b
cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command.

Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master.
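A rough Go sketch of the volume-server side of this admission gate; the type and method names are illustrative (the Rust crate mirrors the same shape with a HashSet plus a tokio RwLock):

```go
package volumeserver

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// pingGate is a sketch of the membership check described above: a Ping target
// is accepted only if it is a configured seed master or the heartbeat-tracked
// current master.
type pingGate struct {
	seedMasters   map[string]struct{} // frozen copy built once at startup
	mu            sync.RWMutex        // guards currentMaster vs. the heartbeat goroutine
	currentMaster string
}

func (g *pingGate) isKnownPingTarget(target string) bool {
	if _, ok := g.seedMasters[target]; ok {
		return true
	}
	g.mu.RLock()
	defer g.mu.RUnlock()
	return target != "" && target == g.currentMaster
}

func (g *pingGate) checkPing(target string) error {
	if !g.isKnownPingTarget(target) {
		return status.Errorf(codes.InvalidArgument, "ping target %q is not a known master", target)
	}
	return nil
}
```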

5004b4e542
feat(s3/lifecycle): delete streaming algorithm path (Phase 5b) (#9466)
* feat(s3/lifecycle): delete streaming algorithm path (Phase 5b)

Phase 5a (PR #9465) retired the algorithm flag and made daily_replay the only execution path. The streaming-side code (scheduler.Scheduler, scheduler.BucketBootstrapper, dispatcher.Pipeline, dispatcher.Dispatcher, dispatcher.FilerPersister, and their tests) has had no in-tree caller since then. This PR deletes it. Net change: ~4800 lines removed, ~130 added (the scheduler/configload tests' helper file the deleted bootstrap_test.go used to host).

Removed:
- weed/s3api/s3lifecycle/scheduler/{bootstrap,bootstrap_test,scheduler,scheduler_test,pipeline_fanout_test,refresh_default,refresh_s3tests}.go
- weed/s3api/s3lifecycle/dispatcher/{dispatcher,dispatcher_test,dispatcher_helpers_test,edge_cases_test,multi_shard_test,pipeline,pipeline_test,pipeline_helpers_test,toproto_test,dispatch_ticks_default,dispatch_ticks_s3tests}.go
- weed/s3api/s3lifecycle/dispatcher/filer_persister_test.go (FilerPersister deleted; FilerStore tests don't need their own file)
- weed/shell/command_s3_lifecycle_run_shard{,_test}.go (debug-only shell command that only ever wrapped the streaming pipeline; the production worker now exercises the same path every daily run)

Trimmed:
- dispatcher/filer_persister.go down to FilerStore + NewFilerStoreClient — the small interface daily_replay's cursor persister (dailyrun.FilerCursorPersister) plugs into.

Kept (still consumed by daily_replay):
- scheduler/configload.{go,_test.go} (LoadCompileInputs, AllActivePriorStates)
- dispatcher/sibling_lister.{go,_test.go} (NewFilerSiblingLister, FilerSiblingLister)
- dispatcher/filer_persister.go (FilerStore, NewFilerStoreClient)

scheduler/testhelpers_test.go restores fakeFilerClient, fakeListStream, dirEntry, fileEntry — helpers the configload tests used to share with the deleted bootstrap_test.go. Updates the handler-package doc strings and one reader-package comment that still named the streaming pipeline.

* fix(s3/lifecycle): hold lock through tree read in test filer client

gemini caught an inconsistency in scheduler/testhelpers_test.go: LookupDirectoryEntry reads c.tree under c.mu, but ListEntries was releasing the lock before reading c.tree. The map is effectively static during tests so there's no actual race today, but matching the convention keeps the helper safe if a future test mutates the tree mid-run.

2212cc8a5f
shell: expose retention flags on mq.topic.configure (#9416)
* shell: expose retention flags on mq.topic.configure
The ConfigureTopicRequest proto already carries a TopicRetention message
({retention_seconds, enabled}), and GetTopicConfigurationResponse / the
admin UI both surface it — but the `mq.topic.configure` shell command
sent Retention: nil unconditionally, leaving CLI and IaC users with no
way to set retention without going through the admin UI. Add three
flags so the shell matches the existing surface:
-retention <duration> Go duration string (e.g. 168h, 30m)
-retentionSeconds <int> raw seconds (mutually exclusive with -retention)
-retentionEnabled bool toggle for retention enforcement
Behavior:
- If none of the retention flags are set, the request omits Retention
(`nil`) — preserving the prior "leave server-side state alone"
semantics for callers that only care about partition count.
- Setting any retention flag populates TopicRetention with both fields
so existing operators keep working when only one is provided.
- -retention and -retentionSeconds together is a hard error (ambiguous);
negative values are rejected.
The new behavior is detected via flag.FlagSet.Visit so default values
of 0 / false are distinguishable from "not provided".
Help text updated with examples for both setting a TTL and disabling
retention on an existing topic. No proto / server changes needed; this
is a CLI-only patch.
* broker, shell: address review for mq.topic.configure retention
Four follow-up fixes for the review comments on the original commit:
1. Server-side: detect retention changes in the early-return path.
Previously ConfigureTopic returned without persisting when
partitionCount + schema were unchanged, even if the request supplied
a different Retention. Now the early-return branch checks both
schema and retention, persists either or both when they differ, and
logs which fields actually changed.
2. Server-side: preserve existing retention when request.Retention is
nil in the fresh-allocation path. Capture the prior resp.Retention
before overwriting `resp = &mq_pb.ConfigureTopicResponse{}`, and
only overwrite the new resp.Retention with `request.Retention` when
the request actually supplies one. Otherwise carry the previous
retention forward, so partition-count changes (or any path that
bypasses the early-return branch) don't accidentally clear retention.
3. CLI: switch the `-retention` / `-retentionSeconds` mutual-exclusion
check from value-based (`!= 0`) to flag.FlagSet.Visit-based, so an
explicit `-retention=0 -retentionSeconds=N` is also rejected.
4. CLI: when any retention flag is set, fetch the current
GetTopicConfiguration and fill in the fields the user didn't
explicitly provide. This means `-retentionEnabled` alone no longer
zeros the existing duration, and `-retention 24h` alone preserves
the existing enabled flag. When the topic doesn't exist yet, the
GetTopicConfiguration error is treated as "no current state" and
we proceed with just the user-supplied values.
Build / vet / gofmt / `go test ./weed/shell/... ./weed/mq/broker/...`
clean.
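The flag.FlagSet.Visit technique reads like this in a self-contained example; the flag names match the ones above, the rest is illustrative:

```go
package main

import (
	"flag"
	"fmt"
)

// Visit walks only the flags that were explicitly set on the command line, so
// a user-supplied -retentionSeconds=0 is distinguishable from the untouched
// default of 0, and -retention=0 still trips the mutual-exclusion check.
func main() {
	fs := flag.NewFlagSet("mq.topic.configure", flag.ContinueOnError)
	retention := fs.Duration("retention", 0, "retention as a Go duration, e.g. 168h")
	retentionSeconds := fs.Int64("retentionSeconds", 0, "retention in raw seconds")

	_ = fs.Parse([]string{"-retention=0", "-retentionSeconds=86400"})

	set := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { set[f.Name] = true })

	if set["retention"] && set["retentionSeconds"] {
		fmt.Println("error: -retention and -retentionSeconds are mutually exclusive")
		return
	}
	fmt.Println(*retention, *retentionSeconds)
}
```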

1854101125
feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8) (#9386)
* feat(s3/lifecycle): bootstrap re-walk cadence + operator hooks (Phase 8)

scan_only actions only fire from the bootstrap walk: the engine classifies a rule as scan_only when its retention horizon exceeds the meta-log retention, so event-driven routing can't be trusted. Today each bucket walks once per process, so a long-running worker never revisits — scan_only retention only catches up when the worker restarts. Replace BucketBootstrapper.known (set) with BucketBootstrapper.lastWalk (name -> completion time). KickOffNew now re-walks a bucket whose last walk completed more than BootstrapInterval ago. Zero interval preserves the legacy walk-once-per-process behavior so existing deployments don't change cadence by default. walkBucket re-stamps on success and clears the stamp on failure (via MarkDirty), so the next KickOffNew picks failed walks back up. Add MarkDirty / MarkAllDirty operator hooks for forced re-walks, and a Now func() for testable time travel. weed shell run-shard grows --bootstrap-interval (cadence knob) and --force-bootstrap (drop in-memory state at startup so every bucket walks again immediately, useful when a config change should take effect without a restart). Tests: cadence respected (skip inside interval, re-walk past it); zero interval keeps once-per-process; MarkDirty forces re-walk under a 24h interval; MarkAllDirty resets every record. The fakeClock helper guards the test clock with a mutex so race-detector runs are clean.

* fix(s3/lifecycle): split walk state, thread BootstrapInterval through worker, drop dead flag

Three issues with the Phase 8 cadence work as it landed:
1. lastWalk did double duty as both completed-walk timestamp and in-flight debounce. A walk that took longer than BootstrapInterval would have a fresh KickOffNew start a duplicate goroutine on the next refresh tick because the stamp from KickOffNew looked stale against the interval. Split into lastCompleted (set on success) and inFlight (set on dispatch, cleared after the walk goroutine returns success or failure). KickOffNew skips inFlight buckets regardless of cadence.
2. The cadence knob existed on `weed shell` but not on the production path: scheduler.Scheduler constructed BucketBootstrapper without BootstrapInterval, and weed/worker/tasks/s3_lifecycle/Config had no field for it. Add Scheduler.BootstrapInterval, parse `bootstrap_interval_minutes` in ParseConfig (zero = legacy walk-once-per-process; negative clamps to zero), and forward it from the handler. Tests cover default, override, clamp, and explicit-zero.
3. --force-bootstrap was a no-op: BucketBootstrapper is freshly allocated at command start, so MarkAllDirty on empty state does nothing, and the flag couldn't influence an already-running process anyway. Remove it; a real runtime trigger (SIGHUP, control RPC) is a separate change.
In-flight regression: a blockingInjector pins the first walk in progress while the test advances the clock past the interval. The second KickOffNew is a no-op (inFlight check). After release, the post-completion KickOffNew within the interval is also a no-op.

* test(s3/lifecycle): wait for lastCompleted stamp before advancing fake clock

The cadence test polled listedN to know "the walk happened" — but that fires once both list passes are issued, while the success-stamp lands later, after walkBucketDir returns. A clock.Advance(30m) between those two events would record the stamp at clock+30m instead of T0; the next assertion would then see now.Sub(last) < 1h and skip the expected re-walk. Tight in practice but exposed under -race / load. Add a waitForCompleted helper that polls b.lastCompleted directly, and use it before each clock advance in both the cadence and zero-interval tests.

* fix(s3/lifecycle): expose bootstrap interval in worker UI; honor MarkDirty during walks

Two follow-ups on Phase 8. The worker config descriptor had no bootstrap_interval_minutes field, so the production operator UI couldn't enable the cadence — only the internal ParseConfig + Scheduler wiring knew about it. Add the field to the cadence section (MinValue=0 since 0 is the legacy default) and include the default in DefaultValues so existing deployments see the knob with the right preset. MarkDirty / MarkAllDirty silently lost their effect when a walk was in flight: the methods cleared lastCompleted, but the walk's success path then wrote a fresh timestamp, hiding the operator's invalidation. Track a pendingDirty set; the walk goroutine consumes the flag on exit and skips the success stamp, so the next KickOffNew picks the bucket up immediately. Regression: pin a walk in progress with a blockingInjector, MarkDirty the bucket, release the walk, and assert lastCompleted stayed empty plus the next KickOffNew triggers a new walk inside the BootstrapInterval window.

* refactor(s3/lifecycle): drop unused MarkDirty / MarkAllDirty + pendingDirty

These methods were the operator-hook half of Phase 8, but the only caller (--force-bootstrap on the shell command) was removed when it turned out to be a no-op against a freshly-allocated bootstrapper. Nothing in production calls them anymore. Strip the dead surface: MarkDirty, MarkAllDirty, the pendingDirty set, the dirty-suppression branch in walkBucket, and the three tests that only exercised those methods. BootstrapInterval-driven re-bootstrap is the live mechanism. A real runtime trigger (SIGHUP, control RPC) is a separate change with a real call site.
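A compact Go sketch of the lastCompleted / inFlight bookkeeping described above; field and method names are approximations of the ones mentioned, not the exact code:

```go
package scheduler

import (
	"sync"
	"time"
)

// bootstrapState sketches the re-walk bookkeeping: lastCompleted drives the
// cadence, inFlight debounces walks that outlive the interval, and a zero
// interval keeps the legacy walk-once-per-process behavior.
type bootstrapState struct {
	mu            sync.Mutex
	lastCompleted map[string]time.Time
	inFlight      map[string]bool
	interval      time.Duration
	now           func() time.Time // injectable for tests
}

// shouldWalk reports whether KickOffNew should start a walk for bucket.
func (b *bootstrapState) shouldWalk(bucket string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.inFlight[bucket] {
		return false // a walk is already running, regardless of cadence
	}
	last, walked := b.lastCompleted[bucket]
	if !walked {
		return true
	}
	if b.interval == 0 {
		return false // legacy: walk once per process
	}
	return b.now().Sub(last) > b.interval
}
```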

05d31a04b6
fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite
The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.
Three changes that work together:
- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
-runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
shell command runs the full pipeline (reader + scheduler + dispatcher) for the
duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
out for now — versioned-bucket expiration via the worker still needs its own
pass.
Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count.
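Roughly how the build-tag-scaled "day" could look; the file split and placement here are illustrative, only the LifeCycleInterval / DaysToDuration names come from the commit text:

```go
package util

import "time"

// LifeCycleInterval is the length of one lifecycle "day". As described above
// it would live in two build-tag-split files, e.g.:
//
//	//go:build !s3tests  ->  const LifeCycleInterval = 24 * time.Hour
//	//go:build s3tests   ->  const LifeCycleInterval = 10 * time.Second
const LifeCycleInterval = 24 * time.Hour

// DaysToDuration converts a rule's Days value at the build's day scale, so a
// Days: 1 rule completes inside the s3tests suite's 30s polling window.
func DaysToDuration(days int32) time.Duration {
	return time.Duration(days) * LifeCycleInterval
}
```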

7f254e158e
feat(worker/s3_lifecycle): plugin handler with admin UI config (#9362)
* feat(s3/lifecycle): scheduler — N pipelines over an even shard split
Scheduler.Run spawns Workers Pipeline goroutines plus one engine-refresh
ticker. Each worker owns a contiguous AssignShards(idx, total) slice of
[0, ShardCount) and runs Pipeline.Run with EventBudget bounding each
iteration; brief RetryBackoff between iterations avoids hot-loop on
errors. The refresh ticker rebuilds the engine snapshot from the filer's
bucket configs every RefreshInterval.
LoadCompileInputs / IsBucketVersioned / AllActivePriorStates are
exported from a configload.go sibling so the shell command can move to
this shared implementation in a follow-up.
* refactor(shell): reuse scheduler.LoadCompileInputs in run-shard
Drop the local copies of loadLifecycleCompileInputs / isBucketVersioned
/ allActivePriorStates / lifecycleParseError that the new
scheduler package now exports. Same behavior, one source of truth.
* feat(worker/s3_lifecycle): plugin handler with admin UI config
Registers a JobHandler for s3_lifecycle via pluginworker.RegisterHandler.
Admin pulls the descriptor over the worker plugin gRPC and renders the
AdminConfigForm + WorkerConfigForm in the existing UI:
Admin form (cluster shape):
- workers (1..16, default 1)
- s3_grpc_endpoints (comma list)
Worker form (operational tuning):
- dispatch_tick_ms (default 5000)
- checkpoint_tick_ms (default 30000)
- refresh_interval_ms (default 300000)
- event_budget (default 0 = unbounded)
Detect emits a single proposal whenever S3 endpoints + filer addresses
are configured. MaxExecutionConcurrency=1 so admin only ever runs one
lifecycle daemon per worker; a fresh proposal next cycle restarts it
if the prior Execute exits.
Execute dials the configured S3 endpoint + filer, builds a
scheduler.Scheduler with the parsed config, and runs it until
ctx cancellation. Reuses the existing scheduler / dispatcher /
reader / engine packages — the handler is the thin glue that
parses descriptor values and wires the long-running daemon.
* proto(plugin): add s3_grpc_addresses to ClusterContext
So workers can dial s3 servers discovered by the master rather than a
hand-typed list in the admin form.
* feat(admin): populate ClusterContext.s3_grpc_addresses from master
ListClusterNodes(S3Type) returns the live S3 servers; the plugin
scheduler now hands these to job handlers alongside filer/volume
addresses.
* feat(worker/s3_lifecycle): discover s3 endpoints from cluster context
Drop the s3_grpc_endpoints admin form field and read the master-supplied
ClusterContext.S3GrpcAddresses instead. Operators no longer maintain a
hand-typed list, and a stale entry self-heals when the master's view
updates.
* feat(worker/s3_lifecycle): time-based runtime cap, friendlier cadence units
- dispatch_tick_minutes (was *_ms): minutes is the natural granularity
for a daily batch; default 1 minute.
- checkpoint_tick_seconds: seconds for the durable cursor write; default
30 seconds.
- refresh_interval_minutes: minutes for the engine snapshot rebuild.
- max_runtime_minutes replaces event_budget. Each daily run is bounded
by wall clock — typical run wraps in well under an hour because the
cursor persists and the meta-log streams fast. Default 60 minutes.
- AdminRuntimeDefaults.DetectionIntervalSeconds = 86400 so the admin
schedules one job per day.

85abf3ca88
feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch

Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume. Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence. Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket.

* test(s3/lifecycle): integration test driving the run-shard shell command

Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains. The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end. Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function.

* fix(shell/s3.lifecycle.run-shard): address review findings

- Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short.
- Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong.

* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription

Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline.
- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount.
Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop.

* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil

The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error).

* fix(s3/lifecycle): schedule from ModTime not eventTime

A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch.

* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards

The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline.
- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves.
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals).

* test(s3/lifecycle): pass -runtime 10s to run-shard

Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards.

* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg

The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it.

* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS

Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.

* fix(s3/lifecycle): identity HeadFid via GetFileIdString

Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.

* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit

errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly.

* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard

run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done.

* chore(test/s3/lifecycle): drop emoji from Makefile output

* docs(test/s3/lifecycle): correct '-shards 0-15' wording

* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run

The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99.

* fix(s3/lifecycle): bound drain + final-save with their own timeouts

Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning.

* fix(test/s3/lifecycle): drop self-killing pkill in stop-server

The pkill pattern "weed mini -dir=..." is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own.

* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var
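For reference, a sketch of what parsing the -shards "lo-hi" / "a,b,c" spec can look like; the real parseShardsSpec may differ in details:

```go
package shell

import (
	"fmt"
	"strconv"
	"strings"
)

// parseShardsSpec accepts either a contiguous range "lo-hi" or a
// comma-separated set "a,b,c" and validates every id against shardCount.
func parseShardsSpec(spec string, shardCount int) ([]int, error) {
	var ids []int
	if lo, hi, ok := strings.Cut(spec, "-"); ok && !strings.Contains(spec, ",") {
		start, err1 := strconv.Atoi(lo)
		end, err2 := strconv.Atoi(hi)
		if err1 != nil || err2 != nil || start > end {
			return nil, fmt.Errorf("invalid shard range %q", spec)
		}
		for i := start; i <= end; i++ {
			ids = append(ids, i)
		}
	} else {
		for _, part := range strings.Split(spec, ",") {
			id, err := strconv.Atoi(strings.TrimSpace(part))
			if err != nil {
				return nil, fmt.Errorf("invalid shard id %q", part)
			}
			ids = append(ids, id)
		}
	}
	for _, id := range ids {
		if id < 0 || id >= shardCount {
			return nil, fmt.Errorf("shard %d out of range [0,%d)", id, shardCount)
		}
	}
	return ids, nil
}
```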

35e3fe89bc
feat(s3/lifecycle): filer-backed cursor Persister + drop BlockerStore (#9358)
* feat(s3/lifecycle): filer-backed cursor Persister

FilerPersister persists per-shard cursor maps as JSON to /etc/s3/lifecycle/cursors/shard-NN.json via filer.SaveInsideFiler. One file per shard keeps Save atomic — the filer writes the entry in a single mutation, so a crash mid-write doesn't leak partial state. Pipeline.Run loads on start; the periodic checkpoint and graceful-shutdown save go through this implementation. A small FilerStore interface wraps the SeaweedFilerClient surface the persister needs, so tests inject an in-memory fake instead of mocking the full gRPC client.

* refactor(s3/lifecycle): drop BlockerStore — durable cursor IS the block

A frozen cursor doesn't advance, so the durable cursor (FilerPersister) encodes the blocked state on its own. On worker restart the reader re-encounters the poison event at MinTsNs, the dispatcher walks the same retry budget to BLOCKED, and the cursor freezes at the same EventTs. Other in-flight events between freeze tsNs and prior cursor positions self-resolve via NOOP_RESOLVED (STALE_IDENTITY) since the underlying objects were already deleted on the prior pass.
Removed:
- BlockerStore interface + InMemoryBlockerStore + BlockerRecord
- Dispatcher.Blockers + Dispatcher.ReplayBlockers
- the BlockerStore.Put call in handleBlocked
- Pipeline.Blockers field + the ReplayBlockers call on startup
Added a TestDispatchRestartReFreezesNaturally that pins the self-recovery property: a fresh Dispatcher with a fresh Cursor, fed the same poison event, reaches the same frozen state at the same EventTs without any durable blocker store. Operator visibility: a cursor whose MinTsNs hasn't advanced is the signal — surfaced via the durable cursor file.

* refactor(filer): SaveInsideFiler accepts ctx

ReadInsideFiler already takes ctx; SaveInsideFiler used context.Background() internally and silently dropped the caller's ctx. Symmetric API now; cancellation/deadlines propagate through LookupEntry / CreateEntry / UpdateEntry. Mechanical update of all callers — most pass context.Background() since the existing call sites have no ctx in scope.

* fix(s3/lifecycle): deterministic order in cursor save

Iterating Go maps yields random order, so json.Encode produced a different byte sequence on each save even when the state hadn't changed. Sort entries by (Bucket, ActionKind, RuleHash) before encoding so the on-disk file diffs cleanly. New test pins byte-identical output across two saves of the same map.

* fix(s3/lifecycle): log reason when freezing cursor in handleBlocked

handleBlocked dropped the reason via _ = reason with a comment claiming the caller logged it; none of the three callers do. A frozen cursor is the only surface where the operator finds out something stuck, so the reason has to land somewhere. glog.Warningf with shard, key, eventTs, and the original reason — same shape the rest of the package uses.
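A small Go sketch of the deterministic-save fix; the entry fields here are hypothetical, the point is the sort-before-encode:

```go
package dailyrun

import (
	"bytes"
	"encoding/json"
	"sort"
)

// cursorEntry / encodeCursors illustrate sorting by (Bucket, ActionKind,
// RuleHash) before encoding, so two saves of the same state are byte-identical
// and the on-disk file diffs cleanly.
type cursorEntry struct {
	Bucket     string `json:"bucket"`
	ActionKind string `json:"actionKind"`
	RuleHash   string `json:"ruleHash"`
	MinTsNs    int64  `json:"minTsNs"`
}

func encodeCursors(entries []cursorEntry) ([]byte, error) {
	sort.Slice(entries, func(i, j int) bool {
		a, b := entries[i], entries[j]
		if a.Bucket != b.Bucket {
			return a.Bucket < b.Bucket
		}
		if a.ActionKind != b.ActionKind {
			return a.ActionKind < b.ActionKind
		}
		return a.RuleHash < b.RuleHash
	})
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(entries); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}
```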

b9bf45cb2e
fix(shell): scope volume.fsck filer walk when -volumeId selects one bucketed collection (#9347)
* fix(shell): scope volume.fsck filer walk to the bucket when -volumeId selects one bucketed collection

Closes #9345. -volumeId only filtered which volume .idx files were pulled; the filer-side BFS still walked from "/", printing every directory under -v and making it look like the flag was ignored. When all requested volumes share a single non-empty collection that maps to an existing <bucketsPath>/<collection> directory, restrict the BFS root to that bucket. Empty-collection volumes or multi-collection selections fall back to the full walk, since chunks for those can live anywhere.

* trim comments

* address review: collapse getCollectFilerFilePath; unshadow receiver in loop

1c0e24f06a
fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the whole volume server down on boot, killing every healthy volume on it too. For remote-tiered volumes treat it as a per-volume load error so the server can come up and the operator can clean up the stray .vif. Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side guard ("does not move volume in remote storage" in command_volume_balance.go), so it scheduled moves on remote-tiered volumes. The "move" copies .idx/.vif to the destination and then calls Volume.Destroy on the source, which calls backendStorage.DeleteFile — deleting the remote object the destination's new .vif now points at. Populate HasRemoteCopy on the metrics emitted by both the admin maintenance scanner and the worker's master poll, then drop those volumes at the top of Detection. Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and shell volume.move) ran Volume.Destroy with no way to opt out of the remote-object cleanup. Volume.Destroy unconditionally calls backendStorage.DeleteFile for remote-tiered volumes, so a successful move would copy .idx/.vif to the destination and then nuke the cloud object the destination's new .vif was already pointing at. Add VolumeDeleteRequest.keep_remote_data and plumb it through Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The balance task and shell volume.move set it to true; the post-tier-upload cleanup of other replicas and the over-replication trim in volume.fix.replication also set it to true since the remote object is still referenced. Other real-delete callers keep the default. The delete-before-receive path in VolumeCopy also sets it: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by registering a local-disk-backed BackendStorage as the "remote" tier and exercising the real Volume / DiskLocation / EC encoder code paths. Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down
Tests run in-process and finish in under a second total — no cluster, binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume → Volume.destroy, and skip the s3-tier delete_file call when the flag is set. The pre-receive cleanup in volume_copy passes true for the same reason as the Go side: the inbound copy carries a .vif that may reference the same cloud object as the existing volume. The Rust loader already warns rather than fataling on a stray .vif without an .idx (volume.rs load_index_inmemory / load_index_redb), so no counterpart to the Go fatal-on-missing-idx fix is needed. Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:
- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored volumes with keepRemoteData=false, so a transient local fault on a remote-tiered volume would also nuke the cloud object. Track the delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile() for IO-error evictions; TTL-expired evictions still pass false.
- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but called them "parity" — by the klauspost/reedsolomon convention shards 0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1 are parity. Switch the loop to delete the parity range so the intent matches the indices.

---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

60c76120fc
fix(shell): use exact match for volume.balance -racks/-nodes filter (#9279)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume.

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a substring of the user-supplied flag value (e.g. -racks=rack10 also matched rack1). Replace it with an exact-match set parsed from the comma-separated flag value, and add regression tests for both -racks and -nodes paths. Also fix a small typo in the "remote storage" error returned by maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
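A sketch of the exact-match set filter; splitCSVSet is the name the commits mention, the signature here is illustrative:

```go
package shell

import "strings"

// splitCSVSet turns the comma-separated flag value into a set, so an exact
// lookup replaces the old strings.Contains check and -racks=rack10 no longer
// also matches rack1.
func splitCSVSet(flagValue string) map[string]struct{} {
	set := map[string]struct{}{}
	for _, v := range strings.Split(flagValue, ",") {
		if v = strings.TrimSpace(v); v != "" {
			set[v] = struct{}{}
		}
	}
	return set
}

// keepRack reports whether a rack passes the -racks filter; an empty filter
// keeps everything.
func keepRack(filter map[string]struct{}, rackID string) bool {
	if len(filter) == 0 {
		return true
	}
	_, ok := filter[rackID]
	return ok
}
```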

108e42fb8b
chore(shell): fix typo in EC shard helper param names (#9277)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

294f7c3d04
shell: expand ~ in local file path arguments (#9265)
* shell: expand `~` in local file path arguments

The weed shell parses commands itself instead of going through an OS shell, so a path like `~/Downloads/foo.meta` was passed verbatim to `os.Open`, which fails because no `~` directory exists. Users had to spell out absolute home paths in every command. Add an `expandHomeDir` helper that resolves a leading `~` or `~/...` to the user's home directory, and run user-supplied local file paths in the affected shell commands through it:
fs.meta.load (positional file)
fs.meta.save (-o)
fs.meta.changeVolumeId (-mapping)
s3.iam.export (-file)
s3.iam.import (-file)
s3.policy (-file)
s3tables.bucket (-file)
s3tables.table (-file, -metadata)
volume.fsck (-tempPath)
Filer-namespace path flags (`-dir`, `-path`, `-locationPrefix`, etc.) are unaffected; they live in the filer, not on the local FS.

* shell: reuse util.ResolvePath instead of a new helper

util.ResolvePath already does tilde expansion; drop the local expandHomeDir helper and route every shell call site through it.
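The expansion the final commit delegates to util.ResolvePath boils down to something like this sketch:

```go
package shell

import (
	"os"
	"path/filepath"
	"strings"
)

// expandHomeDir resolves a leading "~" or "~/..." against the user's home
// directory; any other path passes through unchanged.
func expandHomeDir(path string) string {
	if path != "~" && !strings.HasPrefix(path, "~/") {
		return path
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return path // no home dir known; leave the argument untouched
	}
	return filepath.Join(home, strings.TrimPrefix(path, "~"))
}
```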

21fadf5582
fix(shell): correct volume.list -writable filter unit and comparison (#9231)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

da2e90aefd
fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207)
* fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139)

A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold:
1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked.
2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it.
Fix:
- Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath.
- In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad.
Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231

* fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe)

Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way.

* refactor(pb): detect client-side marshal errors via errors.As, not substring

Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() *Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match.

* refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized

Addresses review feedback on PR #9207. Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document:
- SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special).
- FullPath.Sanitized: scrubs bytes in any component, not just the last.
- FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on.
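A condensed Go sketch of the errors.As carve-out described above; isClientSideMarshalError is the name the last commit mentions, the body here is an approximation:

```go
package pb

import (
	"errors"
	"strings"

	"google.golang.org/grpc/status"
)

// isClientSideMarshalError pulls the original *status.Status out of any
// fmt.Errorf("...: %w") wrapping via errors.As against the GRPCStatus()
// interface, then matches the library-owned message prefix. A plain error
// that merely contains the same text does not match, so a coincidental
// substring can never keep a cached connection alive.
func isClientSideMarshalError(err error) bool {
	var se interface{ GRPCStatus() *status.Status }
	if !errors.As(err, &se) {
		return false
	}
	return strings.HasPrefix(se.GRPCStatus().Message(), "grpc: error while marshaling")
}
```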

7364f148bd
fix(s3/shell): factor EC volumes into bucket size metrics and collection.list (#9182)
* fix(s3/shell): include EC volumes in bucket size metrics and collection.list

S3 bucket size metrics exported to Prometheus (and fed through stats.UpdateBucketSizeMetrics) are computed by collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos. As soon as a volume was encoded to EC it dropped out of every aggregate, so Grafana showed bucket sizes shrinking while physical disk usage kept climbing. The shell helper collectCollectionInfo — used by collection.list and s3.bucket.quota.enforce — had the same gap, with the EC branch left as a commented-out TODO. Fold EC shards into both paths using the same approach the admin dashboard already uses (PR #9093):
- PhysicalSize / Size sum across shard holders: EC shards are node-local (not replicas), so per-node TotalSize() and MinusParityShards().TotalSize() sum to the whole-volume physical and logical sizes respectively.
- FileCount is deduped via max across reporters (every shard holder reports the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and must not pin the aggregate).
- DeleteCount is summed (each delete tombstones exactly one node's .ecj).
- VolumeCount increments once per unique EC volume id.
Adds regression tests covering pure-EC, mixed regular+EC, and the slow-reporter FileCount dedupe case. Refs #9086

* Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe

- Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding package that walk the shard bitmap directly instead of materializing a ShardsInfo and copying it via MinusParityShards(). Keeps the DataShardsCount dependency encapsulated in one place and avoids the per-shard allocation/copy overhead in the metrics hot path.
- Switch shell collectCollectionInfo ecVolumes map to a composite {collection, volumeId} key, matching the bucket_size_metrics collector and defending against any cross-collection volume id aliasing.
- Dedupe VolumeCount in shell addToCollection by volume id so regular volumes aren't counted once per replica presence. Aligns the shell's collection.list output with the S3 metrics collector and the EC branch, all of which now report logical volume counts.
- Add unit tests for the new helpers and for the regular-volume VolumeCount dedupe.

* Parameterize EcShardsDataSize with dataShards for custom EC ratios

Add a dataShards parameter to EcShardsDataSize so forks with per-volume ratio metadata (e.g. the enterprise data_shards field carried on an extended VolumeEcShardInformationMessage) can pass the configured value and get accurate logical sizes under custom EC policies like 6+3 or 16+6. Passing 0 or a negative value falls back to the upstream DataShardsCount default, which is correct for the fixed 10+4 layout — so OSS callers in s3api and shell pass 0 and keep their current behavior. Added table cases covering the custom 6+3 and 16+6 paths so the parameterization is pinned by tests.
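The aggregation rules above, as a small illustrative Go sketch (the struct and method are hypothetical):

```go
package shell

// ecVolumeAgg accumulates per-shard-holder reports for one EC volume:
// physical and logical sizes are summed across holders, FileCount is deduped
// via max (a slow node reporting 0 must not pin the aggregate), DeleteCount is
// summed, and the volume is counted once per unique volume id by the caller.
type ecVolumeAgg struct {
	PhysicalSize uint64
	LogicalSize  uint64
	FileCount    uint64
	DeleteCount  uint64
}

func (a *ecVolumeAgg) addShardReport(physical, logical, fileCount, deleteCount uint64) {
	a.PhysicalSize += physical // shards are node-local, not replicas: sum them
	a.LogicalSize += logical   // data shards only; parity excluded upstream
	if fileCount > a.FileCount {
		a.FileCount = fileCount // every holder reports the same .ecx count; take max
	}
	a.DeleteCount += deleteCount // each delete tombstones exactly one node's .ecj
}
```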
||
|
|
49f62df2cf |
fix(shell): mergeVolumes hard-link safety and cleaner cleanup logging (#9163)
* fix(shell): skip hard-linked entries in fs.mergeVolumes Hard-linked entries share a chunk list with their siblings, but the filer's UpdateEntry only rewrites one entry at a time. Moving a chunk here leaves every other hard-linked sibling pointing at a fid that either gets deleted by the filer's own garbage step after UpdateEntry or by deleteMovedSourceNeedles (#9160) — either way, the siblings end up with dangling references. Skip the entry with a visible log line so operators know the file was bypassed and can handle it explicitly (copy-then-unlink, or dedup before merge). Detected via entry.HardLinkId being non-empty, which is the same signal the filer itself uses (weed/pb/filer_pb/filer.pb.go:451). Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes suppresses 404 alongside 304 in source cleanup BatchDelete also returns StatusNotFound (404) for an already-deleted needle when ReadVolumeNeedle can't find it in the cookie-check path, not only StatusNotModified (304) from DeleteVolumeNeedle returning size 0. Both are benign races against a concurrent fsck purge or a replica that already reconciled, so don't clutter the output with "delete ... not found" warnings for them. Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes merges hard-linked files via dedup on HardLinkId Hard-linked siblings share one chunk list through a KV blob keyed by HardLinkId (see weed/filer/filerstore_hardlink.go). UpdateEntry's setHardLink rewrites that blob and maybeReadHardLink overrides per-entry chunks with the blob's on every read, so a single UpdateEntry propagates new fids to every sibling automatically — the previous skip-hardlinks bailout was overly conservative and left hard-linked files stuck on merge-source volumes forever. Process each HardLinkId exactly once per run with a sync.Map so BFS workers in different directories synchronize without a global lock. First sibling carries the chunk move + UpdateEntry; later siblings find the id in the map and return — preventing the real race, which is two siblings trying to re-download an already-moved source needle or double-queue the same fid for deletion. Also address the log-spam review on deleteMovedSourceNeedles: an unreachable volume server returns one error per needle, so collapse multiple failures into a single per-server line with the first error as an example. |
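A sketch of the once-per-HardLinkId claim described above; the helper name is invented, but the sync.Map LoadOrStore pattern is the mechanism the commit relies on:

```go
package shell

import "sync"

// processedHardLinks lets BFS workers in different directories agree on who
// handles a hard-link group without a global lock.
var processedHardLinks sync.Map // string(HardLinkId) -> struct{}

// claimHardLink reports whether the caller should process this entry.
// Non-hard-linked entries are always processed; for hard-linked ones only
// the first sibling to claim the id carries the chunk move + UpdateEntry,
// since the filer propagates the rewritten chunk list to every sibling.
func claimHardLink(hardLinkId []byte) bool {
	if len(hardLinkId) == 0 {
		return true
	}
	_, alreadyClaimed := processedHardLinks.LoadOrStore(string(hardLinkId), struct{}{})
	return !alreadyClaimed
}
```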
||
|
|
caaa53aee3 |
fix(shell): ec.encode health-check key mismatch on K8s deployments (#9164)
Build freeVolumeCountMap using dn.Address so the key matches wdclient.Location.Url during the subsequent lookup. Keying by dn.Id silently filtered out every replica in deployments where dn.Id is a short name (e.g. Kubernetes StatefulSet pod name) while the location Url is a FQDN:port, causing "no healthy replicas" even with ample free capacity. Also filter replicas before marking volumes readonly so that a failed health check no longer strands volumes in readonly state. Fixes #9145 |
||
|
|
e725eb4079 |
refactor(shell): run volume.fsck purge once per volume, after all replicas (#9159)
* refactor(shell): run volume.fsck purge once per volume, after all replicas The purge step in findExtraChunksInVolumeServers was nested inside the outer `for dataNodeId` loop, so it fired once per data-node iteration rather than once total. Two consequences: 1. The replica-intersection safety net was broken. The code marks a fid "found in all replicas" only after every replica has reported its orphans, but the purge ran after the first data node already, so fids contributed only by later replicas never got the `true` flag in time. Without `-forcePurging` that meant some legitimate orphans were never purged; with `-forcePurging` the flag was ignored so the bug was hidden. 2. Visible output got noisy: "purging orphan data for volume X..." printed 2-3 times per volume (N_datanodes * N_replicas RPCs to the same locations) since purgeFileIdsForOneVolume already fans out to every replica location via MasterClient.GetLocations. Split the work into two explicit phases: collect orphans from every replica first, then purge each volume once. Drop the per-replica loop around purgeFileIdsForOneVolume since it already handles all replicas internally. Keep the per-replica mark-writable loop (each replica's readonly bit has to be flipped before the purge RPC fans out to it). Also simplify the gating expression — `isSeveralReplicas && foundInAllReplicas` is redundant given the preceding `!isSeveralReplicas` branch — and replace `!(X > 0)` with the more idiomatic `len(X) == 0`. Related to #9116 follow-up on multiple fsck passes needed to fully clean a volume. * address review: per-replica readonly tracking, count-based intersection, defer-per-volume Three issues raised on the v1: 1. The readonly cleanup stored a single isReadOnlyReplicas[volumeId]=bool that flipped true if any replica was read-only, then the defer marked every replica in serverReplicas[volumeId] read-only on exit. If a volume had mixed replica modes (one RO, one RW), the originally-RW replica ended up RO after fsck returned. Track read-only state per replica in readOnlyServerReplicas[volumeId] and revert only those. 2. The defer inside the volumeId loop accumulated for the entire fsck run, so every volume we processed stayed writable until the whole command returned. Split the per-volume logic into purgeOneVolume so the defers unwind between volumes. 3. The intersection logic used a sticky bool that treated "seen on any 2 of 3 replicas" as "seen on all replicas" — a 3+-replica volume would get purged for fids only 2 replicas agreed on, which is what -forcePurging is supposed to opt into. Switch to a count-based map[fid]int compared against volumeReplicaCounts[volumeId], so we only purge without -forcePurging when every replica agrees. Also drop the now-unused serverReplicas map. |
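The count-based intersection from the second revision, sketched with hypothetical names: without -forcePurging a fid is purged only when every replica reported it as orphaned.

```go
package shell

// selectPurgeableFids compares how many replicas reported each orphan fid
// against the volume's replica count. Without forcePurging, only fids that
// every replica agrees on are purged; forcePurging opts into partial agreement.
func selectPurgeableFids(orphanSeenCount map[string]int, replicaCount int, forcePurging bool) []string {
	var purgeable []string
	for fid, seen := range orphanSeenCount {
		if forcePurging || seen == replicaCount {
			purgeable = append(purgeable, fid)
		}
	}
	return purgeable
}
```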
||
|
|
08a7502b2c |
fix(shell): error on missing volume id in fsck, mergeVolumes, vacuum (#9158)
* fix(shell): error on missing volume id in fsck, mergeVolumes, vacuum Three shell commands silently report success when -volumeId / -fromVolumeId / -toVolumeId names a volume the master doesn't know about: typos, already-deleted volumes, and stale scripts all look identical to a clean no-op, which is what made the confusion in #9116 take as long as it did to diagnose. - volume.fsck: filter at the per-datanode loop drops unknown ids and findExtraChunksInVolumeServers ends with totalOrphanChunkCount==0, printing "no orphan data". - fs.mergeVolumes: createMergePlan iterates only known volumes, so an unknown -fromVolumeId produces an empty plan and we print just the "max volume size: N MB" header (indistinguishable from "nothing to merge"). - volume.vacuum: the master's VacuumVolume RPC silently iterates matching volumes; a missing id returns success having done nothing. Validate the requested ids against the current topology up front and return an explicit "volume(s) not found on master: [X Y]" error. Also drop a stale duplicate `if err != nil` in volume.fsck.Do left over from a prior refactor. Surfaces #9116 follow-up from madalee-com. * address review: propagate reloadVolumesInfo error; dedupe vacuum missing ids - fs.mergeVolumes: c.reloadVolumesInfo's return was ignored. If the master is unreachable or VolumeList fails, c.volumes stays empty and the new validation block reports "fromVolumeId X not found on master" — masking the real connection/RPC failure. Return the wrapped error instead. - volume.vacuum: "volume.vacuum -volumeId 5,5,5" on a missing volume 5 listed [5 5 5] in the error. Collect missing ids in a set so each missing id appears once. * address review: reject fromVolumeId/toVolumeId values that overflow uint32 flag.Uint produces a uint (64-bit on amd64), and the existing cast to needle.VolumeId silently truncates to uint32. A typo like `-fromVolumeId=4294967297` would wrap to volume 1 and slip past every other validation, so the merge would run against a completely different volume than the operator intended. Bail out with an explicit error when the raw flag value exceeds the uint32 range, before the cast. |
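Two of the guards above, sketched with hypothetical names: the uint32 range check before the needle.VolumeId cast, and the set that dedupes missing ids in the error message.

```go
package shell

import (
	"fmt"
	"math"
	"sort"
)

// parseVolumeIdFlag rejects flag values that would silently wrap when cast
// to the 32-bit volume id type (flag.Uint is 64-bit on amd64).
func parseVolumeIdFlag(raw uint) (uint32, error) {
	if uint64(raw) > math.MaxUint32 {
		return 0, fmt.Errorf("volume id %d exceeds the uint32 range", raw)
	}
	return uint32(raw), nil
}

// findMissingVolumeIds reports each requested id the master's topology does
// not know about, once, even if the operator typed "5,5,5".
func findMissingVolumeIds(requested []uint32, known map[uint32]bool) []uint32 {
	missingSet := make(map[uint32]bool)
	for _, vid := range requested {
		if !known[vid] {
			missingSet[vid] = true
		}
	}
	var missing []uint32
	for vid := range missingSet {
		missing = append(missing, vid)
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	return missing
}
```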
||
|
|
6bdd775963 |
feat(shell): fs.mergeVolumes deletes source needles after filer update (#9160)
* feat(shell): fs.mergeVolumes deletes source needles after filer update Before this change, mergeVolumes only copied chunks to the destination volume and updated the filer — the source needle sat untouched on its original volume as a silent orphan. Operators had to run a separate volume.fsck + volume.vacuum pass to actually reclaim the space, and #9116 (comment 4282692876) showed how that pipeline can look exactly like "mergeVolumes did nothing": the source volume keeps reporting its original size even though every chunk has been logically moved out. Clean up the source inline. For each entry, track the pre-move fids as they're captured, and after the UpdateEntry RPC commits, issue BatchDelete on every replica of each source volume. Key invariants: - Source fids are only deleted AFTER UpdateEntry succeeds; if the filer write fails we skip the cleanup for that entry so we never delete data the filer still references. - rewriteManifestChunk grew a fourth return value so nested manifest and sub-chunk moves propagate their moved-source list back to the top-level callsite. The outer manifest itself is recorded at the callsite, since only the callsite sees the pre-rewrite fid. - deleteMovedSourceNeedles logs errors but never returns them. Propagating would abort TraverseBfs mid-merge, stranding remaining entries; logging leaves the fallback path (fsck reconciles later) intact. - StatusNotModified from the volume server is expected whenever a concurrent fsck purge beat us to the delete or a replica already reconciled — don't warn on it. Readonly source volumes are already rejected up front by createMergePlan, so by the time we reach the delete the source is writable. If a replica's readonly bit has flipped since then the delete will fail and get logged; the user can re-run once they've fixed the replica (same failure mode as today's fsck purge). Fixes the space-not-reclaimed half of #9116. Related design discussion: #8589. * address review: cast r.Status to int in StatusNotModified compare http.StatusNotModified is an untyped constant so the compare works as written, but the int32/int mixed-type signal trips static analyzers and PR tooling. Cast explicitly and note why. |
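A sketch of the cleanup's status filtering under simplified types: source-fid deletes run only after the filer commit, and benign statuses stay quiet instead of warning.

```go
package shell

import (
	"log"
	"net/http"
)

// deleteResult loosely mirrors the per-fid status a volume server returns
// from a batch delete (shape simplified for this sketch).
type deleteResult struct {
	FileId string
	Status int32
	Error  string
}

// logSourceCleanup warns only on genuine failures. A 304 (and, per the later
// follow-up in #9163, a 404) means a concurrent fsck purge or an already
// reconciled replica got there first, which is expected and harmless.
func logSourceCleanup(results []deleteResult) {
	for _, r := range results {
		switch int(r.Status) { // explicit cast, as the review note suggests
		case http.StatusNotModified, http.StatusNotFound:
			continue
		}
		if r.Error != "" {
			log.Printf("delete %s: %s", r.FileId, r.Error)
		}
	}
}
```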
||
|
|
9a6b566fb1 |
fix(shell): volume.fsck keeps going past a single broken chunk manifest (#9140)
* fix(shell): volume.fsck no longer aborts on a single broken chunk manifest Previously a single entry whose chunk-manifest could not be read (e.g. the manifest needle was missing or its sub-chunks pointed at a now-gone volume) caused collectFilerFileIdAndPaths to return immediately with "failed to ResolveChunkManifest". The whole fsck run failed, so an operator with even one corrupted file could not use volume.fsck to find or clean up unrelated orphan needles on other volumes — they had to locate and delete the bad entries first, blind, with no help from fsck. Log the resolution failure with the entry path, fall back to recording the top-level chunk fids the entry references (data fids and manifest fids themselves; sub-chunks behind the unresolvable manifest stay unknown), and keep traversing. Track the count of unresolved entries on the command struct and refuse -reallyDeleteFromVolume for the run when the count is non-zero, since the in-use fid set is incomplete and a purge could otherwise delete live sub-chunks behind the broken manifest. Read-only fsck still produces a useful (if conservatively over-reported) orphan listing so the operator can see and fix the broken entries first, then re-run with apply. Discovered while diagnosing #9116. * address review: use callback ctx and atomic counter - Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error cancellation propagates into the manifest fetch instead of using context.Background(). - TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go), so the unresolvedManifestEntries field on commandVolumeFsck is shared across workers and was racing. Switch it to atomic.Int64 with Add/Load. * address review: reset counter per Do(), pass through ctx errors - commandVolumeFsck is a singleton registered in init() and reused across shell invocations. Without resetting the unresolved-manifest counter at the top of Do(), a single failed run permanently suppressed -reallyDeleteFromVolume in the same shell session. Reset to 0 right after flag parsing. - Treating context cancellation as manifest corruption was wrong: a Ctrl+C or deadline mid-traversal would inflate the counter and emit misleading "manifest broken" warnings for entries that were never examined. Detect context.Canceled / context.DeadlineExceeded and return the error so the BFS unwinds cleanly. Not changing the findMissingChunksInFiler branch's purgeAbsent / applyPurging gating: that path checks recorded filer fids against volume idx files, and a broken-manifest entry's recorded manifest fid will fail the existence check and get purged — which is the cleanup the operator wants for those entries. Adding a gate would block the exact use case the warning points them at. |
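A sketch of the cancellation-aware counting from the review rounds above; the names are illustrative, the atomic counter and the context.Canceled/DeadlineExceeded carve-out are the points being pinned.

```go
package shell

import (
	"context"
	"errors"
	"sync/atomic"
)

// unresolvedManifestEntries is shared by the BFS worker goroutines, so it is
// an atomic counter; it must be reset at the top of Do() because the command
// struct is a singleton reused across shell invocations.
var unresolvedManifestEntries atomic.Int64

// noteManifestResolveFailure decides whether a ResolveChunkManifest error is
// corruption (count it, keep traversing) or cancellation (propagate it so the
// BFS unwinds without inflating the counter).
func noteManifestResolveFailure(err error) error {
	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
		return err
	}
	unresolvedManifestEntries.Add(1)
	return nil
}
```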
||
|
|
d57fc67022 |
fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127)
* fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was true, printing "Change volume id for large file is not implemented yet" and continuing. Because the BFS traversal only looks at top-level entry.Chunks, sub-chunks referenced inside a manifest were never considered either. For any file stored as a chunk manifest (large files go this path), chunks in the source volume stayed put, leaving behind a few MB of live data that vacuum and volume.deleteEmpty couldn't clean up. This change resolves each manifest chunk recursively, moves any sub-chunk whose volume id is in the merge plan via the existing moveChunk path, and re-serializes the manifest. If the manifest chunk itself lives in a source volume, or any sub-chunk moved, the new manifest blob is uploaded to a freshly assigned file id (the old needle becomes orphaned and is reclaimed by vacuum like any other moved chunk). Fixes #9116. * address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes - Call UpdateEntry once per entry after the chunk loop instead of once per moved chunk (gemini nit). - In dry-run mode, mark anySubChanged when a sub-chunk in the plan is encountered and return changed=true after printing "rewrite manifest", so nested manifests also surface their would-rewrites (gemini nit). - Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is restored even when proto.Marshal fails (coderabbit nit). - Reject AssignVolume results whose file id lands on a volume that is a source in the merge plan, and retry — otherwise the replacement manifest could be written to the volume being emptied (coderabbit). |
||
|
|
96af27a131 |
feat(shell): add fs.distributeChunks command for even chunk distribution (#9117)
* feat(shell): add fs.distributeChunks command for even chunk distribution
Add a new weed shell command that redistributes a file's chunks evenly
across volume server nodes.
Supports three distribution modes via -mode flag:
- primary: balance chunk ownership across nodes (default)
- replica: balance both ownership and replica copies
- round-robin: assign chunks by offset order for sequential read
optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...)
Additional options:
- -nodes=N to target specific number of nodes
- -apply to execute (dry-run by default)
Usage:
fs.distributeChunks -path=/buckets/file.dat
fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply
fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply
fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply
* fix(shell): improve fs.distributeChunks robustness and code quality
- Propagate flag parse errors instead of swallowing them (return err)
- Handle nil chunk.Fid by falling back to legacy FileId string parsing
- Simplify node membership check using slices.Contains
* fix(shell): fix dead round-robin print loop in fs.distributeChunks
The loop was computing targetNode with sc.index%totalNodes (original
chunk index) instead of the sequential position, and discarding it via
_ = targetNode without printing anything. Replace with a correct loop
using pos%totalNodes and actually print the first 12 node assignments (a position-based sketch follows this entry).
* fix(shell): compute replication/collection per-chunk in fs.distributeChunks
Previously replication and collection were derived once from chunks[0]
and reused for all moves, causing wrong volume placement for chunks
belonging to different volumes or collections. Now each chunk looks up
its own volumeInfoMap entry immediately before calling operation.Assign.
* fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks
When the master returns an Auth token in the Assign response, use it
directly for the upload instead of generating a new JWT from the local
viper signing key. Fall back to local key generation only when Auth is
empty, matching the pattern used by other upload paths.
* fix(shell): add timeout and error handling to delete requests in fs.distributeChunks
The delete loop was ignoring http.NewRequest errors and had no timeout,
risking a nil-request panic or indefinite block. Replace with
http.NewRequestWithContext and a 30s timeout, handle request creation
errors by incrementing deleteFailCount, and cancel the context
immediately after Do returns.
* feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup
Sequential chunk moves are a bottleneck for large LLM model files with
hundreds or thousands of chunks. Use ErrorWaitGroup with
DefaultMaxParallelization (10) to run download/assign/upload concurrently.
Guard movedRecords appends, chunk.Fid updates, and writer output with a
mutex. Individual chunk failures are non-fatal and logged inline; only
successfully moved chunks are included in the metadata update.
* fix(shell): try all replica URLs on download in fs.distributeChunks
Previously only the first volume server URL was attempted, causing chunk
moves to fail if that replica was unreachable. Now iterates through all
URLs returned by LookupVolumeServerUrl and stops at the first success.
* refactor(shell): apply extract method pattern to fs.distributeChunks
Do() was a single ~615-line function. Break it into focused helpers:
- lookupFileEntry: filer entry lookup
- validateChunks: chunk manifest guard
- collectVolumeTopology: master topology query + ownership mapping
- buildDistributionCounts: chunk→node mapping and owner/copy tallies
- selectActiveNodes: target node selection
- printCurrentDistribution: per-node distribution table
- planDistribution: mode-switch planning (primary/replica/round-robin)
- printRedistributionPlan: before/after plan table
- relevantNodes: active-or-occupied node filter
Do() is now ~100 lines of orchestration; each helper has a single
clear responsibility.
* test(shell): add unit tests for fs.distributeChunks algorithms
Cover all three distribution modes and supporting helpers:
- shortName, relevantNodes
- computeOwnerTarget (even/uneven split, inactive node drain)
- buildDistributionCounts (normal + nil Fid fallback)
- selectActiveNodes (all nodes / limited count)
- planOwnerMoves (imbalanced → balanced, already balanced)
- planDistribution primary (chunks balanced, no-op when even)
- planDistribution round-robin (offset ordering, correct assignment)
- planDistribution replica (owner + copy balancing)
- printRedistributionPlan (output format)
* fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks
Download requests had no per-request timeout, unlike delete operations
which already use 30s. Replace readUrl() calls with inline
http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume
server cannot block a goroutine indefinitely during redistribution.
* fix(shell): remove redundant deleteOldChunks in fs.distributeChunks
filer.UpdateEntry already calls deleteChunksIfNotNew internally, which
computes the diff between old and new entry chunks and deletes the ones
no longer referenced. Our explicit deleteOldChunks was racing with this
filer-side cleanup, causing spurious 404 warnings on ~75% of deletes.
Remove deleteOldChunks, movedChunkRecord type, and reduce
executeChunkMoves return type to (int, error) for the moved count.
* fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks
chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with
a FileId string instead of a Fid struct, causing them to be skipped in
the replica balancing loop and looked up incorrectly in volumeInfoMap.
Introduce chunkVolumeId() that uses Fid when present and falls back to
parsing the legacy FileId string, matching the logic in
buildDistributionCounts. Apply it in the replica-mode copies loop and
in executeChunkMoves' replication/collection lookup.
* fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks
chunkVolumeId(chunk) was being called to look up replication/collection
after oldFid had already been parsed and validated. Use oldFid.VolumeId
directly to avoid redundant parsing and guarantee the correct volume ID
regardless of whether chunk.Fid is nil.
* fix(shell): improve correctness and robustness in fs.distributeChunks
- Buffer download body before upload so dlCtx timeout only covers the
GET request; upload runs with context.Background() via bytes.NewReader
- Replace 'before, after := strings.Cut(...)' + '_ = before' with '_'
as the first return value directly
- Clone copiesCount before replica planner mutates it, keeping the
caller's map immutable
- Add nil-entry guard after filer LookupEntry to prevent panic on
unexpected nil response
* feat(shell): support chunk manifests in fs.distributeChunks
Large files stored as chunk manifests were previously rejected. Resolve
manifests up front via filer.ResolveChunkManifest, redistribute the
underlying data chunks, then re-pack through filer.MaybeManifestize
before UpdateEntry. The filer's MinusChunks resolves manifests on both
sides of the diff, so old manifest and inner data chunks are GC'd
automatically.
* fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature
Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
|
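The round-robin fix above boils down to assigning by sequential position rather than the chunk's original index; a minimal sketch (helper name invented):

```go
package shell

// planRoundRobin maps the i-th chunk (in offset order) to nodes[i % len(nodes)],
// so a front-to-back read alternates volume servers:
// chunk[0]->A, chunk[1]->B, chunk[2]->C, ...
func planRoundRobin(chunkCount int, nodes []string) []string {
	if len(nodes) == 0 {
		return nil
	}
	plan := make([]string, chunkCount)
	for pos := 0; pos < chunkCount; pos++ {
		plan[pos] = nodes[pos%len(nodes)]
	}
	return plan
}
```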
||
|
|
979c54f693 |
fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals Raft leader is advertised as host:httpPort.grpcPort, but clients dial host:httpPort. Raw string comparison against VolumeLocation.Leader / HeartbeatResponse.Leader therefore never matches, causing the masterclient and the volume server heartbeat loop to continuously "redirect" to the already-connected master, tearing down the stream and reconnecting. Use ServerAddress.Equals, which normalizes the grpc-port suffix. * fix(filer,mq): compare ServerAddress via Equals in two more sites filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local partition assignment check both compared a wire-supplied address string against the local self ServerAddress with raw string equality. Both are vulnerable to the same plain-vs-host:port.grpcPort mismatch as the masterclient/volume heartbeat sites: filer would bootstrap from itself, and the broker would fail to claim a partition it was actually assigned. Route both through ServerAddress.Equals. * fix(master,shell): more ServerAddress comparisons via Equals - raft_server_handlers.go HealthzHandler: s.serverAddr == leader would skip the child-lock check on the real leader when the two carry different plain/grpc-suffix forms, returning 200 OK instead of 423. - master_server.go SetRaftServer leader-change callback: the Leader() == Name() guard for ensureTopologyId could disagree with topology.IsLeader() (which already uses Equals), so leader-only initialization could be skipped after an election. - command_volume_merge.go isReplicaServer: the -target guard compared user-supplied host:port against NewServerAddressFromDataNode(...) with ==, letting an existing replica slip through when topology carries the embedded gRPC port. All routed through pb.ServerAddress.Equals. * fix(mq,cluster): more ServerAddress comparisons via Equals - broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the partition ownership check gated listing on raw LeaderBroker == BrokerAddress().String(), so listings silently omitted partitions hosted locally when the assignment carried the other host:port / host:port.grpcPort form. - lock_client.go: LockHostMovedTo comparison and the seedFiler fallback guard both used raw string equality against configured filer addresses (which may be plain host:port while LockHostMovedTo comes back suffixed), causing spurious host-change churn and blocking the seed-filer fallback. * fix(mq): more ServerAddress comparisons via Equals - pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct activeBrokers.Get() lookup missed brokers when a persisted assignment carried a different address encoding than the registered broker key, triggering a bogus reassignment on every read/write cycle. Added a findActiveBroker helper that falls back to an Equals-based scan and canonicalizes the assignment in place so later writes are stable. - broker_grpc_lookup.go isLockOwner: used raw string equality between LockOwner() and BrokerAddress().String(), so a lock owner could fail to recognize itself and proxy local lookup/config/admin RPCs away. - pub_client/scheduler.go onEachAssignments: reused publisher jobs only on exact LeaderBroker match, so an encoding flip in lookup results tore down and recreated a stream to the same broker. |
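The root cause is an address-encoding mismatch; a sketch of the normalization (the real pb.ServerAddress.Equals has its own parsing, this only shows the host:httpPort vs host:httpPort.grpcPort equivalence):

```go
package pb

import "strings"

// stripGrpcPortSuffix reduces "host:httpPort.grpcPort" to "host:httpPort";
// a plain "host:httpPort" passes through unchanged.
func stripGrpcPortSuffix(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr
	}
	if dot := strings.Index(addr[colon:], "."); dot >= 0 {
		return addr[:colon+dot]
	}
	return addr
}

// addressesEqual is an illustrative stand-in for ServerAddress.Equals:
// "10.0.0.5:8080" and "10.0.0.5:8080.18080" compare equal, so a leader
// advertisement never looks like a different master than the one already dialed.
func addressesEqual(a, b string) bool {
	return stripGrpcPortSuffix(a) == stripGrpcPortSuffix(b)
}
```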
||
|
|
10e7f0f2bc |
fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy Instead of erroring when the user already exists, the command now creates the policy and attaches it to the existing user via UpdateUser. Credentials are only generated and displayed for newly created users. * fix(shell): skip duplicate policy attachment in s3.user.provision Check if the policy is already attached before appending and calling UpdateUser, making repeated runs idempotent. * fix(shell): generate service account ID in s3.serviceaccount.create The command built a ServiceAccount proto without setting Id, which was rejected by credential.ValidateServiceAccountId on any real store. Now generates sa:<parent>:<uuid> matching the format used by the admin UI. * test(s3): integration tests for s3.* shell commands Adds TestShell* integration tests covering ~40 previously untested shell commands: user, accesskey, group, serviceaccount, anonymous, bucket, policy.attach/detach, config.show, and iam.export/import. Switches the test cluster's credential store from memory to filer_etc because the memory store silently drops groups and service accounts in LoadConfiguration/SaveConfiguration. * fix(shell): rollback policy on key generation failure in s3.user.provision If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after the policy was persisted, the policy would be left orphaned. Extracts the rollback logic into a local closure and invokes it on all failure paths after policy creation for consistency. * address PR review feedback for s3 shell tests and serviceaccount - s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for the service account UUID instead of 4 bytes to eliminate collision risk - s3.serviceaccount.create: print the actual ID and drop the outdated "server-assigned" note (the ID is now client-generated) - tests: guard createdAK in accesskey rotate/delete subtests so sibling failures don't run invalid CLI calls - tests: requireContains/requireNotContains use t.Fatalf to fail fast - tests: Provision subtest asserts the "Attached policy" message on the second provision call for an existing user - tests: update extractServiceAccountID comment example to match the sa:<parent>:<uuid> format - tests: drop redundant saID empty-check (extractServiceAccountID fatals) * test(s3): use t.Fatalf for precondition check in serviceaccount test |
||
|
|
e648c76bcf | go fmt | ||
|
|
b1265de78f |
feat(shell): add group management commands (#8993)
* feat(shell): add group management commands Add weed shell commands for IAM group management: - s3.group.create -name <group> - s3.group.delete -name <group> - s3.group.list - s3.group.show -name <group> - s3.group.add-user -group <group> -user <user> - s3.group.remove-user -group <group> -user <user> All commands use GetConfiguration/PutConfiguration gRPC pattern, consistent with existing shell commands like s3.user.list. * fix: add nil check for Configuration in group shell commands Guard against nil Configuration response from GetConfiguration gRPC call to prevent potential panics. (Gemini review) |
||
|
|
7f3908297c |
fix(weed/shell): suppress prompt when piped (#8990)
* fix(weed/shell): suppress prompt when stdin or stdout is not a TTY When piping weed shell output (e.g. `echo "s3.user.list" | weed shell | jq`), the "> " prompt was written to stdout, breaking JSON parsers. `liner.TerminalSupported()` only checks platform support, not whether stdin/stdout are actual TTYs. Add explicit checks using `term.IsTerminal()` so the shell falls back to the non-interactive scanner path when piped. Fixes #8962 * fix(weed/shell): suppress informational logs unless -verbose is set Suppress glog info messages and connection status logs on stderr by default. Add -verbose flag to opt in to the previous noisy behavior. This keeps piped output clean (e.g. `echo "s3.user.list" | weed shell | jq`). * fix(weed/shell): defer liner init until after TTY check Move liner.NewLiner() and related setup (history, completion, interrupt handler) inside the interactive block so the terminal is not put into raw mode when stdout is redirected. Previously, liner would set raw mode unconditionally at startup, leaving the terminal broken when falling back to the scanner path. Addresses review feedback from gemini-code-assist. * refactor(weed/shell): consolidate verbose logging into single block Group all verbose stderr output within one conditional block instead of scattering three separate if-verbose checks around the filer logic. Addresses review feedback from gemini-code-assist. * fix(weed/shell): clean up global liner state and suppress logtostderr - Set line=nil after Close() to prevent stale state if RunShell is called again (e.g. in tests) - Add nil check in OnInterrupt handler for non-interactive sessions - Also set logtostderr=false when not verbose, in case it was enabled Addresses review feedback from gemini-code-assist. * refactor(weed/shell): make liner state local to eliminate data race Replace the package-level `line` variable with a local variable in RunShell, passing it explicitly to setCompletionHandler, loadHistory, and saveHistory. This eliminates a data race between the OnInterrupt goroutine and the defer that previously set the global to nil. Addresses review feedback from gemini-code-assist. * rename(weed/shell): rename -verbose flag to -debug Avoid conflict with -verbose flags already used by individual shell commands (e.g. ec.encode, volume.fix.replication, volume.check.disk). |
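A sketch of the piped-mode detection: both ends must be real TTYs before the liner prompt (and raw mode) is set up, otherwise fall through to a plain scanner. golang.org/x/term provides IsTerminal, as the commit notes; the surrounding wiring here is illustrative.

```go
package main

import (
	"bufio"
	"fmt"
	"os"

	"golang.org/x/term"
)

// isInteractive reports whether a prompt should be shown: platform support
// alone (liner.TerminalSupported) is not enough, stdin AND stdout must be TTYs.
func isInteractive() bool {
	return term.IsTerminal(int(os.Stdin.Fd())) && term.IsTerminal(int(os.Stdout.Fd()))
}

func main() {
	if !isInteractive() {
		// piped mode: no "> " prompt, no raw mode, just read commands line by line
		sc := bufio.NewScanner(os.Stdin)
		for sc.Scan() {
			fmt.Println("would run:", sc.Text())
		}
		if err := sc.Err(); err != nil {
			fmt.Fprintln(os.Stderr, "read error:", err)
		}
		return
	}
	// interactive path: initialize liner, history, completion, interrupt handler here
}
```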
||
|
|
74905c4b5d |
shell: s3.* commands always output JSON, connection messages to stderr (#8976)
* shell: s3.* commands output JSON, connection messages to stderr
All s3.user.* and s3.policy.attach|detach commands now output structured
JSON to stdout instead of human-readable text:
- s3.user.create: {"name","access_key"} (secret key to stderr only)
- s3.user.list: [{name,status,policies,keys}]
- s3.user.show: {name,status,source,account,policies,credentials,...}
- s3.user.delete: {"name"}
- s3.user.enable/disable: {"name","status"}
- s3.policy.attach/detach: {"policy","user"}
Connection startup messages (master/filer) moved to stderr so they
don't pollute structured output when piping.
Closes #8962 (partial — covers merged s3.user/policy commands).
* shell: fix secret leak, duplicate JSON output, and non-interactive prompt
- s3.user.create: only echo secret key to stderr when auto-generated,
never echo caller-supplied secrets
- s3.user.enable/disable: fix duplicate JSON output — remove inner
write in early-return path, keep single write site after gRPC call
- shell_liner: use bufio.Scanner when stdin is not a terminal instead
of liner.Prompt, suppressing the "> " prompt in piped mode
* shell: check scanner error, idempotent enable output, history errors to stderr
- Check scanner.Err() after non-interactive input loop to surface read errors
- s3.user.enable: always emit JSON regardless of current state (idempotent)
- saveHistory: write error messages to stderr instead of stdout
|
||
|
|
fb0573ffc4 | shell: rename -force to -apply in s3.iam.import for consistency | ||
|
|
d50889002b |
shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands (#8956)
* shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands Add import/export, configuration summary, and a convenience provisioning command: - s3.iam.export: dump full IAM state as JSON (stdout or file) - s3.iam.import: replace IAM state from a JSON file - s3.config.show: human-readable summary (users, policies, service accounts, groups with status and counts) - s3.user.provision: one-step user+policy+credentials creation for common readonly/readwrite/admin roles Hide legacy commands from help listing: - s3.configure: still works but hidden from help output - s3.bucket.access: still works but hidden from help output Both hidden commands remain fully functional for existing scripts. Also adds a Hidden command tag and filters it from printGenericHelp. * shell: address review feedback for s3.iam.*, s3.config.show, s3.user.provision - Simplify joinMax using strings.Join - Fix rolePolicies: remove s3:ListBucket from object-level actions (already covered by bucket-level statement) - Fix admin role: grant s3:* on bucket resource too - Return flag parse errors instead of swallowing them * shell: address missed review feedback for PR 3 - s3.iam.import: require -force flag for destructive IAM overwrite - s3.config.show: add nil guard for resp.Configuration - s3.user.provision: check if user exists before creating policy - s3.user.provision: reject wildcard bucket names (* ?) * shell: distinguish NotFound from transient errors in provision, use %w wrapping - s3.user.provision: check gRPC status code on GetUser error — only proceed on NotFound, abort on transient/network errors - s3.iam.import: use %w for error wrapping to preserve error chains, wrap PutConfiguration error with context * shell: remove duplicate joinMax after PR 8954 merge command_s3_helpers.go defined joinMax which is already in command_s3_user_list.go from the merged PR 8954. * shell: restrict export file permissions, rollback policy on user create failure - s3.iam.export: use os.OpenFile with mode 0600 instead of os.Create to protect exported credentials from other users - s3.user.provision: rollback the created policy if CreateUser fails, with a warning if the rollback itself fails |
||
|
|
45bf3ad058 |
shell: add s3.user.* and s3.policy.attach|detach commands (#8954)
* shell: add s3.user.* and s3.policy.attach|detach commands Add focused IAM shell commands following a noun-verb model: - s3.user.create: create user with auto-generated or explicit credentials - s3.user.list: tabular listing with status, policies, key count - s3.user.show: detailed user view (status, source, policies, credentials) - s3.user.delete: delete a user - s3.user.enable: enable a disabled user - s3.user.disable: disable a user (preserves credentials and policies) - s3.policy.attach: attach a named policy to a user - s3.policy.detach: detach a policy from a user These commands are thin wrappers over the existing IAM gRPC service, producing human-readable output instead of raw protobuf text. This is part of a larger effort to replace the monolithic s3.configure command with a composable set of single-purpose commands. * shell: address review feedback for s3.user.* and s3.policy.attach|detach - Return flag parse errors instead of swallowing them (all commands) - Use GetConfiguration instead of N+1 GetUser calls in s3.user.list - Add nil check for resp.Identity in s3.user.show - Fix GetPolicy error masking in s3.policy.attach (wrap original error) - Simplify joinMax using strings.Join * shell: add nil identity guards and wrap gRPC errors - Add nil check for resp.Identity in policy_attach, policy_detach, user_enable, user_disable - Wrap GetUser errors with user context for better diagnostics |
||
|
|
d123a2768b |
shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands (#8955)
* shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands Add credential, anonymous access, and service account management commands: Access key commands: - s3.accesskey.create: add credentials to an existing user - s3.accesskey.list: list access keys for a user (key ID + status) - s3.accesskey.delete: remove a specific access key - s3.accesskey.rotate: atomic create-new + delete-old key rotation Anonymous access commands: - s3.anonymous.set: set/remove public access on a bucket - s3.anonymous.get: show anonymous access for a bucket - s3.anonymous.list: list all buckets with anonymous access Service account commands: - s3.serviceaccount.create: create with optional action subset and expiry - s3.serviceaccount.list: tabular listing, optionally filtered by parent - s3.serviceaccount.show: detailed view of a service account - s3.serviceaccount.delete: remove a service account These replace the credential and anonymous portions of the monolithic s3.configure and s3.bucket.access commands. * shell: address review feedback for s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* - Return flag parse errors instead of swallowing them (all commands) - Add action validation in s3.anonymous.set (Read, Write, List, Tagging, Admin) - Fix s3.serviceaccount.create output: note to use list for server-assigned ID since CreateServiceAccountResponse does not return the ID * shell: fix bucket matching and action validation in s3.anonymous.* - Use SplitN instead of HasSuffix for bucket name matching to avoid false positives when one bucket name is a suffix of another - Make action validation case-insensitive with canonical normalization * shell: fix nil panics, dedup actions, validate service account actions - Fix nil-pointer panic in getOrCreateAnonymousUser when GetUser returns err==nil with nil Identity (status.FromError(nil) returns nil status) - Add nil Identity guards in s3.anonymous.get and s3.anonymous.list - Deduplicate action values in s3.anonymous.set (e.g. -access Read,Read) - Add action validation in s3.serviceaccount.create with case normalization * shell: dedup actions and reject negative expiry in s3.serviceaccount.create - Deduplicate -actions values (e.g. Read,read,Read produces one entry) - Reject negative -expiry values instead of silently treating as no expiration |
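Case-insensitive normalization plus dedup, as described for s3.anonymous.set and s3.serviceaccount.create; a sketch with an invented helper name (the canonical action list is the one the entry cites):

```go
package shell

import (
	"fmt"
	"strings"
)

var canonicalActions = []string{"Read", "Write", "List", "Tagging", "Admin"}

// normalizeActions maps each raw value onto its canonical casing, rejects
// unknown actions, and collapses duplicates so "-access Read,read,READ"
// yields a single "Read".
func normalizeActions(raw []string) ([]string, error) {
	seen := make(map[string]bool)
	var out []string
	for _, a := range raw {
		canonical := ""
		for _, c := range canonicalActions {
			if strings.EqualFold(a, c) {
				canonical = c
				break
			}
		}
		if canonical == "" {
			return nil, fmt.Errorf("unknown action %q (supported: %s)", a, strings.Join(canonicalActions, ", "))
		}
		if !seen[canonical] {
			seen[canonical] = true
			out = append(out, canonical)
		}
	}
	return out, nil
}
```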
||
|
|
0fed72d95a |
volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas When -toReplication is specified, volume.tier.move now creates all required replicas on the destination tier before deleting old replicas. This closes the data-loss window where only one copy existed on the target tier while awaiting volume.fix.replication. If replication fulfillment fails, old replicas are preserved and marked writable so the volume remains accessible. Also extracts replicateVolumeToServer and configureVolumeReplication helpers to reduce duplication across volume.tier.move and volume.fix.replication. Fixes #8937 * volume.tier.move: always fulfill replication before deleting old replicas When -toReplication is specified, use that replication setting. Otherwise, read the volume's existing replication from the super block. In both cases, all required replicas are created on the destination tier before old replicas are deleted. If replication fulfillment fails (e.g. not enough destination nodes), old replicas are preserved and marked writable so no data is lost. * volume.tier.move: address review feedback on ensureReplicationFulfilled - Add 5s delay before re-collecting topology to allow master heartbeat propagation after the move - Add nil guard for targetTierReplicas to prevent panic if the moved replica is not yet visible in the topology - Treat configureVolumeReplication failure as a hard error instead of a warning, so the rollback logic preserves old replicas * volume.tier.move: harden replication config error handling - Make configureVolumeReplication failure on the primary moved replica a hard error that aborts the move, instead of logging and continuing - Configure replication metadata on all existing target-tier replicas (not just newly created ones) when -toReplication is specified - Deletion of old replicas cannot affect new replicas since the locations list only contains pre-move servers (verified, no change) * volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery Fix 1: The cleanup loop now preserves pre-existing target-tier replicas that ensureReplicationFulfilled counted toward the replication target. Previously, a mixed-tier volume with an existing replica on the target tier could have that replica deleted right after being counted as fulfilled, leaving the volume under-replicated. ensureReplicationFulfilled now returns a preserveServers set that the deletion loop checks before removing any old replica. Fix 2: Failure paths after LiveMoveVolume (which deletes the source replica) now use restoreSurvivingReplicasWritable instead of markVolumeReplicasWritable. The old helper stopped on first error, so attempting to mark the already-deleted source writable would prevent all surviving replicas from being restored. The new helper skips the deleted source and continues through all remaining locations, logging per-replica errors instead of aborting. * volume.tier.move: mark preserved replicas writable, skip nodes with existing volume Fix 1: Preserved pre-existing target-tier replicas were left read-only after the move completed. They were marked read-only at the start (along with all other replicas) but never restored since the old code deleted them. Now they are explicitly marked writable before cleanup. Fix 2: The fulfillment loop could pick a candidate node that already hosts this volume on a different disk type, causing a VolumeCopy conflict. Added a guard that skips any node already hosting the volume (on any disk) before attempting replication. |
||
|
|
995dfc4d5d |
chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking. |
||
|
|
4c72512ea2 |
fix(shell): avoid marking skipped or unplaced volumes as fixed (#8866)
* fix(s3api): fix AWS Signature V2 format and validation * fix(s3api): Skip space after "AWS" prefix (+1 offset) * test(s3api): add unit tests for Signature V2 authentication fix * fix(s3api): simplify signature comparison * fix(s3api): add validation for the colon extraction in expectedAuth * fix(shell): avoid marking skipped or unplaced volumes as fixed --------- Co-authored-by: chrislu <chris.lu@gmail.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> |
||
|
|
af68449a26 |
Process .ecj deletions during EC decode and vacuum decoded volume (#8863)
* Process .ecj deletions during EC decode and vacuum decoded volume (#8798) When decoding EC volumes back to normal volumes, deletions recorded in the .ecj journal were not being applied before computing the dat file size or checking for live needles. This caused the decoded volume to include data for deleted files and could produce false positives in the all-deleted check. - Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first - Vacuum the decoded volume after mounting in ec.decode to compact out deleted needle data from the .dat file - Add integration tests for decoding with non-empty .ecj files * storage: add offline volume compaction helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: compact decoded volumes before deleting shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: address PR review comments - Fall back to data directory for .ecx when idx directory lacks it - Make compaction failure non-fatal during EC decode - Remove misleading "buffer: 10%" from space check error message * ec: collect .ecj from all shard locations during decode Each server's .ecj only contains deletions for needles whose data resides in shards held by that server. Previously, sources with no new data shards to contribute were skipped entirely, losing their .ecj deletion entries. Now .ecj is always appended from every shard location so RebuildEcxFile sees the full set of deletions. * ec: add integration tests for .ecj collection during decode TestEcDecodePreservesDeletedNeedles: verifies that needles deleted via VolumeEcBlobDelete are excluded from the decoded volume. TestEcDecodeCollectsEcjFromPeer: regression test for the fix in collectEcShards. Deletes a needle only on a peer server that holds no new data shards, then verifies the deletion survives decode via .ecj collection. * ec: address review nits in decode and tests - Remove double error wrapping in mountDecodedVolume - Check VolumeUnmount error in peer ecj test - Assert 404 specifically for deleted needles, fail on 5xx --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> |
||
|
|
f256002d0b |
fix ec.balance failing to rebalance when all nodes share all volumes (#8796)
* fix ec.balance failing to rebalance when all nodes share all volumes (#8793) Two bugs in doBalanceEcRack prevented rebalancing: 1. Sorting by freeEcSlot instead of actual shard count caused incorrect empty/full node selection when nodes have different total capacities. 2. The volume-level check skipped any volume already present on the target node. When every node has a shard of every volume (common with many EC volumes across N nodes with N shards each), no moves were possible. Fix: sort by actual shard count, and use a two-pass approach - first prefer moving shards of volumes not on the target (best diversity), then fall back to moving specific shard IDs not yet on the target. * add test simulating real cluster topology from issue #8793 Uses the actual node addresses and mixed max capacities (80 vs 33) from the reporter's 14-node cluster to verify ec.balance correctly rebalances with heterogeneous node sizes. * fix pass comments to match 0-indexed loop variable |
||
|
|
e5cf2d2a19 |
Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.
Also exposes this option in the shell `volume.scrub` command.
* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.
|
||
|
|
ccc662b90b |
shell: add s3.bucket.access command for anonymous access policy (#8774)
* shell: add s3.bucket.access command for anonymous access policy (#7738) Add a new weed shell command to view or change the anonymous access policy of an S3 bucket without external tools. Usage: s3.bucket.access -name <bucket> -access read,list s3.bucket.access -name <bucket> -access none Supported permissions: read, write, list. The command writes a standard bucket policy with Principal "*" and warns if no anonymous IAM identity exists. * shell: fix anonymous identity hint in s3.bucket.access warning The anonymous identity doesn't need IAM actions — the bucket policy controls what anonymous users can do. * shell: only warn about anonymous identity when write access is set Read and list operations use AuthWithPublicRead which evaluates bucket policies directly without requiring the anonymous identity. Only write operations go through the normal auth flow that needs it. * shell: rewrite s3.bucket.access to use IAM actions instead of bucket policies Replace the bucket policy approach with direct IAM identity actions, matching the s3.configure pattern. The user is auto-created if it does not exist. Usage: s3.bucket.access -name <bucket> -user anonymous -access Read,List s3.bucket.access -name <bucket> -user anonymous -access none s3.bucket.access -name <bucket> -user anonymous Actions are stored as "Action:bucket" on the identity, same as s3.configure -actions=Read -buckets=my-bucket. * shell: return flag parse errors instead of swallowing them * shell: normalize action names case-insensitively in s3.bucket.access Accept actions in any case (read, READ, Read) and normalize to canonical form (Read, Write, List, etc.) before storing. This matches the case-insensitive handling of "none" and avoids confusing rejections. |
||
|
|
7fbdb9b7b7 |
feat(shell): add volume.tier.compact command to reclaim cloud storage space (#8715)
* feat(shell): add volume.tier.compact command to reclaim cloud storage space Adds a new shell command that automates compaction of cloud tier volumes. When files are deleted from remote-tiered volumes, space is not reclaimed on the cloud storage. This command orchestrates: download from remote, compact locally, and re-upload to reclaim deleted space. Closes #8563 * fix: log cleanup errors in compactVolumeOnServer instead of discarding them Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup fails after a compaction or commit failure. * fix: return aggregate error from loop and use regex for collection filter - Track and return error count when one or more volumes fail to compact, so callers see partial failures instead of always getting nil. - Use compileCollectionPattern for -collection in -volumeId mode too, so regex patterns work consistently with the flag description. Empty pattern (no -collection given) matches all collections. |
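The -collection handling described in the second fix, sketched: an empty pattern matches every collection, anything else is compiled as a regular expression (the real command's helper is compileCollectionPattern; this standalone version only shows the semantics).

```go
package shell

import "regexp"

// collectionMatcher returns a predicate for the -collection flag: an empty
// pattern (flag not given) matches every collection, otherwise the value is
// interpreted as a regular expression, consistently across both modes.
func collectionMatcher(pattern string) (func(string) bool, error) {
	if pattern == "" {
		return func(string) bool { return true }, nil
	}
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return re.MatchString, nil
}
```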
||
|
|
81369b8a83 |
improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync
Three main throughput improvements:
1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
(32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
overhead (volume assign, gRPC call, needle write) by 3x (a sizing sketch follows this entry).
2. Configurable concurrency at every layer:
- remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
- remote.cache S3 download concurrency: -downloadConcurrency flag
(default raised from 1 to 5 per chunk)
- filer.sync chunk concurrency: -chunkConcurrency flag (default 32)
3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
downloader was using Concurrency=1, serializing all part downloads
within each chunk. This alone can 5x per-chunk download speed.
The concurrency values flow through the gRPC request chain:
shell command → CacheRemoteObjectToLocalClusterRequest →
FetchAndWriteNeedleRequest → S3 downloader
Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.
Ref #8481
* fix: use full maxMB for chunk size cap and remove loop guard
Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
can always grow past maxChunkSize when needed to stay under 1000
chunks (e.g., extremely large files with small maxMB).
* address review feedback: help text, validation, naming, docs
- Fix help text for -chunkConcurrency and -downloadConcurrency flags
to say "0 = server default" instead of advertising specific numeric
defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.
* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs
- Use ceiling division for chunk count check to avoid overcounting
when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
default 8") so users see the actual values in -h output.
* fix data race on executionErr and use %w for error wrapping
- Protect concurrent writes to executionErr in remote.cache worker
goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
to preserve the error chain for errors.Is/errors.As callers.
|
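A sketch of the adaptive sizing as it reads after the review rounds above, using the numbers the entry cites (~32 chunks target, 5MB historical floor, 1000-chunk hard cap); the exact clamping in the shipped code may differ.

```go
package shell

// adaptiveChunkSize picks a chunk size that targets roughly 32 chunks per
// file, clamped between the historical 5MB floor and the -maxMB cap, then
// raised to ceil(remoteSize/1000) if needed so no file ever exceeds 1000
// chunks, even when maxMB is small.
func adaptiveChunkSize(remoteSize int64, maxMB int64) int64 {
	const minChunkSize = 5 * 1024 * 1024
	const targetChunks = 32
	const maxChunks = 1000

	maxChunkSize := maxMB * 1024 * 1024

	chunkSize := remoteSize / targetChunks
	if chunkSize < minChunkSize {
		chunkSize = minChunkSize
	}
	if chunkSize > maxChunkSize {
		chunkSize = maxChunkSize
	}
	// The 1000-chunk cap wins over maxMB: grow past maxChunkSize if required.
	if floor := (remoteSize + maxChunks - 1) / maxChunks; chunkSize < floor {
		chunkSize = floor
	}
	return chunkSize
}
```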
||
|
|
c4d642b8aa |
fix(ec): gather shards from all disk locations before rebuild (#8633)
* fix(ec): gather shards from all disk locations before rebuild (#8631)
Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:
1. VolumeEcShardsRebuild only looked at a single disk location for shard
files. On multi-disk servers, the existing local shards could be on one
disk while copied shards were placed on another, causing the rebuild to
see fewer shards than actually available.
2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
len(vs.store.Locations) > 0) that was always true, making the
FindFreeLocation fallback dead code. This meant copies always went to
Locations[0] regardless of where existing shards were.
Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
then gathers shard files from other locations via hard links (or
symlinks for cross-device) before rebuilding. Gathered files are
cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
(explicitly set). Otherwise, it prefers the location that already has
the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
error message when not enough shards are found, instead of passing
through to the opaque reedsolomon "too few shards given" error.
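A simplified sketch of the gather-then-check step described in these changes: hard-link shard files from other disk locations into the chosen location, fall back to a symlink when the link crosses devices, and fail with an explicit message when fewer than the data shard count are present. Paths, helper names, and the shard count constant are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const dataShardsCount = 10 // erasure-coding data shard count (illustrative)

// gatherShards links shard files found in otherDirs into targetDir so the
// rebuild sees every locally available shard, then verifies there are
// enough of them to reconstruct the volume. Returns the gathered paths so
// the caller can clean them up after the rebuild.
func gatherShards(targetDir string, otherDirs []string, shardNames []string) ([]string, error) {
	var gathered []string
	for _, dir := range otherDirs {
		for _, name := range shardNames {
			src := filepath.Join(dir, name)
			dst := filepath.Join(targetDir, name)
			if _, err := os.Stat(src); err != nil {
				continue // shard not on this disk
			}
			if _, err := os.Stat(dst); err == nil {
				continue // already present in the target location
			}
			if err := os.Link(src, dst); err != nil {
				// Hard links fail across devices; fall back to a symlink.
				if err = os.Symlink(src, dst); err != nil {
					return gathered, fmt.Errorf("gather shard %s: %w", name, err)
				}
			}
			gathered = append(gathered, dst)
		}
	}
	found := countShardFiles(targetDir, shardNames)
	if found < dataShardsCount {
		return gathered, fmt.Errorf("only %d shard files found, need at least %d to rebuild", found, dataShardsCount)
	}
	return gathered, nil
}

func countShardFiles(dir string, shardNames []string) int {
	n := 0
	for _, name := range shardNames {
		if _, err := os.Stat(filepath.Join(dir, name)); err == nil {
			n++
		}
	}
	return n
}
```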
* fix(ec): update test to match skip behavior for unrepairable volumes
The test expected an error for volumes with insufficient shards, but
commit
5acb4578ab |
Fix ec.rebuild failing on unrepairable volumes instead of skipping (#8632)
* Fix ec.rebuild failing on unrepairable volumes instead of skipping them
When an EC volume has fewer shards than DataShardsCount, ec.rebuild would
return an error and abort the entire operation. Now it logs a warning and
continues rebuilding the remaining volumes.
Fixes #8630
* Remove duplicate volume ID in unrepairable log message
---------
Co-authored-by: Copilot <copilot@github.com>
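In code, the warn-and-skip behavior amounts to something like the following sketch (all names illustrative, with the per-volume rebuild call abstracted away):

```go
package main

import "fmt"

const dataShardsCount = 10 // illustrative

// rebuildAll skips unrepairable volumes instead of aborting ec.rebuild.
// shardCounts maps volume id -> number of shards currently available;
// rebuildOne is a stand-in for the real rebuild call.
func rebuildAll(shardCounts map[uint32]int, rebuildOne func(uint32) error) error {
	for vid, count := range shardCounts {
		if count < dataShardsCount {
			fmt.Printf("warning: ec volume %d has only %d shards, cannot rebuild, skipping\n", vid, count)
			continue // keep going with the remaining volumes
		}
		if err := rebuildOne(vid); err != nil {
			return err
		}
	}
	return nil
}
```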
f3c5ba3cd6 |
feat(filer): add lazy directory listing for remote mounts (#8615)
* feat(filer): add lazy directory listing for remote mounts
Directory listings on remote mounts previously only queried the local
filer store. With lazy mounts the listing was empty; with eager mounts it
went stale over time.
Add on-demand directory listing that fetches from remote and caches
results with a 5-minute TTL:
- Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based,
single-level listing, separate from recursive `Traverse`)
- Implement in S3, GCS, and Azure backends using each platform's
hierarchical listing API
- Add `maybeLazyListFromRemote` to filer: before each directory listing,
check if the directory is under a remote mount with an expired cache,
fetch from remote, persist entries to the local store, then let existing
listing logic run on the populated store
- Use singleflight to deduplicate concurrent requests for the same
directory
- Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced
uploads
- Errors are logged and swallowed (availability over consistency)
* refactor: extract xattr key to constant xattrRemoteListingSyncedAt
* feat: make listing cache TTL configurable per mount via
listing_cache_ttl_seconds
Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf.
When 0 (default), lazy directory listing is disabled for that mount.
When >0, enables on-demand directory listing with the specified TTL.
Expose as -listingCacheTTL flag on remote.mount command.
* refactor: address review feedback for lazy directory listing
- Add context.Context to ListDirectory interface and all implementations
- Capture startTime before remote call for accurate TTL tracking
- Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
- Make maybeLazyListFromRemote return void (errors always swallowed)
- Remove redundant trailing-slash path manipulation in caller
- Update tests to match new signatures
* When an existing entry has Remote != nil, we should merge remote
metadata into it rather than replacing it.
* fix(gcs): wrap ListDirectory iterator error with context
The raw iterator error was returned without bucket/path context, making
it harder to debug. Wrap it consistently with the S3 pattern.
* fix(s3): guard against nil pointer dereference in Traverse and
ListDirectory
Some S3-compatible backends may return nil for LastModified, Size, or
ETag fields. Check for nil before dereferencing to prevent panics.
* fix(filer): remove blanket 2-minute timeout from lazy listing context
Individual SDK operations (S3, GCS, Azure) already have per-request
timeouts and retry policies. The blanket timeout could cut off large
directory listings mid-operation even though individual pages were
succeeding.
* fix(filer): preserve trace context in lazy listing with WithoutCancel
Use context.WithoutCancel(ctx) instead of context.Background() so
trace/span values from the incoming request are retained for distributed
tracing, while still decoupling cancellation.
* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to
files, fix updateDirectoryListingSyncedAt
- Use f.Store.FindEntry instead of f.FindEntry for staleness check and
child lookups to avoid unnecessary lazy-fetch overhead
- Set OS_UID/OS_GID on new file entries for consistency with directories
- In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing
directories instead of CreateEntry to avoid deleteChunksIfNotNew and
NotifyUpdateEvent side effects
* fix(filer): distinguish not-found from store errors in lazy listing
Previously, any error from Store.FindEntry was treated as "not found,"
which could cause entry recreation/overwrite on transient DB failures.
Now check for filer_pb.ErrNotFound explicitly and skip entries or bail
out on real store errors.
* refactor(filer): use errors.Is for ErrNotFound comparisons
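Pulling the pieces of this PR together, a condensed sketch of the lazy-listing flow: consult the per-mount TTL, deduplicate concurrent refreshes with singleflight, decouple cancellation with context.WithoutCancel, and swallow errors so a remote outage only degrades to the locally cached listing. Apart from those ingredients, the type, field, and function names below are assumptions, not the actual filer code.

```go
package main

import (
	"context"
	"log"
	"time"

	"golang.org/x/sync/singleflight"
)

// lazyLister is a stand-in for the filer plumbing: lastSyncedAt, fetchRemote,
// and persist abstract the xattr timestamp, the remote ListDirectory call,
// and the local-store writes.
type lazyLister struct {
	group        singleflight.Group
	ttl          time.Duration // from listing_cache_ttl_seconds; 0 disables lazy listing
	lastSyncedAt func(dir string) time.Time
	fetchRemote  func(ctx context.Context, dir string) ([]string, error)
	persist      func(ctx context.Context, dir string, names []string) error
}

// maybeLazyList refreshes the local store for dir if the cached listing is
// stale. Errors are logged and swallowed: the caller always proceeds to
// list whatever the local store currently holds.
func (l *lazyLister) maybeLazyList(ctx context.Context, dir string) {
	if l.ttl <= 0 || time.Since(l.lastSyncedAt(dir)) < l.ttl {
		return // lazy listing disabled, or cache still fresh
	}
	// WithoutCancel keeps trace/span values but decouples cancellation,
	// so a disconnecting client does not abort the shared refresh.
	ctx = context.WithoutCancel(ctx)
	_, err, _ := l.group.Do(dir, func() (interface{}, error) {
		names, err := l.fetchRemote(ctx, dir)
		if err != nil {
			return nil, err
		}
		return nil, l.persist(ctx, dir, names)
	})
	if err != nil {
		log.Printf("lazy listing %s: %v", dir, err) // availability over consistency
	}
}
```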