1031 Commits

Author SHA1 Message Date
Chris Lu
d57fc67022 fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127)
* fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files

Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was
true, printing "Change volume id for large file is not implemented yet" and
continuing. Because the BFS traversal only looks at top-level
entry.Chunks, sub-chunks referenced inside a manifest were never
considered either. For any file stored as a chunk manifest (large files
go this path), chunks in the source volume stayed put, leaving behind a
few MB of live data that vacuum and volume.deleteEmpty couldn't clean
up.

This change resolves each manifest chunk recursively, moves any
sub-chunk whose volume id is in the merge plan via the existing
moveChunk path, and re-serializes the manifest. If the manifest chunk
itself lives in a source volume, or any sub-chunk moved, the new
manifest blob is uploaded to a freshly assigned file id (the old
needle becomes orphaned and is reclaimed by vacuum like any other
moved chunk).

Fixes #9116.

* address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes

- Call UpdateEntry once per entry after the chunk loop instead of once per
  moved chunk (gemini nit).
- In dry-run mode, mark anySubChanged when a sub-chunk in the plan is
  encountered and return changed=true after printing "rewrite manifest",
  so nested manifests also surface their would-rewrites (gemini nit).
- Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is
  restored even when proto.Marshal fails (coderabbit nit).
- Reject AssignVolume results whose file id lands on a volume that is a
  source in the merge plan, and retry — otherwise the replacement
  manifest could be written to the volume being emptied (coderabbit).
2026-04-17 21:17:51 -07:00
Jaehoon Kim
96af27a131 feat(shell): add fs.distributeChunks command for even chunk distribution (#9117)
* feat(shell): add fs.distributeChunks command for even chunk distribution

Add a new weed shell command that redistributes a file's chunks evenly
across volume server nodes.

Supports three distribution modes via -mode flag:
- primary:     balance chunk ownership across nodes (default)
- replica:     balance both ownership and replica copies
- round-robin: assign chunks by offset order for sequential read
               optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...)

Additional options:
- -nodes=N to target specific number of nodes
- -apply to execute (dry-run by default)

Usage:
  fs.distributeChunks -path=/buckets/file.dat
  fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply
  fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply
  fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply

* fix(shell): improve fs.distributeChunks robustness and code quality

- Propagate flag parse errors instead of swallowing them (return err)
- Handle nil chunk.Fid by falling back to legacy FileId string parsing
- Simplify node membership check using slices.Contains

* fix(shell): fix dead round-robin print loop in fs.distributeChunks

The loop was computing targetNode with sc.index%totalNodes (original
chunk index) instead of the sequential position, and discarding it via
_ = targetNode without printing anything. Replace with a correct loop
using pos%totalNodes and actually print the first 12 node assignments.

* fix(shell): compute replication/collection per-chunk in fs.distributeChunks

Previously replication and collection were derived once from chunks[0]
and reused for all moves, causing wrong volume placement for chunks
belonging to different volumes or collections. Now each chunk looks up
its own volumeInfoMap entry immediately before calling operation.Assign.

* fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks

When the master returns an Auth token in the Assign response, use it
directly for the upload instead of generating a new JWT from the local
viper signing key. Fall back to local key generation only when Auth is
empty, matching the pattern used by other upload paths.

* fix(shell): add timeout and error handling to delete requests in fs.distributeChunks

The delete loop was ignoring http.NewRequest errors and had no timeout,
risking a nil-request panic or indefinite block. Replace with
http.NewRequestWithContext and a 30s timeout, handle request creation
errors by incrementing deleteFailCount, and cancel the context
immediately after Do returns.

* feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup

Sequential chunk moves are a bottleneck for large LLM model files with
hundreds or thousands of chunks. Use ErrorWaitGroup with
DefaultMaxParallelization (10) to run download/assign/upload concurrently.
Guard movedRecords appends, chunk.Fid updates, and writer output with a
mutex. Individual chunk failures are non-fatal and logged inline; only
successfully moved chunks are included in the metadata update.

* fix(shell): try all replica URLs on download in fs.distributeChunks

Previously only the first volume server URL was attempted, causing chunk
moves to fail if that replica was unreachable. Now iterates through all
URLs returned by LookupVolumeServerUrl and stops at the first success.

* refactor(shell): apply extract method pattern to fs.distributeChunks

Do() was a single ~615-line function. Break it into focused helpers:
- lookupFileEntry: filer entry lookup
- validateChunks: chunk manifest guard
- collectVolumeTopology: master topology query + ownership mapping
- buildDistributionCounts: chunk→node mapping and owner/copy tallies
- selectActiveNodes: target node selection
- printCurrentDistribution: per-node distribution table
- planDistribution: mode-switch planning (primary/replica/round-robin)
- printRedistributionPlan: before/after plan table
- relevantNodes: active-or-occupied node filter

Do() is now ~100 lines of orchestration; each helper has a single
clear responsibility.

* test(shell): add unit tests for fs.distributeChunks algorithms

Cover all three distribution modes and supporting helpers:
- shortName, relevantNodes
- computeOwnerTarget (even/uneven split, inactive node drain)
- buildDistributionCounts (normal + nil Fid fallback)
- selectActiveNodes (all nodes / limited count)
- planOwnerMoves (imbalanced → balanced, already balanced)
- planDistribution primary (chunks balanced, no-op when even)
- planDistribution round-robin (offset ordering, correct assignment)
- planDistribution replica (owner + copy balancing)
- printRedistributionPlan (output format)

* fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks

Download requests had no per-request timeout, unlike delete operations
which already use 30s. Replace readUrl() calls with inline
http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume
server cannot block a goroutine indefinitely during redistribution.

* fix(shell): remove redundant deleteOldChunks in fs.distributeChunks

filer.UpdateEntry already calls deleteChunksIfNotNew internally, which
computes the diff between old and new entry chunks and deletes the ones
no longer referenced. Our explicit deleteOldChunks was racing with this
filer-side cleanup, causing spurious 404 warnings on ~75% of deletes.

Remove deleteOldChunks, movedChunkRecord type, and reduce
executeChunkMoves return type to (int, error) for the moved count.

* fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks

chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with
a FileId string instead of a Fid struct, causing them to be skipped in
the replica balancing loop and looked up incorrectly in volumeInfoMap.

Introduce chunkVolumeId() that uses Fid when present and falls back to
parsing the legacy FileId string, matching the logic in
buildDistributionCounts. Apply it in the replica-mode copies loop and
in executeChunkMoves' replication/collection lookup.

* fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks

chunkVolumeId(chunk) was being called to look up replication/collection
after oldFid had already been parsed and validated. Use oldFid.VolumeId
directly to avoid redundant parsing and guarantee the correct volume ID
regardless of whether chunk.Fid is nil.

* fix(shell): improve correctness and robustness in fs.distributeChunks

- Buffer download body before upload so dlCtx timeout only covers the
  GET request; upload runs with context.Background() via bytes.NewReader
- Replace 'before, after := strings.Cut(...)' + '_ = before' with '_'
  as the first return value directly
- Clone copiesCount before replica planner mutates it, keeping the
  caller's map immutable
- Add nil-entry guard after filer LookupEntry to prevent panic on
  unexpected nil response

* feat(shell): support chunk manifests in fs.distributeChunks

Large files stored as chunk manifests were previously rejected. Resolve
manifests up front via filer.ResolveChunkManifest, redistribute the
underlying data chunks, then re-pack through filer.MaybeManifestize
before UpdateEntry. The filer's MinusChunks resolves manifests on both
sides of the diff, so old manifest and inner data chunks are GC'd
automatically.

* fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature

Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-17 21:09:36 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Chris Lu
10e7f0f2bc fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy

Instead of erroring when the user already exists, the command now
creates the policy and attaches it to the existing user via UpdateUser.
Credentials are only generated and displayed for newly created users.

* fix(shell): skip duplicate policy attachment in s3.user.provision

Check if the policy is already attached before appending and calling
UpdateUser, making repeated runs idempotent.

* fix(shell): generate service account ID in s3.serviceaccount.create

The command built a ServiceAccount proto without setting Id, which was
rejected by credential.ValidateServiceAccountId on any real store. Now
generates sa:<parent>:<uuid> matching the format used by the admin UI.

* test(s3): integration tests for s3.* shell commands

Adds TestShell* integration tests covering ~40 previously untested
shell commands: user, accesskey, group, serviceaccount, anonymous,
bucket, policy.attach/detach, config.show, and iam.export/import.

Switches the test cluster's credential store from memory to filer_etc
because the memory store silently drops groups and service accounts
in LoadConfiguration/SaveConfiguration.

* fix(shell): rollback policy on key generation failure in s3.user.provision

If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after
the policy was persisted, the policy would be left orphaned. Extracts
the rollback logic into a local closure and invokes it on all failure
paths after policy creation for consistency.

* address PR review feedback for s3 shell tests and serviceaccount

- s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for
  the service account UUID instead of 4 bytes to eliminate collision risk
- s3.serviceaccount.create: print the actual ID and drop the outdated
  "server-assigned" note (the ID is now client-generated)
- tests: guard createdAK in accesskey rotate/delete subtests so sibling
  failures don't run invalid CLI calls
- tests: requireContains/requireNotContains use t.Fatalf to fail fast
- tests: Provision subtest asserts the "Attached policy" message on the
  second provision call for an existing user
- tests: update extractServiceAccountID comment example to match the
  sa:<parent>:<uuid> format
- tests: drop redundant saID empty-check (extractServiceAccountID fatals)

* test(s3): use t.Fatalf for precondition check in serviceaccount test
2026-04-11 22:30:51 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
b1265de78f feat(shell): add group management commands (#8993)
* feat(shell): add group management commands

Add weed shell commands for IAM group management:
- s3.group.create -name <group>
- s3.group.delete -name <group>
- s3.group.list
- s3.group.show -name <group>
- s3.group.add-user -group <group> -user <user>
- s3.group.remove-user -group <group> -user <user>

All commands use GetConfiguration/PutConfiguration gRPC pattern,
consistent with existing shell commands like s3.user.list.

* fix: add nil check for Configuration in group shell commands

Guard against nil Configuration response from GetConfiguration
gRPC call to prevent potential panics. (Gemini review)
2026-04-08 14:03:26 -07:00
Chris Lu
7f3908297c fix(weed/shell): suppress prompt when piped (#8990)
* fix(weed/shell): suppress prompt when stdin or stdout is not a TTY

When piping weed shell output (e.g. `echo "s3.user.list" | weed shell | jq`),
the "> " prompt was written to stdout, breaking JSON parsers.

`liner.TerminalSupported()` only checks platform support, not whether
stdin/stdout are actual TTYs. Add explicit checks using `term.IsTerminal()`
so the shell falls back to the non-interactive scanner path when piped.

Fixes #8962

* fix(weed/shell): suppress informational logs unless -verbose is set

Suppress glog info messages and connection status logs on stderr by
default. Add -verbose flag to opt in to the previous noisy behavior.
This keeps piped output clean (e.g. `echo "s3.user.list" | weed shell | jq`).

* fix(weed/shell): defer liner init until after TTY check

Move liner.NewLiner() and related setup (history, completion, interrupt
handler) inside the interactive block so the terminal is not put into
raw mode when stdout is redirected. Previously, liner would set raw mode
unconditionally at startup, leaving the terminal broken when falling
back to the scanner path.

Addresses review feedback from gemini-code-assist.

* refactor(weed/shell): consolidate verbose logging into single block

Group all verbose stderr output within one conditional block instead of
scattering three separate if-verbose checks around the filer logic.

Addresses review feedback from gemini-code-assist.

* fix(weed/shell): clean up global liner state and suppress logtostderr

- Set line=nil after Close() to prevent stale state if RunShell is
  called again (e.g. in tests)
- Add nil check in OnInterrupt handler for non-interactive sessions
- Also set logtostderr=false when not verbose, in case it was enabled

Addresses review feedback from gemini-code-assist.

* refactor(weed/shell): make liner state local to eliminate data race

Replace the package-level `line` variable with a local variable in
RunShell, passing it explicitly to setCompletionHandler, loadHistory,
and saveHistory. This eliminates a data race between the OnInterrupt
goroutine and the defer that previously set the global to nil.

Addresses review feedback from gemini-code-assist.

* rename(weed/shell): rename -verbose flag to -debug

Avoid conflict with -verbose flags already used by individual shell
commands (e.g. ec.encode, volume.fix.replication, volume.check.disk).
2026-04-08 13:07:15 -07:00
Chris Lu
74905c4b5d shell: s3.* commands always output JSON, connection messages to stderr (#8976)
* shell: s3.* commands output JSON, connection messages to stderr

All s3.user.* and s3.policy.attach|detach commands now output structured
JSON to stdout instead of human-readable text:

- s3.user.create: {"name","access_key"} (secret key to stderr only)
- s3.user.list: [{name,status,policies,keys}]
- s3.user.show: {name,status,source,account,policies,credentials,...}
- s3.user.delete: {"name"}
- s3.user.enable/disable: {"name","status"}
- s3.policy.attach/detach: {"policy","user"}

Connection startup messages (master/filer) moved to stderr so they
don't pollute structured output when piping.

Closes #8962 (partial — covers merged s3.user/policy commands).

* shell: fix secret leak, duplicate JSON output, and non-interactive prompt

- s3.user.create: only echo secret key to stderr when auto-generated,
  never echo caller-supplied secrets
- s3.user.enable/disable: fix duplicate JSON output — remove inner
  write in early-return path, keep single write site after gRPC call
- shell_liner: use bufio.Scanner when stdin is not a terminal instead
  of liner.Prompt, suppressing the "> " prompt in piped mode

* shell: check scanner error, idempotent enable output, history errors to stderr

- Check scanner.Err() after non-interactive input loop to surface read errors
- s3.user.enable: always emit JSON regardless of current state (idempotent)
- saveHistory: write error messages to stderr instead of stdout
2026-04-07 16:27:21 -07:00
Chris Lu
fb0573ffc4 shell: rename -force to -apply in s3.iam.import for consistency 2026-04-07 14:17:07 -07:00
Chris Lu
d50889002b shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands (#8956)
* shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands

Add import/export, configuration summary, and a convenience provisioning
command:

- s3.iam.export: dump full IAM state as JSON (stdout or file)
- s3.iam.import: replace IAM state from a JSON file
- s3.config.show: human-readable summary (users, policies, service
  accounts, groups with status and counts)
- s3.user.provision: one-step user+policy+credentials creation for
  common readonly/readwrite/admin roles

Hide legacy commands from help listing:
- s3.configure: still works but hidden from help output
- s3.bucket.access: still works but hidden from help output

Both hidden commands remain fully functional for existing scripts.

Also adds a Hidden command tag and filters it from printGenericHelp.

* shell: address review feedback for s3.iam.*, s3.config.show, s3.user.provision

- Simplify joinMax using strings.Join
- Fix rolePolicies: remove s3:ListBucket from object-level actions
  (already covered by bucket-level statement)
- Fix admin role: grant s3:* on bucket resource too
- Return flag parse errors instead of swallowing them

* shell: address missed review feedback for PR 3

- s3.iam.import: require -force flag for destructive IAM overwrite
- s3.config.show: add nil guard for resp.Configuration
- s3.user.provision: check if user exists before creating policy
- s3.user.provision: reject wildcard bucket names (* ?)

* shell: distinguish NotFound from transient errors in provision, use %w wrapping

- s3.user.provision: check gRPC status code on GetUser error — only
  proceed on NotFound, abort on transient/network errors
- s3.iam.import: use %w for error wrapping to preserve error chains,
  wrap PutConfiguration error with context

* shell: remove duplicate joinMax after PR 8954 merge

command_s3_helpers.go defined joinMax which is already in
command_s3_user_list.go from the merged PR 8954.

* shell: restrict export file permissions, rollback policy on user create failure

- s3.iam.export: use os.OpenFile with mode 0600 instead of os.Create
  to protect exported credentials from other users
- s3.user.provision: rollback the created policy if CreateUser fails,
  with a warning if the rollback itself fails
2026-04-07 14:10:15 -07:00
Chris Lu
45bf3ad058 shell: add s3.user.* and s3.policy.attach|detach commands (#8954)
* shell: add s3.user.* and s3.policy.attach|detach commands

Add focused IAM shell commands following a noun-verb model:

- s3.user.create: create user with auto-generated or explicit credentials
- s3.user.list: tabular listing with status, policies, key count
- s3.user.show: detailed user view (status, source, policies, credentials)
- s3.user.delete: delete a user
- s3.user.enable: enable a disabled user
- s3.user.disable: disable a user (preserves credentials and policies)
- s3.policy.attach: attach a named policy to a user
- s3.policy.detach: detach a policy from a user

These commands are thin wrappers over the existing IAM gRPC service,
producing human-readable output instead of raw protobuf text.

This is part of a larger effort to replace the monolithic s3.configure
command with a composable set of single-purpose commands.

* shell: address review feedback for s3.user.* and s3.policy.attach|detach

- Return flag parse errors instead of swallowing them (all commands)
- Use GetConfiguration instead of N+1 GetUser calls in s3.user.list
- Add nil check for resp.Identity in s3.user.show
- Fix GetPolicy error masking in s3.policy.attach (wrap original error)
- Simplify joinMax using strings.Join

* shell: add nil identity guards and wrap gRPC errors

- Add nil check for resp.Identity in policy_attach, policy_detach,
  user_enable, user_disable
- Wrap GetUser errors with user context for better diagnostics
2026-04-07 11:26:57 -07:00
Chris Lu
d123a2768b shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands (#8955)
* shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands

Add credential, anonymous access, and service account management commands:

Access key commands:
- s3.accesskey.create: add credentials to an existing user
- s3.accesskey.list: list access keys for a user (key ID + status)
- s3.accesskey.delete: remove a specific access key
- s3.accesskey.rotate: atomic create-new + delete-old key rotation

Anonymous access commands:
- s3.anonymous.set: set/remove public access on a bucket
- s3.anonymous.get: show anonymous access for a bucket
- s3.anonymous.list: list all buckets with anonymous access

Service account commands:
- s3.serviceaccount.create: create with optional action subset and expiry
- s3.serviceaccount.list: tabular listing, optionally filtered by parent
- s3.serviceaccount.show: detailed view of a service account
- s3.serviceaccount.delete: remove a service account

These replace the credential and anonymous portions of the monolithic
s3.configure and s3.bucket.access commands.

* shell: address review feedback for s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.*

- Return flag parse errors instead of swallowing them (all commands)
- Add action validation in s3.anonymous.set (Read, Write, List, Tagging, Admin)
- Fix s3.serviceaccount.create output: note to use list for server-assigned ID
  since CreateServiceAccountResponse does not return the ID

* shell: fix bucket matching and action validation in s3.anonymous.*

- Use SplitN instead of HasSuffix for bucket name matching to avoid
  false positives when one bucket name is a suffix of another
- Make action validation case-insensitive with canonical normalization

* shell: fix nil panics, dedup actions, validate service account actions

- Fix nil-pointer panic in getOrCreateAnonymousUser when GetUser returns
  err==nil with nil Identity (status.FromError(nil) returns nil status)
- Add nil Identity guards in s3.anonymous.get and s3.anonymous.list
- Deduplicate action values in s3.anonymous.set (e.g. -access Read,Read)
- Add action validation in s3.serviceaccount.create with case normalization

* shell: dedup actions and reject negative expiry in s3.serviceaccount.create

- Deduplicate -actions values (e.g. Read,read,Read produces one entry)
- Reject negative -expiry values instead of silently treating as no expiration
2026-04-07 11:20:15 -07:00
Chris Lu
0fed72d95a volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas

When -toReplication is specified, volume.tier.move now creates all
required replicas on the destination tier before deleting old replicas.
This closes the data-loss window where only one copy existed on the
target tier while awaiting volume.fix.replication.

If replication fulfillment fails, old replicas are preserved and marked
writable so the volume remains accessible.

Also extracts replicateVolumeToServer and configureVolumeReplication
helpers to reduce duplication across volume.tier.move and
volume.fix.replication.

Fixes #8937

* volume.tier.move: always fulfill replication before deleting old replicas

When -toReplication is specified, use that replication setting.
Otherwise, read the volume's existing replication from the super block.
In both cases, all required replicas are created on the destination
tier before old replicas are deleted.

If replication fulfillment fails (e.g. not enough destination nodes),
old replicas are preserved and marked writable so no data is lost.

* volume.tier.move: address review feedback on ensureReplicationFulfilled

- Add 5s delay before re-collecting topology to allow master heartbeat
  propagation after the move
- Add nil guard for targetTierReplicas to prevent panic if the moved
  replica is not yet visible in the topology
- Treat configureVolumeReplication failure as a hard error instead of a
  warning, so the rollback logic preserves old replicas

* volume.tier.move: harden replication config error handling

- Make configureVolumeReplication failure on the primary moved replica a
  hard error that aborts the move, instead of logging and continuing
- Configure replication metadata on all existing target-tier replicas
  (not just newly created ones) when -toReplication is specified
- Deletion of old replicas cannot affect new replicas since the
  locations list only contains pre-move servers (verified, no change)

* volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery

Fix 1: The cleanup loop now preserves pre-existing target-tier replicas
that ensureReplicationFulfilled counted toward the replication target.
Previously, a mixed-tier volume with an existing replica on the target
tier could have that replica deleted right after being counted as
fulfilled, leaving the volume under-replicated.

ensureReplicationFulfilled now returns a preserveServers set that the
deletion loop checks before removing any old replica.

Fix 2: Failure paths after LiveMoveVolume (which deletes the source
replica) now use restoreSurvivingReplicasWritable instead of
markVolumeReplicasWritable. The old helper stopped on first error, so
attempting to mark the already-deleted source writable would prevent
all surviving replicas from being restored. The new helper skips the
deleted source and continues through all remaining locations, logging
per-replica errors instead of aborting.

* volume.tier.move: mark preserved replicas writable, skip nodes with existing volume

Fix 1: Preserved pre-existing target-tier replicas were left read-only
after the move completed. They were marked read-only at the start
(along with all other replicas) but never restored since the old code
deleted them. Now they are explicitly marked writable before cleanup.

Fix 2: The fulfillment loop could pick a candidate node that already
hosts this volume on a different disk type, causing a VolumeCopy
conflict. Added a guard that skips any node already hosting the volume
(on any disk) before attempting replication.
2026-04-06 14:55:37 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
qzh
4c72512ea2 fix(shell): avoid marking skipped or unplaced volumes as fixed (#8866)
* fix(s3api): fix AWS Signature V2 format and validation

* fix(s3api): Skip space after "AWS" prefix (+1 offset)

* test(s3api): add unit tests for Signature V2 authentication fix

* fix(s3api): simply comparing signatures

* validation for the colon extraction in expectedAuth

* fix(shell): avoid marking skipped or unplaced volumes as fixed

---------

Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2026-04-01 01:20:25 -07:00
Chris Lu
af68449a26 Process .ecj deletions during EC decode and vacuum decoded volume (#8863)
* Process .ecj deletions during EC decode and vacuum decoded volume (#8798)

When decoding EC volumes back to normal volumes, deletions recorded in
the .ecj journal were not being applied before computing the dat file
size or checking for live needles. This caused the decoded volume to
include data for deleted files and could produce false positives in the
all-deleted check.

- Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in
  VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first
- Vacuum the decoded volume after mounting in ec.decode to compact out
  deleted needle data from the .dat file
- Add integration tests for decoding with non-empty .ecj files

* storage: add offline volume compaction helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: compact decoded volumes before deleting shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: address PR review comments

- Fall back to data directory for .ecx when idx directory lacks it
- Make compaction failure non-fatal during EC decode
- Remove misleading "buffer: 10%" from space check error message

* ec: collect .ecj from all shard locations during decode

Each server's .ecj only contains deletions for needles whose data
resides in shards held by that server. Previously, sources with no
new data shards to contribute were skipped entirely, losing their
.ecj deletion entries. Now .ecj is always appended from every shard
location so RebuildEcxFile sees the full set of deletions.

* ec: add integration tests for .ecj collection during decode

TestEcDecodePreservesDeletedNeedles: verifies that needles deleted
via VolumeEcBlobDelete are excluded from the decoded volume.

TestEcDecodeCollectsEcjFromPeer: regression test for the fix in
collectEcShards. Deletes a needle only on a peer server that holds
no new data shards, then verifies the deletion survives decode via
.ecj collection.

* ec: address review nits in decode and tests

- Remove double error wrapping in mountDecodedVolume
- Check VolumeUnmount error in peer ecj test
- Assert 404 specifically for deleted needles, fail on 5xx

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-01 01:15:26 -07:00
Chris Lu
f256002d0b fix ec.balance failing to rebalance when all nodes share all volumes (#8796)
* fix ec.balance failing to rebalance when all nodes share all volumes (#8793)

Two bugs in doBalanceEcRack prevented rebalancing:

1. Sorting by freeEcSlot instead of actual shard count caused incorrect
   empty/full node selection when nodes have different total capacities.

2. The volume-level check skipped any volume already present on the
   target node. When every node has a shard of every volume (common
   with many EC volumes across N nodes with N shards each), no moves
   were possible.

Fix: sort by actual shard count, and use a two-pass approach - first
prefer moving shards of volumes not on the target (best diversity),
then fall back to moving specific shard IDs not yet on the target.

* add test simulating real cluster topology from issue #8793

Uses the actual node addresses and mixed max capacities (80 vs 33)
from the reporter's 14-node cluster to verify ec.balance correctly
rebalances with heterogeneous node sizes.

* fix pass comments to match 0-indexed loop variable
2026-03-27 11:14:10 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Chris Lu
ccc662b90b shell: add s3.bucket.access command for anonymous access policy (#8774)
* shell: add s3.bucket.access command for anonymous access policy (#7738)

Add a new weed shell command to view or change the anonymous access
policy of an S3 bucket without external tools.

Usage:
  s3.bucket.access -name <bucket> -access read,list
  s3.bucket.access -name <bucket> -access none

Supported permissions: read, write, list. The command writes a standard
bucket policy with Principal "*" and warns if no anonymous IAM identity
exists.

* shell: fix anonymous identity hint in s3.bucket.access warning

The anonymous identity doesn't need IAM actions — the bucket policy
controls what anonymous users can do.

* shell: only warn about anonymous identity when write access is set

Read and list operations use AuthWithPublicRead which evaluates bucket
policies directly without requiring the anonymous identity. Only write
operations go through the normal auth flow that needs it.

* shell: rewrite s3.bucket.access to use IAM actions instead of bucket policies

Replace the bucket policy approach with direct IAM identity actions,
matching the s3.configure pattern. The user is auto-created if it does
not exist.

Usage:
  s3.bucket.access -name <bucket> -user anonymous -access Read,List
  s3.bucket.access -name <bucket> -user anonymous -access none
  s3.bucket.access -name <bucket> -user anonymous

Actions are stored as "Action:bucket" on the identity, same as
s3.configure -actions=Read -buckets=my-bucket.

* shell: return flag parse errors instead of swallowing them

* shell: normalize action names case-insensitively in s3.bucket.access

Accept actions in any case (read, READ, Read) and normalize to canonical
form (Read, Write, List, etc.) before storing. This matches the
case-insensitive handling of "none" and avoids confusing rejections.
2026-03-25 23:09:53 -07:00
Chris Lu
7fbdb9b7b7 feat(shell): add volume.tier.compact command to reclaim cloud storage space (#8715)
* feat(shell): add volume.tier.compact command to reclaim cloud storage space

Adds a new shell command that automates compaction of cloud tier volumes.
When files are deleted from remote-tiered volumes, space is not reclaimed
on the cloud storage. This command orchestrates: download from remote,
compact locally, and re-upload to reclaim deleted space.

Closes #8563

* fix: log cleanup errors in compactVolumeOnServer instead of discarding them

Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup
fails after a compaction or commit failure.

* fix: return aggregate error from loop and use regex for collection filter

- Track and return error count when one or more volumes fail to compact,
  so callers see partial failures instead of always getting nil.
- Use compileCollectionPattern for -collection in -volumeId mode too, so
  regex patterns work consistently with the flag description. Empty
  pattern (no -collection given) matches all collections.
2026-03-20 23:52:12 -07:00
Chris Lu
81369b8a83 improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync

Three main throughput improvements:

1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
   instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
   (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
   overhead (volume assign, gRPC call, needle write) by 3x.

2. Configurable concurrency at every layer:
   - remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
   - remote.cache S3 download concurrency: -downloadConcurrency flag
     (default raised from 1 to 5 per chunk)
   - filer.sync chunk concurrency: -chunkConcurrency flag (default 32)

3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
   downloader was using Concurrency=1, serializing all part downloads
   within each chunk. This alone can 5x per-chunk download speed.

The concurrency values flow through the gRPC request chain:
  shell command → CacheRemoteObjectToLocalClusterRequest →
  FetchAndWriteNeedleRequest → S3 downloader

Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.

Ref #8481

* fix: use full maxMB for chunk size cap and remove loop guard

Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
  unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
  can always grow past maxChunkSize when needed to stay under 1000
  chunks (e.g., extremely large files with small maxMB).

* address review feedback: help text, validation, naming, docs

- Fix help text for -chunkConcurrency and -downloadConcurrency flags
  to say "0 = server default" instead of advertising specific numeric
  defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
  range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
  during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
  ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.

* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs

- Use ceiling division for chunk count check to avoid overcounting
  when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
  at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
  falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
  the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
  default 8") so users see the actual values in -h output.

* fix data race on executionErr and use %w for error wrapping

- Protect concurrent writes to executionErr in remote.cache worker
  goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
  to preserve the error chain for errors.Is/errors.As callers.
2026-03-17 16:49:56 -07:00
Chris Lu
c4d642b8aa fix(ec): gather shards from all disk locations before rebuild (#8633)
* fix(ec): gather shards from all disk locations before rebuild (#8631)

Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:

1. VolumeEcShardsRebuild only looked at a single disk location for shard
   files. On multi-disk servers, the existing local shards could be on one
   disk while copied shards were placed on another, causing the rebuild to
   see fewer shards than actually available.

2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
   len(vs.store.Locations) > 0) that was always true, making the
   FindFreeLocation fallback dead code. This meant copies always went to
   Locations[0] regardless of where existing shards were.

Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
  then gathers shard files from other locations via hard links (or
  symlinks for cross-device) before rebuilding. Gathered files are
  cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
  (explicitly set). Otherwise, it prefers the location that already has
  the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
  error message when not enough shards are found, instead of passing
  through to the opaque reedsolomon "too few shards given" error.

* fix(ec): update test to match skip behavior for unrepairable volumes

The test expected an error for volumes with insufficient shards, but
commit 5acb4578a changed unrepairable volumes to be skipped with a log
message instead of returning an error. Update the test to verify the
skip behavior and log output.

* fix(ec): address PR review comments

- Add comment clarifying DiskId=0 means "not specified" (protobuf default),
  callers must use DiskId >= 1 to target a specific disk.
- Log warnings on cleanup failures for gathered shard links.

* fix(ec): read shard files from other disks directly instead of linking

Replace the hard link / symlink gathering approach with passing
additional search directories into RebuildEcFiles. The rebuild
function now opens shard files directly from whichever disk they
live on, avoiding filesystem link operations and cleanup.

RebuildEcFiles and RebuildEcFilesWithContext gain a variadic
additionalDirs parameter (backward compatible with existing callers).

* fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment

* fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations

- generateMissingEcFiles: two-pass approach — first discover present/missing
  shards and check reconstructability, only then create output files. This
  avoids leaving behind empty truncated shard files when there are too few
  shards to rebuild.

- VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations.
  A location with an .ecx file but no shard files (all shards on other disks)
  is now a valid rebuild candidate instead of being silently skipped.

* fix(ec): select ecx-only location as rebuildLocation when none chosen yet

When rebuildLocation is nil and a location has hasEcx=true but
existingShardCount=0 (all shards on other disks), the condition
0 > 0 was false so it was never promoted to rebuildLocation.
Add rebuildLocation == nil to the predicate so the first location
with an .ecx file is always selected as a candidate.
2026-03-14 20:59:47 -07:00
Chris Lu
5acb4578ab Fix ec.rebuild failing on unrepairable volumes instead of skipping (#8632)
* Fix ec.rebuild failing on unrepairable volumes instead of skipping them

When an EC volume has fewer shards than DataShardsCount, ec.rebuild would
return an error and abort the entire operation. Now it logs a warning and
continues rebuilding the remaining volumes.

Fixes #8630

* Remove duplicate volume ID in unrepairable log message

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-14 16:18:29 -07:00
Chris Lu
f3c5ba3cd6 feat(filer): add lazy directory listing for remote mounts (#8615)
* feat(filer): add lazy directory listing for remote mounts

Directory listings on remote mounts previously only queried the local
filer store. With lazy mounts the listing was empty; with eager mounts
it went stale over time.

Add on-demand directory listing that fetches from remote and caches
results with a 5-minute TTL:

- Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based,
  single-level listing, separate from recursive `Traverse`)
- Implement in S3, GCS, and Azure backends using each platform's
  hierarchical listing API
- Add `maybeLazyListFromRemote` to filer: before each directory listing,
  check if the directory is under a remote mount with an expired cache,
  fetch from remote, persist entries to the local store, then let existing
  listing logic run on the populated store
- Use singleflight to deduplicate concurrent requests for the same directory
- Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads
- Errors are logged and swallowed (availability over consistency)

* refactor: extract xattr key to constant xattrRemoteListingSyncedAt

* feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds

Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf.
When 0 (default), lazy directory listing is disabled for that mount.
When >0, enables on-demand directory listing with the specified TTL.

Expose as -listingCacheTTL flag on remote.mount command.

* refactor: address review feedback for lazy directory listing

- Add context.Context to ListDirectory interface and all implementations
- Capture startTime before remote call for accurate TTL tracking
- Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
- Make maybeLazyListFromRemote return void (errors always swallowed)
- Remove redundant trailing-slash path manipulation in caller
- Update tests to match new signatures

* When an existing entry has Remote != nil, we should merge remote metadata   into it rather than replacing it.

* fix(gcs): wrap ListDirectory iterator error with context

The raw iterator error was returned without bucket/path context,
making it harder to debug. Wrap it consistently with the S3 pattern.

* fix(s3): guard against nil pointer dereference in Traverse and ListDirectory

Some S3-compatible backends may return nil for LastModified, Size, or
ETag fields. Check for nil before dereferencing to prevent panics.

* fix(filer): remove blanket 2-minute timeout from lazy listing context

Individual SDK operations (S3, GCS, Azure) already have per-request
timeouts and retry policies. The blanket timeout could cut off large
directory listings mid-operation even though individual pages were
succeeding.

* fix(filer): preserve trace context in lazy listing with WithoutCancel

Use context.WithoutCancel(ctx) instead of context.Background() so
trace/span values from the incoming request are retained for
distributed tracing, while still decoupling cancellation.

* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt

- Use f.Store.FindEntry instead of f.FindEntry for staleness check and
  child lookups to avoid unnecessary lazy-fetch overhead
- Set OS_UID/OS_GID on new file entries for consistency with directories
- In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing
  directories instead of CreateEntry to avoid deleteChunksIfNotNew and
  NotifyUpdateEvent side effects

* fix(filer): distinguish not-found from store errors in lazy listing

Previously, any error from Store.FindEntry was treated as "not found,"
which could cause entry recreation/overwrite on transient DB failures.
Now check for filer_pb.ErrNotFound explicitly and skip entries or
bail out on real store errors.

* refactor(filer): use errors.Is for ErrNotFound comparisons
2026-03-13 09:36:54 -07:00
Peter Dodd
0e570d6a8f feat(remote.mount): add -metadataStrategy flag to control metadata caching (#8568)
* feat(remote): add -noSync flag to skip upfront metadata pull on mount

Made-with: Cursor

* refactor(remote): split mount setup from metadata sync

Extract ensureMountDirectory for create/validate; call pullMetadata
directly when sync is needed. Caller controls sync step for -noSync.

Made-with: Cursor

* fix(remote): validate mount root when -noSync so bad bucket/creds fail fast

When -noSync is used, perform a cheap remote check (ListBuckets and
verify bucket exists) instead of skipping all remote I/O. Invalid
buckets or credentials now fail at mount time.

Made-with: Cursor

* test(remote): add TestRemoteMountNoSync for -noSync mount and persisted mapping

Made-with: Cursor

* test(remote): assert no upfront metadata after -noSync mount

After remote.mount -noSync, run fs.ls on the mount dir and assert empty
listing so the test fails if pullMetadata was invoked eagerly.

Made-with: Cursor

* fix(remote): propagate non-ErrNotFound lookup errors in ensureMountDirectory

Return lookupErr immediately for any LookupDirectoryEntry failure that
is not filer_pb.ErrNotFound, so only the not-found case creates the
entry and other lookup failures are reported to the caller.

Made-with: Cursor

* fix(remote): use errors.Is for ErrNotFound in ensureMountDirectory

Replace fragile strings.Contains(lookupErr.Error(), ...) with
errors.Is(lookupErr, filer_pb.ErrNotFound) before calling CreateEntry.

Made-with: Cursor

* fix(remote): use LookupEntry so ErrNotFound is recognised after gRPC

Raw gRPC LookupDirectoryEntry returns a status error, not the sentinel,
so errors.Is(lookupErr, filer_pb.ErrNotFound) was always false. Use
filer_pb.LookupEntry which normalises not-found to ErrNotFound so the
mount directory is created when missing.

Made-with: Cursor

* test(remote): ignore weed shell banner in TestRemoteMountNoSync fs.ls count

Exclude master/filer and prompt lines from entry count so the assertion
checks only actual fs.ls output for empty -noSync mount.

Made-with: Cursor

* fix(remote.mount): use 0755 for mount dir, document bucket-less early return

Made-with: Cursor

* feat(remote.mount): replace -noSync with -metadataStrategy=lazy|eager

- Add -metadataStrategy flag (eager default, lazy skips upfront metadata pull)
- Accept lazy/eager case-insensitively; reject invalid values with clear error
- Rename TestRemoteMountNoSync to TestRemoteMountMetadataStrategyLazy
- Add TestRemoteMountMetadataStrategyEager and TestRemoteMountMetadataStrategyInvalid

Made-with: Cursor

* fix(remote.mount): validate strategy and remote before creating mount directory

Move strategy validation and validateMountRoot (lazy path) before
ensureMountDirectory so that invalid strategies or bad bucket/credentials
fail without leaving orphaned directory entries in the filer.

* refactor(remote.mount): remove unused remote param from ensureMountDirectory

The remote *RemoteStorageLocation parameter was left over from the old
syncMetadata signature. Only remoteConf.Name is used inside the function.

* doc(remote.mount): add TODO for HeadBucket-style validation

validateMountRoot currently lists all buckets to verify one exists.
Note the need for a targeted BucketExists method in the interface.

* refactor(remote.mount): use MetadataStrategy type and constants

Replace raw string comparisons with a MetadataStrategy type and
MetadataStrategyEager/MetadataStrategyLazy constants for clarity
and compile-time safety.

* refactor(remote.mount): rename MetadataStrategy to MetadataCacheStrategy

More precisely describes the purpose: controlling how metadata is
cached from the remote, not metadata handling in general.

* fix(remote.mount): remove validateMountRoot from lazy path

Lazy mount's purpose is to skip remote I/O. Validating via ListBuckets
contradicts that, especially on accounts with many buckets. Invalid
buckets or credentials will surface on first lazy access instead.

* fix(test): handle shell exit 0 in TestRemoteMountMetadataStrategyInvalid

The weed shell process exits with code 0 even when individual commands
fail — errors appear in stdout. Check output instead of requiring a
non-nil error.

* test(remote.mount): remove metadataStrategy shell integration tests

These tests only verify string output from a shell process that always
exits 0 — they cannot meaningfully validate eager vs lazy behavior
without a real remote backend.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-12 15:21:07 -07:00
Copilot
013362d2d3 fix(shell): show planned size in fs.mergeVolumes log to clarify size limit check (#8553)
The log message was comparing against the planned size of the destination
volume (including volumes already planned to merge into it) but only
displaying the raw volume size, making the output confusing when the
displayed sizes clearly didn't add up to exceed the limit.
2026-03-11 13:56:13 -07:00
Chris Lu
b799650357 fix(shell): set LastLocalSyncTsNs in remote.copy.local so remote.uncache works (#8604)
remote.uncache checks LastLocalSyncTsNs to determine if a file has been
synced to remote. remote.copy.local was not setting this field, leaving
it at 0, which caused uncache to skip all files uploaded via
remote.copy.local.

Fixes #8602
2026-03-11 12:55:45 -07:00
Chris Lu
e1e5b4a8a6 add admin script worker (#8491)
* admin: add plugin lock coordination

* shell: allow bypassing lock checks

* plugin worker: add admin script handler

* mini: include admin_script in plugin defaults

* admin script UI: drop name and enlarge text

* admin script: add default script

* admin_script: make run interval configurable

* plugin: gate other jobs during admin_script runs

* plugin: use last completed admin_script run

* admin: backfill plugin config defaults

* templ

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* comparable to default version

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* default to run

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* format

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: respect pre-set noLock for fix.replication

* shell: add force no-lock mode for admin scripts

* volume balance worker already exists

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* admin: expose scheduler status JSON

* shell: add sleep command

* shell: restrict sleep syntax

* Revert "shell: respect pre-set noLock for fix.replication"

This reverts commit 2b14e8b826.

* templ

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* fix import

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* less logs

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* Reduce master client logs on canceled contexts

* Update mini default job type count

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-03 15:10:40 -08:00
Chris Lu
2dd3944819 Respect -minFreeSpace during ec.decode (#8467)
* shell: add ec.decode ignoreMinFreeSpace flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: respect minFreeSpace in ec.decode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: rename ec.decode minFreeSpace flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: error when ec.decode has no shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: select ec.decode target with zero shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: adjust free counts across ec.decode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* unused

* Update weed/shell/command_ec_decode.go

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-02-27 23:54:30 -08:00
Chris Lu
9b6fc49946 Chart createBuckets config #8368: Add TTL, Object Lock, and Versioning support (#8375)
* Chart createBuckets config #8368: Add TTL, Object Lock, and Versioning support

* Update weed/shell/command_s3_bucket_versioning.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* address comments

* address comments

* go fmt

* fix failures are still treated like “bucket not found”

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-26 11:56:10 -08:00
Chris Lu
a3cb7fa8cc go fmt 2026-02-25 10:25:23 -08:00
Chris Lu
b565a0cc86 Adds volume.merge command with deduplication and disk-based backend (#8441)
* Enhance volume.merge command with deduplication and disk-based backend

* Fix copyVolume function call with correct argument order and missing bool parameter

* Revert "Fix copyVolume function call with correct argument order and missing bool parameter"

This reverts commit 7b4a190643.

* Fix critical issues: per-replica writable tracking, tail goroutine cancellation via done channel, and debug logging for allocation failures

* Optimize memory usage with watermark approach for duplicate detection

* Fix critical issues: swap copyVolume arguments, increase idle timeout, remove file double-close, use glog for logging

* Replace temporary file with in-memory buffer for needle blob serialization

* test(volume.merge): Add comprehensive unit and integration tests

Add 7 unit tests covering:
- Ordering by timestamp
- Cross-stream duplicate deduplication
- Empty stream handling
- Complex multi-stream deduplication
- Single stream passthrough
- Large needle ID support
- LastModified fallback when timestamp unavailable

Add 2 integration validation tests:
- TestMergeWorkflowValidation: Documents 9-stage merge workflow
- TestMergeEdgeCaseHandling: Validates 10 edge case handling

All tests passing (9/9)

* fix(volume.merge): Use time window for deduplication to handle clock skew

The same needle ID can have different timestamps on different servers due to
clock skew and replication lag. Needles with the same ID within a 5-second
time window are now treated as duplicates (same write with timestamp variance).

Key changes:
- Add mergeDeduplicationWindowNs constant (5 seconds)
- Replace exact timestamp matching with time window comparison
- Use windowInitialized flag to properly detect window transitions
- Add TestMergeNeedleStreamsTimeWindowDeduplication test

This ensures that replicated writes with slight timestamp differences are
properly deduplicated during merge, while separate updates to the same file
ID (outside the window) are preserved.

All tests passing (10/10)

* test: Add volume.merge integration tests with 5 comprehensive test cases

* test: integration tests for volume.merge command

* Fix integration tests: use TripleVolumeCluster for volume.merge testing

- Created new TripleVolumeCluster framework (cluster_triple.go) with 3 volume servers
- Rebuilt weed binary with volume.merge command compiled in
- Updated all 5 integration tests to use TripleVolumeCluster instead of DualVolumeCluster
- Tests now properly allocate volumes on 2 servers and let merge allocate on 3rd
- All 5 integration tests now pass:
  - TestVolumeMergeBasic
  - TestVolumeMergeReadonly
  - TestVolumeMergeRestore
  - TestVolumeMergeTailNeedles
  - TestVolumeMergeDivergentReplicas

* Refactor test framework: use parameterized server count instead of hardcoded

- Renamed TripleVolumeCluster to MultiVolumeCluster with serverCount parameter
- Replaced hardcoded volumePort0/1/2 with slices for flexible server count
- Updated StartTripleVolumeCluster as backward-compatible wrapper calling StartMultiVolumeCluster(t, profile, 3)
- Made directory creation, port allocation, and server startup loop-based
- Updated accessor methods (VolumeAdminAddress, VolumeGRPCAddress, etc.) to support any server count
- All 5 integration tests continue to pass with new parameterized cluster framework
- Enables future testing with 2, 4, 5+ volume servers by calling StartMultiVolumeCluster directly

* Consolidate cluster frameworks: StartDualVolumeCluster now uses MultiVolumeCluster

- Made DualVolumeCluster a type alias for MultiVolumeCluster
- Updated StartDualVolumeCluster to call StartMultiVolumeCluster(t, profile, 2)
- Removed duplicate code from cluster_dual.go (now just 17 lines)
- All existing tests using StartDualVolumeCluster continue to work without changes
- Backward compatible: existing code continues to use the old function signatures
- Added wrapper functions in cluster_multi.go for StartTripleVolumeCluster
- Enables unified cluster management across all test suites

* Address PR review comments: improve error handling and clean up code

- Replace parse error swallow with proper error return
- Log cleanup and restoration errors instead of silently discarding them
- Remove unused offset field from memoryBackendFile struct
- Fix WriteAt buffer truncation bug to preserve trailing bytes
- All unit tests passing (10/10)
- Code compiles successfully

* Fix PR review findings: test improvements and code quality

- Add timeout to runWeedShell to prevent hanging
- Add server 1 readonly status verification in tests
- Assert merge fails when replicas writable (not just log output)
- Replace sleep with polling for writable restoration check
- Fix WriteAt stale data snapshot bug in memoryBackendFile
- Fix startVolume error logging to show current server log
- Fix volumePubPorts double assignment in port allocation
- Rename test to reflect behavior: DoesNotDeduplicateAcrossWindows
- Fix misleading dedup window comment

Unit tests: 10/10 passing
Binary: Compiles successfully

* Fix test assumption: merge command marks volumes readonly automatically

TestVolumeMergeReadonly was expecting merge to fail on writable volumes, but the
merge command is designed to mark volumes readonly as part of its operation. Fixed
test to verify merge succeeds on writable volumes and properly restores writable
state afterward. Removed redundant Test 2 code that duplicated the new behavior.

* fmt

* Fix deduplication logic to correctly handle same-stream vs cross-stream duplicates

The dedup map previously used only NeedleId as key, causing same-stream
overwrites to be incorrectly skipped as duplicates. Changed to track which
stream first processed each needle ID in the current window:

- Cross-stream duplicates (same ID from different streams, within window) are skipped
- Same-stream duplicates (overwrites from same stream) are kept
- Map now stores: needleId -> streamIndex of first occurrence in window

Added TestMergeNeedleStreamsSameStreamDuplicates to verify same-stream
overwrites are preserved while cross-stream duplicates are skipped.

All unit tests passing (11/11)
Binary compiles successfully
2026-02-25 10:12:09 -08:00
Chris Lu
da4edb5fe6 Fix live volume move tail timestamp (#8440)
* Improve move tail timestamp

* Add move tail timestamp integration test

* Simulate traffic during move
2026-02-24 20:07:26 -08:00
Chris Lu
cd6832249b Fix volume.fsck crashing on EC volumes and add multi-volume vacuum support (#8406)
* helm: refine openshift-values.yaml to remove hardcoded UIDs

Remove hardcoded runAsUser, runAsGroup, and fsGroup from the
openshift-values.yaml example. This allows OpenShift's admission
controller to automatically assign a valid UID from the namespace's
allocated range, avoiding "forbidden" errors when UID 1000 is
outside the permissible range.

Updates #8381, #8390.

* helm: fix volume.logs and add consistent security context comments

* Update README.md

* fix volume.fsck crashing on EC volumes and add multi-volume vacuum support

* address comments
2026-02-22 22:07:15 -08:00
Konstantin Lebedev
01b3125815 [shell]: volume balance capacity by min volume density (#8026)
volume balance by min volume density and active volumes
2026-02-19 13:30:59 -08:00
Chris Lu
f44e25b422 fix(iam): ensure access key status is persisted and defaulted to Active (#8341)
* Fix master leader election startup issue

Fixes #error-log-leader-not-selected-yet

* not useful test

* fix(iam): ensure access key status is persisted and defaulted to Active

* make pb

* update tests

* using constants
2026-02-13 20:28:41 -08:00
Lisandro Pin
fbe7dd32c2 Implement full scrubbing for regular volumes (#8254)
Implement full scrubbing for regular volumes.
2026-02-13 15:47:29 -08:00
Lisandro Pin
221bd237c4 Fix file stat collection metric bug for the cluster.status command. (#8302)
When the `--files` flag is present, `cluster.status` will scrape file metrics
from volume servers to provide detailed stats on those. The progress indicator
was not being updated properly though, so the command would complete before
it read 100%.
2026-02-11 13:34:20 -08:00
Chris Lu
a3136c523f Fix volume.fsck 401 Unauthorized by adding JWT to HTTP delete requests (#8306)
* Fix volume.fsck 401 Unauthorized by adding JWT to HTTP delete requests

* Additionally, for performance, consider fetching the jwt.filer_signing.key once before any loops that call httpDelete, rather than inside httpDelete itself, to avoid repeated configuration lookups.
2026-02-11 13:32:56 -08:00
Lisandro Pin
e657e7d827 Implement local scrubbing for EC volumes. (#8283) 2026-02-11 11:04:08 -08:00
Lisandro Pin
2a73219397 Add weed shell command volumeServer.state to query/update volume server state settings. (#8271)
Add weed shell command `volumeServer.state` to query/update volume server states.
2026-02-11 11:02:37 -08:00
Lisandro Pin
f400fb44a0 Update cluster.status to resolve file details on EC volumes. (#8268)
Also parallelizes queries for file metrics collections when the `--files`
flag is specified, and improves the command's output for readability:

```
> cluster.status --files
collecting file stats: 100%

cluster:
	id:       topo
	status:   LOCKED
	nodes:    10
	topology: 1 DC, 10 disks on 1 rack

volumes:
	total:    3 volumes, 1 collection
	max size: 32 GB
	regular:  1/80 volume on 3 replicas, 3 writable (100%), 0 read-only (0%)
	EC:       2 EC volumes on 28 shards (14 shards/volume)

storage:
	total:           269 MB (522 MB raw, 193.95%)
	regular volumes: 91 MB (272 MB raw, 300%)
	EC volumes:      178 MB (250 MB raw, 140%)

files:
	total:   363 files, 300 readable (82.64%), 63 deleted (17.35%), avg 522 kB per file
	regular: 168 files, 105 readable (62.5%), 63 deleted (37.5%), avg 540 kB per file
	EC:      195 files, 195 readable (100%), 0 deleted (0%), avg 506 kB per file
```
2026-02-09 17:52:43 -08:00
Chris Lu
30812b85f3 fix ec.encode skipping volumes when one replica is on a full disk (#8227)
* fix ec.encode skipping volumes when one replica is on a full disk

This fixes issue #8218. Previously, ec.encode would skip a volume if ANY
of its replicas resided on a disk with low free volume count. Now it
accepts the volume if AT LEAST ONE replica is on a healthy disk.

* refine noFreeDisk counter logic in ec.encode

Ensure noFreeDisk is decremented if a volume initially marked as bad
is later found to have a healthy replica. This ensures accurate
summary statistics.

* defer noFreeDisk counting and refine logging in ec.encode

Updated logging to be replica-scoped and deferred noFreeDisk counting to
the final pass over vidMap. This ensures that the counter only reflects
volumes that are definitively excluded because all replicas are on full
disks.

* filter replicas by free space during ec.encode

Updated doEcEncode to filter out replicas on disks with
FreeVolumeCount < 2 before selecting the best replica for encoding.
This ensures that EC shards are not generated on healthy source
replicas that happen to be on disks with low free space.
2026-02-09 14:23:11 -08:00
Chris Lu
6a61037333 fix issue #8230: volume.fsck deletion logic to respect purgeAbsent flag (#8266)
* fix issue #8230: volume.fsck deletion logic to respect purgeAbsent flag

This commit fixes two issues in volume.fsck:
1. Missing chunks in existing volumes are now deleted if -reallyDeleteFilerEntries is set.
2. Missing volumes are now properly handled when a -volumeId filter is specified, allowing deletion of filer entries for those volumes.

* address PR feedback for issue #8230

- Ensure volume filter is applied before reporting missing volumes
- Fix potential nil-pointer dereferences in httpDelete method
- Use proper error checking throughout httpDelete

* address second round PR feedback for issue #8230

- Use fmt.Fprintf(c.writer, ...) instead of fmt.Printf
- Add missing newline in "deleting path" log message
2026-02-09 13:23:17 -08:00
Lisandro Pin
63b846b73b Parallelize operations for the volume.scrub and ec.scrub commands (#8247)
Parallelize operations for the `volume.scrub` and `ec.scrub` commands.
2026-02-09 09:07:06 -08:00
Chris Lu
2ed5a8f65c add tests 2026-02-09 01:37:56 -08:00
Feng Shao
963398ac8c use ReadFull (#40) (#8240)
* use ReadFull

* fix error checking
2026-02-06 20:51:47 -08:00
Lisandro Pin
9d751a7b61 Contrib/volume scrub local (#8226) 2026-02-05 14:44:12 -08:00
Chris Lu
3306abae10 shell: add minCacheAge flag to remote.uncache command (#8225)
* add minCacheAge flag to remote.uncache command #8221

* address code review feedback: add nil check and improve test isolation

* address code review feedback: use consistent timestamp in FileFilter
2026-02-05 12:57:27 -08:00
Lisandro Pin
2ecbae3611 Add volume.scrub and ec.scrub shell commands to scrub regular & EC volumes on demand. (#8188)
* Implement RPC skeleton for regular/EC volumes scrubbing.

See https://github.com/seaweedfs/seaweedfs/issues/8018 for details.

* Add `volume.scrub` and `ec.scrub` shell commands to scrub regular & EC volumes on demand.

F.ex:

```
> ec.scrub --full
Scrubbing 10.200.17.13:9005 (1/10)...
Scrubbing 10.200.17.13:9001 (2/10)...
Scrubbing 10.200.17.13:9008 (3/10)...
Scrubbing 10.200.17.13:9009 (4/10)...
Scrubbing 10.200.17.13:9004 (5/10)...
Scrubbing 10.200.17.13:9010 (6/10)...
Scrubbing 10.200.17.13:9007 (7/10)...
Scrubbing 10.200.17.13:9002 (8/10)...
Scrubbing 10.200.17.13:9003 (9/10)...
Scrubbing 10.200.17.13:9006 (10/10)...
Scrubbed 20 EC files and 20 volumes on 10 nodes

Got scrub failures on 1 EC volumes and 2 EC shards :(
Affected volumes: 10.200.17.13:9005:1
Details:
	[10.200.17.13:9005] expected 551041 bytes for needle 6, got 551072
	[10.200.17.13:9005] needles in volume file (1) don't match index entries (173) for volume 1
```
2026-02-04 17:08:31 -08:00