1043 Commits

Author SHA1 Message Date
qzhello
60c76120fc fix(shell): use exact match for volume.balance -racks/-nodes filter (#9279)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
  substring of the user-supplied flag value (e.g. -racks=rack10 also matched
  rack1). Replace it with an exact-match set parsed from the comma-separated
  flag value, and add regression tests for both -racks and -nodes paths.

  Also fix a small typo in the "remote storage" error returned by
  maybeMoveOneVolume.

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
  substring of the user-supplied flag value (e.g. -racks=rack10 also matched
  rack1). Replace it with an exact-match set parsed from the comma-separated
  flag value, and add regression tests for both -racks and -nodes paths.

  Also fix a small typo in the "remote storage" error returned by
  maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers
2026-04-29 10:19:16 -07:00
qzhello
108e42fb8b chore(shell): fix typo in EC shard helper param names (#9277)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names
2026-04-28 23:09:26 -07:00
Chris Lu
294f7c3d04 shell: expand ~ in local file path arguments (#9265)
* shell: expand `~` in local file path arguments

The weed shell parses commands itself instead of going through an OS
shell, so a path like `~/Downloads/foo.meta` was passed verbatim to
`os.Open`, which fails because no `~` directory exists. Users had to
spell out absolute home paths in every command.

Add an `expandHomeDir` helper that resolves a leading `~` or `~/...` to
the user's home directory, and run user-supplied local file paths in
the affected shell commands through it:

  fs.meta.load          (positional file)
  fs.meta.save          (-o)
  fs.meta.changeVolumeId (-mapping)
  s3.iam.export         (-file)
  s3.iam.import         (-file)
  s3.policy             (-file)
  s3tables.bucket       (-file)
  s3tables.table        (-file, -metadata)
  volume.fsck           (-tempPath)

Filer-namespace path flags (`-dir`, `-path`, `-locationPrefix`, etc.)
are unaffected; they live in the filer, not on the local FS.

* shell: reuse util.ResolvePath instead of a new helper

util.ResolvePath already does tilde expansion; drop the local
expandHomeDir helper and route every shell call site through it.
2026-04-28 12:30:13 -07:00
qzh
21fadf5582 fix(shell): correct volume.list -writable filter unit and comparison (#9231)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison
2026-04-26 22:20:46 -07:00
Chris Lu
da2e90aefd fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207)
* fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139)

A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash
"partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every
FUSE-initiated filer RPC fail with:

  rpc error: code = Internal desc = grpc: error while marshaling:
  string field contains invalid UTF-8

and then produced an avalanche of "connection is closing" errors on
unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the
volume-server QPS dips reported in #9139.

Root cause is twofold:

1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes
   raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/
   Symlink all forwarded those bytes directly into CreateEntryRequest.Name,
   DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and
   Entry.Name. saveDataAsChunk also copied the FullPath into
   AssignVolumeRequest.Path unchecked.

2. When the marshal failed, shouldInvalidateConnection treated the
   resulting codes.Internal as a connection problem and dropped the
   shared cached ClientConn — canceling every other in-flight RPC on it.

Fix:

- Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching
  util.FullPath.DirAndName) and make checkName return the sanitized name.
  Apply at every FUSE entry point that passes a name to the filer RPC,
  including Unlink/Rmdir (which did not previously call checkName) and
  both oldName/newName in Rename. Add a backstop scrub for
  AssignVolumeRequest.Path so async flush paths cannot reintroduce
  invalid bytes from a pre-sanitization cached FullPath.

- In weed/pb.shouldInvalidateConnection, detect client-side marshal
  errors via the gRPC library's "error while marshaling" prefix and
  return false: the connection is healthy, only the request is bad.

Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231

* fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe)

Sanitized filenames flow downstream into HTTP URLs (volume-server uploads,
filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string
delimiter and would split the path the first time the name lands in one,
so swap every invalid-UTF-8 replacement to '_'. This covers the two
pre-existing sites in weed/util/fullpath.go as well, keeping all paths
sanitized the same way.

* refactor(pb): detect client-side marshal errors via errors.As, not substring

Replace the raw `strings.Contains(err.Error(), ...)` check with a
type-based carve-out: use errors.As against the `GRPCStatus() *Status`
interface to pull the original Status out of any fmt.Errorf("...: %w")
wrapping, then match the library-owned "grpc:" prefix on that Status's
Message.

Why not errors.Is against a proto-level sentinel: gRPC's encode()
collapses the inner proto error with "%v" (stringification) before
wrapping it in a Status, so the original error type does not survive
into the caller. The Status itself is the structural signal that does
survive.

Why not status.FromError: when the caller wraps the Status error with
fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message
with the full err.Error() of the outermost wrapper, which defeats a
prefix check on the library-owned message. errors.As gives us the
original Status whose Message is still verbatim from the gRPC library.

A new test asserts that a plain errors.New("grpc: error while marshaling: …")
— i.e. the same text attached to something that is NOT a gRPC status —
does not short-circuit invalidation, so we never silently keep a cached
connection alive based on a coincidental substring match.

* refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized

Addresses review feedback on PR #9207.

Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName,
Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes
through a single util.SanitizeUTF8Name helper, so the replacement char
('_' — URL-safe) is chosen in one place.

Outside-diff: three proto fields took raw FullPath strings that could
break marshaling if an entry ever carried invalid UTF-8
(CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in
Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The
reviewer's suggested fix — using DirAndName() — would have silently
changed Directory from parent to grandparent, because DirAndName
sanitizes only the trailing component. Added FullPath.Sanitized(), which
scrubs every component, and applied it at the three sites. Exposure is
narrow in practice (FUSE-boundary sanitization and the gRPC-side
isClientSideMarshalError carve-out already cover the #9139 cascade),
but the defense-in-depth is cheap and consistent with the existing
AssignVolume backstop.

New tests in weed/util/fullpath_test.go document:
- SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes
  become '_' (not '?', which is URL-special).
- FullPath.Sanitized: scrubs bytes in any component, not just the last.
- FullPath.DirAndName: dir remains raw on purpose — callers needing a
  clean full path must use Sanitized(). The test pins this behavior so
  it is not accidentally "fixed" in a way that changes the (dir, name)
  semantics callers depend on.
2026-04-23 19:17:35 -07:00
Chris Lu
7364f148bd fix(s3/shell): factor EC volumes into bucket size metrics and collection.list (#9182)
* fix(s3/shell): include EC volumes in bucket size metrics and collection.list

S3 bucket size metrics exported to Prometheus (and fed through
stats.UpdateBucketSizeMetrics) are computed by
collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos.
As soon as a volume was encoded to EC it dropped out of every aggregate,
so Grafana showed bucket sizes shrinking while physical disk usage kept
climbing. The shell helper collectCollectionInfo — used by collection.list
and s3.bucket.quota.enforce — had the same gap, with the EC branch left as
a commented-out TODO.

Fold EC shards into both paths using the same approach the admin dashboard
already uses (PR #9093):

- PhysicalSize / Size sum across shard holders: EC shards are node-local
  (not replicas), so per-node TotalSize() and MinusParityShards().TotalSize()
  sum to the whole-volume physical and logical sizes respectively.
- FileCount is deduped via max across reporters (every shard holder reports
  the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and
  must not pin the aggregate).
- DeleteCount is summed (each delete tombstones exactly one node's .ecj).
- VolumeCount increments once per unique EC volume id.

Adds regression tests covering pure-EC, mixed regular+EC, and the
slow-reporter FileCount dedupe case.

Refs #9086

* Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe

- Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding
  package that walk the shard bitmap directly instead of materializing a
  ShardsInfo and copying it via MinusParityShards(). Keeps the
  DataShardsCount dependency encapsulated in one place and avoids the
  per-shard allocation/copy overhead in the metrics hot path.
- Switch shell collectCollectionInfo ecVolumes map to a composite
  {collection, volumeId} key, matching the bucket_size_metrics collector
  and defending against any cross-collection volume id aliasing.
- Dedupe VolumeCount in shell addToCollection by volume id so regular
  volumes aren't counted once per replica presence. Aligns the shell's
  collection.list output with the S3 metrics collector and the EC branch,
  all of which now report logical volume counts.
- Add unit tests for the new helpers and for the regular-volume
  VolumeCount dedupe.

* Parameterize EcShardsDataSize with dataShards for custom EC ratios

Add a dataShards parameter to EcShardsDataSize so forks with per-volume
ratio metadata (e.g. the enterprise data_shards field carried on an
extended VolumeEcShardInformationMessage) can pass the configured value
and get accurate logical sizes under custom EC policies like 6+3 or 16+6.
Passing 0 or a negative value falls back to the upstream DataShardsCount
default, which is correct for the fixed 10+4 layout — so OSS callers in
s3api and shell pass 0 and keep their current behavior.

Added table cases covering the custom 6+3 and 16+6 paths so the
parameterization is pinned by tests.
2026-04-21 20:17:42 -07:00
Chris Lu
49f62df2cf fix(shell): mergeVolumes hard-link safety and cleaner cleanup logging (#9163)
* fix(shell): skip hard-linked entries in fs.mergeVolumes

Hard-linked entries share a chunk list with their siblings, but the
filer's UpdateEntry only rewrites one entry at a time. Moving a chunk
here leaves every other hard-linked sibling pointing at a fid that
either gets deleted by the filer's own garbage step after UpdateEntry
or by deleteMovedSourceNeedles (#9160) — either way, the siblings end
up with dangling references.

Skip the entry with a visible log line so operators know the file was
bypassed and can handle it explicitly (copy-then-unlink, or dedup
before merge). Detected via entry.HardLinkId being non-empty, which is
the same signal the filer itself uses (weed/pb/filer_pb/filer.pb.go:451).

Flagged by coderabbit on #9160 post-merge.

* fix(shell): mergeVolumes suppresses 404 alongside 304 in source cleanup

BatchDelete also returns StatusNotFound (404) for an already-deleted
needle when ReadVolumeNeedle can't find it in the cookie-check path,
not only StatusNotModified (304) from DeleteVolumeNeedle returning
size 0. Both are benign races against a concurrent fsck purge or a
replica that already reconciled, so don't clutter the output with
"delete ... not found" warnings for them.

Flagged by coderabbit on #9160 post-merge.

* fix(shell): mergeVolumes merges hard-linked files via dedup on HardLinkId

Hard-linked siblings share one chunk list through a KV blob keyed by
HardLinkId (see weed/filer/filerstore_hardlink.go). UpdateEntry's
setHardLink rewrites that blob and maybeReadHardLink overrides per-entry
chunks with the blob's on every read, so a single UpdateEntry propagates
new fids to every sibling automatically — the previous skip-hardlinks
bailout was overly conservative and left hard-linked files stuck on
merge-source volumes forever.

Process each HardLinkId exactly once per run with a sync.Map so BFS
workers in different directories synchronize without a global lock.
First sibling carries the chunk move + UpdateEntry; later siblings find
the id in the map and return — preventing the real race, which is two
siblings trying to re-download an already-moved source needle or
double-queue the same fid for deletion.

Also address the log-spam review on deleteMovedSourceNeedles: an
unreachable volume server returns one error per needle, so collapse
multiple failures into a single per-server line with the first error as
an example.
2026-04-20 17:55:20 -07:00
Chris Lu
caaa53aee3 fix(shell): ec.encode health-check key mismatch on K8s deployments (#9164)
Build freeVolumeCountMap using dn.Address so the key matches
wdclient.Location.Url during the subsequent lookup. Keying by dn.Id
silently filtered out every replica in deployments where dn.Id is a
short name (e.g. Kubernetes StatefulSet pod name) while the location
Url is a FQDN:port, causing "no healthy replicas" even with ample
free capacity.

Also filter replicas before marking volumes readonly so that a failed
health check no longer strands volumes in readonly state.

Fixes #9145
2026-04-20 15:57:30 -07:00
Chris Lu
e725eb4079 refactor(shell): run volume.fsck purge once per volume, after all replicas (#9159)
* refactor(shell): run volume.fsck purge once per volume, after all replicas

The purge step in findExtraChunksInVolumeServers was nested inside the
outer `for dataNodeId` loop, so it fired once per data-node iteration
rather than once total. Two consequences:

1. The replica-intersection safety net was broken. The code marks a fid
   "found in all replicas" only after every replica has reported its
   orphans, but the purge ran after the first data node already, so
   fids contributed only by later replicas never got the `true` flag
   in time. Without `-forcePurging` that meant some legitimate orphans
   were never purged; with `-forcePurging` the flag was ignored so the
   bug was hidden.

2. Visible output got noisy: "purging orphan data for volume X..."
   printed 2-3 times per volume (N_datanodes * N_replicas RPCs to the
   same locations) since purgeFileIdsForOneVolume already fans out to
   every replica location via MasterClient.GetLocations.

Split the work into two explicit phases: collect orphans from every
replica first, then purge each volume once. Drop the per-replica loop
around purgeFileIdsForOneVolume since it already handles all replicas
internally. Keep the per-replica mark-writable loop (each replica's
readonly bit has to be flipped before the purge RPC fans out to it).

Also simplify the gating expression — `isSeveralReplicas &&
foundInAllReplicas` is redundant given the preceding `!isSeveralReplicas`
branch — and replace `!(X > 0)` with the more idiomatic `len(X) == 0`.

Related to #9116 follow-up on multiple fsck passes needed to fully
clean a volume.

* address review: per-replica readonly tracking, count-based intersection, defer-per-volume

Three issues raised on the v1:

1. The readonly cleanup stored a single isReadOnlyReplicas[volumeId]=bool
   that flipped true if any replica was read-only, then the defer marked
   every replica in serverReplicas[volumeId] read-only on exit. If a
   volume had mixed replica modes (one RO, one RW), the originally-RW
   replica ended up RO after fsck returned. Track read-only state per
   replica in readOnlyServerReplicas[volumeId] and revert only those.

2. The defer inside the volumeId loop accumulated for the entire fsck
   run, so every volume we processed stayed writable until the whole
   command returned. Split the per-volume logic into purgeOneVolume so
   the defers unwind between volumes.

3. The intersection logic used a sticky bool that treated "seen on any
   2 of 3 replicas" as "seen on all replicas" — a 3+-replica volume
   would get purged for fids only 2 replicas agreed on, which is what
   -forcePurging is supposed to opt into. Switch to a count-based
   map[fid]int compared against volumeReplicaCounts[volumeId], so we
   only purge without -forcePurging when every replica agrees.

Also drop the now-unused serverReplicas map.
2026-04-20 15:32:47 -07:00
Chris Lu
08a7502b2c fix(shell): error on missing volume id in fsck, mergeVolumes, vacuum (#9158)
* fix(shell): error on missing volume id in fsck, mergeVolumes, vacuum

Three shell commands silently report success when -volumeId /
-fromVolumeId / -toVolumeId names a volume the master doesn't know
about: typos, already-deleted volumes, and stale scripts all look
identical to a clean no-op, which is what made the confusion in #9116
take as long as it did to diagnose.

- volume.fsck: filter at the per-datanode loop drops unknown ids and
  findExtraChunksInVolumeServers ends with totalOrphanChunkCount==0,
  printing "no orphan data".
- fs.mergeVolumes: createMergePlan iterates only known volumes, so an
  unknown -fromVolumeId produces an empty plan and we print just the
  "max volume size: N MB" header (indistinguishable from "nothing to
  merge").
- volume.vacuum: the master's VacuumVolume RPC silently iterates
  matching volumes; a missing id returns success having done nothing.

Validate the requested ids against the current topology up front and
return an explicit "volume(s) not found on master: [X Y]" error. Also
drop a stale duplicate `if err != nil` in volume.fsck.Do left over from
a prior refactor.

Surfaces #9116 follow-up from madalee-com.

* address review: propagate reloadVolumesInfo error; dedupe vacuum missing ids

- fs.mergeVolumes: c.reloadVolumesInfo's return was ignored. If the
  master is unreachable or VolumeList fails, c.volumes stays empty and
  the new validation block reports "fromVolumeId X not found on master"
  — masking the real connection/RPC failure. Return the wrapped error
  instead.

- volume.vacuum: "volume.vacuum -volumeId 5,5,5" on a missing volume 5
  listed [5 5 5] in the error. Collect missing ids in a set so each
  missing id appears once.

* address review: reject fromVolumeId/toVolumeId values that overflow uint32

flag.Uint produces a uint (64-bit on amd64), and the existing cast to
needle.VolumeId silently truncates to uint32. A typo like
`-fromVolumeId=4294967297` would wrap to volume 1 and slip past every
other validation, so the merge would run against a completely
different volume than the operator intended.

Bail out with an explicit error when the raw flag value exceeds the
uint32 range, before the cast.
2026-04-20 15:32:31 -07:00
Chris Lu
6bdd775963 feat(shell): fs.mergeVolumes deletes source needles after filer update (#9160)
* feat(shell): fs.mergeVolumes deletes source needles after filer update

Before this change, mergeVolumes only copied chunks to the destination
volume and updated the filer — the source needle sat untouched on its
original volume as a silent orphan. Operators had to run a separate
volume.fsck + volume.vacuum pass to actually reclaim the space, and
#9116 (comment 4282692876) showed how that pipeline can look exactly
like "mergeVolumes did nothing": the source volume keeps reporting its
original size even though every chunk has been logically moved out.

Clean up the source inline. For each entry, track the pre-move fids as
they're captured, and after the UpdateEntry RPC commits, issue
BatchDelete on every replica of each source volume. Key invariants:

- Source fids are only deleted AFTER UpdateEntry succeeds; if the
  filer write fails we skip the cleanup for that entry so we never
  delete data the filer still references.
- rewriteManifestChunk grew a fourth return value so nested manifest
  and sub-chunk moves propagate their moved-source list back to the
  top-level callsite. The outer manifest itself is recorded at the
  callsite, since only the callsite sees the pre-rewrite fid.
- deleteMovedSourceNeedles logs errors but never returns them.
  Propagating would abort TraverseBfs mid-merge, stranding remaining
  entries; logging leaves the fallback path (fsck reconciles later)
  intact.
- StatusNotModified from the volume server is expected whenever a
  concurrent fsck purge beat us to the delete or a replica already
  reconciled — don't warn on it.

Readonly source volumes are already rejected up front by
createMergePlan, so by the time we reach the delete the source is
writable. If a replica's readonly bit has flipped since then the
delete will fail and get logged; the user can re-run once they've
fixed the replica (same failure mode as today's fsck purge).

Fixes the space-not-reclaimed half of #9116.

Related design discussion: #8589.

* address review: cast r.Status to int in StatusNotModified compare

http.StatusNotModified is an untyped constant so the compare works as
written, but the int32/int mixed-type signal trips static analyzers
and PR tooling. Cast explicitly and note why.
2026-04-20 14:37:20 -07:00
Chris Lu
9a6b566fb1 fix(shell): volume.fsck keeps going past a single broken chunk manifest (#9140)
* fix(shell): volume.fsck no longer aborts on a single broken chunk manifest

Previously a single entry whose chunk-manifest could not be read (e.g. the
manifest needle was missing or its sub-chunks pointed at a now-gone volume)
caused collectFilerFileIdAndPaths to return immediately with
"failed to ResolveChunkManifest". The whole fsck run failed, so an operator
with even one corrupted file could not use volume.fsck to find or clean up
unrelated orphan needles on other volumes — they had to locate and delete
the bad entries first, blind, with no help from fsck.

Log the resolution failure with the entry path, fall back to recording the
top-level chunk fids the entry references (data fids and manifest fids
themselves; sub-chunks behind the unresolvable manifest stay unknown), and
keep traversing. Track the count of unresolved entries on the command struct
and refuse -reallyDeleteFromVolume for the run when the count is non-zero,
since the in-use fid set is incomplete and a purge could otherwise delete
live sub-chunks behind the broken manifest. Read-only fsck still produces a
useful (if conservatively over-reported) orphan listing so the operator can
see and fix the broken entries first, then re-run with apply.

Discovered while diagnosing #9116.

* address review: use callback ctx and atomic counter

- Pass the BFS callback's ctx to ResolveChunkManifest so a Ctrl+C / first-error
  cancellation propagates into the manifest fetch instead of using
  context.Background().
- TraverseBfs runs the callback across K=5 worker goroutines (filer_pb/filer_client_bfs.go),
  so the unresolvedManifestEntries field on commandVolumeFsck is shared across
  workers and was racing. Switch it to atomic.Int64 with Add/Load.

* address review: reset counter per Do(), pass through ctx errors

- commandVolumeFsck is a singleton registered in init() and reused across
  shell invocations. Without resetting the unresolved-manifest counter at
  the top of Do(), a single failed run permanently suppressed
  -reallyDeleteFromVolume in the same shell session. Reset to 0 right
  after flag parsing.
- Treating context cancellation as manifest corruption was wrong: a
  Ctrl+C or deadline mid-traversal would inflate the counter and emit
  misleading "manifest broken" warnings for entries that were never
  examined. Detect context.Canceled / context.DeadlineExceeded and
  return the error so the BFS unwinds cleanly.

Not changing the findMissingChunksInFiler branch's purgeAbsent /
applyPurging gating: that path checks recorded filer fids against
volume idx files, and a broken-manifest entry's recorded manifest fid
will fail the existence check and get purged — which is the cleanup
the operator wants for those entries. Adding a gate would block the
exact use case the warning points them at.
2026-04-19 23:06:28 -07:00
Chris Lu
d57fc67022 fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files (#9127)
* fix(shell): fs.mergeVolumes now rewrites manifest chunks for large files

Previously fs.mergeVolumes skipped any chunk whose IsChunkManifest flag was
true, printing "Change volume id for large file is not implemented yet" and
continuing. Because the BFS traversal only looks at top-level
entry.Chunks, sub-chunks referenced inside a manifest were never
considered either. For any file stored as a chunk manifest (large files
go this path), chunks in the source volume stayed put, leaving behind a
few MB of live data that vacuum and volume.deleteEmpty couldn't clean
up.

This change resolves each manifest chunk recursively, moves any
sub-chunk whose volume id is in the merge plan via the existing
moveChunk path, and re-serializes the manifest. If the manifest chunk
itself lives in a source volume, or any sub-chunk moved, the new
manifest blob is uploaded to a freshly assigned file id (the old
needle becomes orphaned and is reclaimed by vacuum like any other
moved chunk).

Fixes #9116.

* address review: batch UpdateEntry, fix dry-run, defer restore, avoid source volumes

- Call UpdateEntry once per entry after the chunk loop instead of once per
  moved chunk (gemini nit).
- In dry-run mode, mark anySubChanged when a sub-chunk in the plan is
  encountered and return changed=true after printing "rewrite manifest",
  so nested manifests also surface their would-rewrites (gemini nit).
- Defer filer_pb.AfterEntryDeserialization so the manifest chunk list is
  restored even when proto.Marshal fails (coderabbit nit).
- Reject AssignVolume results whose file id lands on a volume that is a
  source in the merge plan, and retry — otherwise the replacement
  manifest could be written to the volume being emptied (coderabbit).
2026-04-17 21:17:51 -07:00
Jaehoon Kim
96af27a131 feat(shell): add fs.distributeChunks command for even chunk distribution (#9117)
* feat(shell): add fs.distributeChunks command for even chunk distribution

Add a new weed shell command that redistributes a file's chunks evenly
across volume server nodes.

Supports three distribution modes via -mode flag:
- primary:     balance chunk ownership across nodes (default)
- replica:     balance both ownership and replica copies
- round-robin: assign chunks by offset order for sequential read
               optimization (chunk[0]->A, chunk[1]->B, chunk[2]->C, ...)

Additional options:
- -nodes=N to target specific number of nodes
- -apply to execute (dry-run by default)

Usage:
  fs.distributeChunks -path=/buckets/file.dat
  fs.distributeChunks -path=/buckets/file.dat -mode=round-robin -apply
  fs.distributeChunks -path=/buckets/file.dat -mode=replica -apply
  fs.distributeChunks -path=/buckets/file.dat -nodes=5 -apply

* fix(shell): improve fs.distributeChunks robustness and code quality

- Propagate flag parse errors instead of swallowing them (return err)
- Handle nil chunk.Fid by falling back to legacy FileId string parsing
- Simplify node membership check using slices.Contains

* fix(shell): fix dead round-robin print loop in fs.distributeChunks

The loop was computing targetNode with sc.index%totalNodes (original
chunk index) instead of the sequential position, and discarding it via
_ = targetNode without printing anything. Replace with a correct loop
using pos%totalNodes and actually print the first 12 node assignments.

* fix(shell): compute replication/collection per-chunk in fs.distributeChunks

Previously replication and collection were derived once from chunks[0]
and reused for all moves, causing wrong volume placement for chunks
belonging to different volumes or collections. Now each chunk looks up
its own volumeInfoMap entry immediately before calling operation.Assign.

* fix(shell): prefer assignResult.Auth JWT over local signing key in fs.distributeChunks

When the master returns an Auth token in the Assign response, use it
directly for the upload instead of generating a new JWT from the local
viper signing key. Fall back to local key generation only when Auth is
empty, matching the pattern used by other upload paths.

* fix(shell): add timeout and error handling to delete requests in fs.distributeChunks

The delete loop was ignoring http.NewRequest errors and had no timeout,
risking a nil-request panic or indefinite block. Replace with
http.NewRequestWithContext and a 30s timeout, handle request creation
errors by incrementing deleteFailCount, and cancel the context
immediately after Do returns.

* feat(shell): parallelize chunk moves in fs.distributeChunks using ErrorWaitGroup

Sequential chunk moves are a bottleneck for large LLM model files with
hundreds or thousands of chunks. Use ErrorWaitGroup with
DefaultMaxParallelization (10) to run download/assign/upload concurrently.
Guard movedRecords appends, chunk.Fid updates, and writer output with a
mutex. Individual chunk failures are non-fatal and logged inline; only
successfully moved chunks are included in the metadata update.

* fix(shell): try all replica URLs on download in fs.distributeChunks

Previously only the first volume server URL was attempted, causing chunk
moves to fail if that replica was unreachable. Now iterates through all
URLs returned by LookupVolumeServerUrl and stops at the first success.

* refactor(shell): apply extract method pattern to fs.distributeChunks

Do() was a single ~615-line function. Break it into focused helpers:
- lookupFileEntry: filer entry lookup
- validateChunks: chunk manifest guard
- collectVolumeTopology: master topology query + ownership mapping
- buildDistributionCounts: chunk→node mapping and owner/copy tallies
- selectActiveNodes: target node selection
- printCurrentDistribution: per-node distribution table
- planDistribution: mode-switch planning (primary/replica/round-robin)
- printRedistributionPlan: before/after plan table
- relevantNodes: active-or-occupied node filter

Do() is now ~100 lines of orchestration; each helper has a single
clear responsibility.

* test(shell): add unit tests for fs.distributeChunks algorithms

Cover all three distribution modes and supporting helpers:
- shortName, relevantNodes
- computeOwnerTarget (even/uneven split, inactive node drain)
- buildDistributionCounts (normal + nil Fid fallback)
- selectActiveNodes (all nodes / limited count)
- planOwnerMoves (imbalanced → balanced, already balanced)
- planDistribution primary (chunks balanced, no-op when even)
- planDistribution round-robin (offset ordering, correct assignment)
- planDistribution replica (owner + copy balancing)
- printRedistributionPlan (output format)

* fix(shell): add 5-minute timeout to chunk downloads in fs.distributeChunks

Download requests had no per-request timeout, unlike delete operations
which already use 30s. Replace readUrl() calls with inline
http.NewRequestWithContext + context.WithTimeout(5m) so a hung volume
server cannot block a goroutine indefinitely during redistribution.

* fix(shell): remove redundant deleteOldChunks in fs.distributeChunks

filer.UpdateEntry already calls deleteChunksIfNotNew internally, which
computes the diff between old and new entry chunks and deletes the ones
no longer referenced. Our explicit deleteOldChunks was racing with this
filer-side cleanup, causing spurious 404 warnings on ~75% of deletes.

Remove deleteOldChunks, movedChunkRecord type, and reduce
executeChunkMoves return type to (int, error) for the moved count.

* fix(shell): handle nil chunk.Fid via chunkVolumeId helper in fs.distributeChunks

chunk.Fid.GetVolumeId() silently returns 0 for legacy chunks stored with
a FileId string instead of a Fid struct, causing them to be skipped in
the replica balancing loop and looked up incorrectly in volumeInfoMap.

Introduce chunkVolumeId() that uses Fid when present and falls back to
parsing the legacy FileId string, matching the logic in
buildDistributionCounts. Apply it in the replica-mode copies loop and
in executeChunkMoves' replication/collection lookup.

* fix(shell): use already-parsed oldFid for volumeInfoMap lookup in fs.distributeChunks

chunkVolumeId(chunk) was being called to look up replication/collection
after oldFid had already been parsed and validated. Use oldFid.VolumeId
directly to avoid redundant parsing and guarantee the correct volume ID
regardless of whether chunk.Fid is nil.

* fix(shell): improve correctness and robustness in fs.distributeChunks

- Buffer download body before upload so dlCtx timeout only covers the
  GET request; upload runs with context.Background() via bytes.NewReader
- Replace 'before, after := strings.Cut(...)' + '_ = before' with '_'
  as the first return value directly
- Clone copiesCount before replica planner mutates it, keeping the
  caller's map immutable
- Add nil-entry guard after filer LookupEntry to prevent panic on
  unexpected nil response

* feat(shell): support chunk manifests in fs.distributeChunks

Large files stored as chunk manifests were previously rejected. Resolve
manifests up front via filer.ResolveChunkManifest, redistribute the
underlying data chunks, then re-pack through filer.MaybeManifestize
before UpdateEntry. The filer's MinusChunks resolves manifests on both
sides of the diff, so old manifest and inner data chunks are GC'd
automatically.

* fix(shell): match master's SaveDataAsChunkFunctionType 5-param signature

Master added expectedDataSize uint64; ignore it in shell-side saveAsChunk.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-17 21:09:36 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Chris Lu
10e7f0f2bc fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy

Instead of erroring when the user already exists, the command now
creates the policy and attaches it to the existing user via UpdateUser.
Credentials are only generated and displayed for newly created users.

* fix(shell): skip duplicate policy attachment in s3.user.provision

Check if the policy is already attached before appending and calling
UpdateUser, making repeated runs idempotent.

* fix(shell): generate service account ID in s3.serviceaccount.create

The command built a ServiceAccount proto without setting Id, which was
rejected by credential.ValidateServiceAccountId on any real store. Now
generates sa:<parent>:<uuid> matching the format used by the admin UI.

* test(s3): integration tests for s3.* shell commands

Adds TestShell* integration tests covering ~40 previously untested
shell commands: user, accesskey, group, serviceaccount, anonymous,
bucket, policy.attach/detach, config.show, and iam.export/import.

Switches the test cluster's credential store from memory to filer_etc
because the memory store silently drops groups and service accounts
in LoadConfiguration/SaveConfiguration.

* fix(shell): rollback policy on key generation failure in s3.user.provision

If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after
the policy was persisted, the policy would be left orphaned. Extracts
the rollback logic into a local closure and invokes it on all failure
paths after policy creation for consistency.

* address PR review feedback for s3 shell tests and serviceaccount

- s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for
  the service account UUID instead of 4 bytes to eliminate collision risk
- s3.serviceaccount.create: print the actual ID and drop the outdated
  "server-assigned" note (the ID is now client-generated)
- tests: guard createdAK in accesskey rotate/delete subtests so sibling
  failures don't run invalid CLI calls
- tests: requireContains/requireNotContains use t.Fatalf to fail fast
- tests: Provision subtest asserts the "Attached policy" message on the
  second provision call for an existing user
- tests: update extractServiceAccountID comment example to match the
  sa:<parent>:<uuid> format
- tests: drop redundant saID empty-check (extractServiceAccountID fatals)

* test(s3): use t.Fatalf for precondition check in serviceaccount test
2026-04-11 22:30:51 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
b1265de78f feat(shell): add group management commands (#8993)
* feat(shell): add group management commands

Add weed shell commands for IAM group management:
- s3.group.create -name <group>
- s3.group.delete -name <group>
- s3.group.list
- s3.group.show -name <group>
- s3.group.add-user -group <group> -user <user>
- s3.group.remove-user -group <group> -user <user>

All commands use GetConfiguration/PutConfiguration gRPC pattern,
consistent with existing shell commands like s3.user.list.

* fix: add nil check for Configuration in group shell commands

Guard against nil Configuration response from GetConfiguration
gRPC call to prevent potential panics. (Gemini review)
2026-04-08 14:03:26 -07:00
Chris Lu
7f3908297c fix(weed/shell): suppress prompt when piped (#8990)
* fix(weed/shell): suppress prompt when stdin or stdout is not a TTY

When piping weed shell output (e.g. `echo "s3.user.list" | weed shell | jq`),
the "> " prompt was written to stdout, breaking JSON parsers.

`liner.TerminalSupported()` only checks platform support, not whether
stdin/stdout are actual TTYs. Add explicit checks using `term.IsTerminal()`
so the shell falls back to the non-interactive scanner path when piped.

Fixes #8962

* fix(weed/shell): suppress informational logs unless -verbose is set

Suppress glog info messages and connection status logs on stderr by
default. Add -verbose flag to opt in to the previous noisy behavior.
This keeps piped output clean (e.g. `echo "s3.user.list" | weed shell | jq`).

* fix(weed/shell): defer liner init until after TTY check

Move liner.NewLiner() and related setup (history, completion, interrupt
handler) inside the interactive block so the terminal is not put into
raw mode when stdout is redirected. Previously, liner would set raw mode
unconditionally at startup, leaving the terminal broken when falling
back to the scanner path.

Addresses review feedback from gemini-code-assist.

* refactor(weed/shell): consolidate verbose logging into single block

Group all verbose stderr output within one conditional block instead of
scattering three separate if-verbose checks around the filer logic.

Addresses review feedback from gemini-code-assist.

* fix(weed/shell): clean up global liner state and suppress logtostderr

- Set line=nil after Close() to prevent stale state if RunShell is
  called again (e.g. in tests)
- Add nil check in OnInterrupt handler for non-interactive sessions
- Also set logtostderr=false when not verbose, in case it was enabled

Addresses review feedback from gemini-code-assist.

* refactor(weed/shell): make liner state local to eliminate data race

Replace the package-level `line` variable with a local variable in
RunShell, passing it explicitly to setCompletionHandler, loadHistory,
and saveHistory. This eliminates a data race between the OnInterrupt
goroutine and the defer that previously set the global to nil.

Addresses review feedback from gemini-code-assist.

* rename(weed/shell): rename -verbose flag to -debug

Avoid conflict with -verbose flags already used by individual shell
commands (e.g. ec.encode, volume.fix.replication, volume.check.disk).
2026-04-08 13:07:15 -07:00
Chris Lu
74905c4b5d shell: s3.* commands always output JSON, connection messages to stderr (#8976)
* shell: s3.* commands output JSON, connection messages to stderr

All s3.user.* and s3.policy.attach|detach commands now output structured
JSON to stdout instead of human-readable text:

- s3.user.create: {"name","access_key"} (secret key to stderr only)
- s3.user.list: [{name,status,policies,keys}]
- s3.user.show: {name,status,source,account,policies,credentials,...}
- s3.user.delete: {"name"}
- s3.user.enable/disable: {"name","status"}
- s3.policy.attach/detach: {"policy","user"}

Connection startup messages (master/filer) moved to stderr so they
don't pollute structured output when piping.

Closes #8962 (partial — covers merged s3.user/policy commands).

* shell: fix secret leak, duplicate JSON output, and non-interactive prompt

- s3.user.create: only echo secret key to stderr when auto-generated,
  never echo caller-supplied secrets
- s3.user.enable/disable: fix duplicate JSON output — remove inner
  write in early-return path, keep single write site after gRPC call
- shell_liner: use bufio.Scanner when stdin is not a terminal instead
  of liner.Prompt, suppressing the "> " prompt in piped mode

* shell: check scanner error, idempotent enable output, history errors to stderr

- Check scanner.Err() after non-interactive input loop to surface read errors
- s3.user.enable: always emit JSON regardless of current state (idempotent)
- saveHistory: write error messages to stderr instead of stdout
2026-04-07 16:27:21 -07:00
Chris Lu
fb0573ffc4 shell: rename -force to -apply in s3.iam.import for consistency 2026-04-07 14:17:07 -07:00
Chris Lu
d50889002b shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands (#8956)
* shell: add s3.iam.*, s3.config.show, s3.user.provision; hide legacy commands

Add import/export, configuration summary, and a convenience provisioning
command:

- s3.iam.export: dump full IAM state as JSON (stdout or file)
- s3.iam.import: replace IAM state from a JSON file
- s3.config.show: human-readable summary (users, policies, service
  accounts, groups with status and counts)
- s3.user.provision: one-step user+policy+credentials creation for
  common readonly/readwrite/admin roles

Hide legacy commands from help listing:
- s3.configure: still works but hidden from help output
- s3.bucket.access: still works but hidden from help output

Both hidden commands remain fully functional for existing scripts.

Also adds a Hidden command tag and filters it from printGenericHelp.

* shell: address review feedback for s3.iam.*, s3.config.show, s3.user.provision

- Simplify joinMax using strings.Join
- Fix rolePolicies: remove s3:ListBucket from object-level actions
  (already covered by bucket-level statement)
- Fix admin role: grant s3:* on bucket resource too
- Return flag parse errors instead of swallowing them

* shell: address missed review feedback for PR 3

- s3.iam.import: require -force flag for destructive IAM overwrite
- s3.config.show: add nil guard for resp.Configuration
- s3.user.provision: check if user exists before creating policy
- s3.user.provision: reject wildcard bucket names (* ?)

* shell: distinguish NotFound from transient errors in provision, use %w wrapping

- s3.user.provision: check gRPC status code on GetUser error — only
  proceed on NotFound, abort on transient/network errors
- s3.iam.import: use %w for error wrapping to preserve error chains,
  wrap PutConfiguration error with context

* shell: remove duplicate joinMax after PR 8954 merge

command_s3_helpers.go defined joinMax which is already in
command_s3_user_list.go from the merged PR 8954.

* shell: restrict export file permissions, rollback policy on user create failure

- s3.iam.export: use os.OpenFile with mode 0600 instead of os.Create
  to protect exported credentials from other users
- s3.user.provision: rollback the created policy if CreateUser fails,
  with a warning if the rollback itself fails
2026-04-07 14:10:15 -07:00
Chris Lu
45bf3ad058 shell: add s3.user.* and s3.policy.attach|detach commands (#8954)
* shell: add s3.user.* and s3.policy.attach|detach commands

Add focused IAM shell commands following a noun-verb model:

- s3.user.create: create user with auto-generated or explicit credentials
- s3.user.list: tabular listing with status, policies, key count
- s3.user.show: detailed user view (status, source, policies, credentials)
- s3.user.delete: delete a user
- s3.user.enable: enable a disabled user
- s3.user.disable: disable a user (preserves credentials and policies)
- s3.policy.attach: attach a named policy to a user
- s3.policy.detach: detach a policy from a user

These commands are thin wrappers over the existing IAM gRPC service,
producing human-readable output instead of raw protobuf text.

This is part of a larger effort to replace the monolithic s3.configure
command with a composable set of single-purpose commands.

* shell: address review feedback for s3.user.* and s3.policy.attach|detach

- Return flag parse errors instead of swallowing them (all commands)
- Use GetConfiguration instead of N+1 GetUser calls in s3.user.list
- Add nil check for resp.Identity in s3.user.show
- Fix GetPolicy error masking in s3.policy.attach (wrap original error)
- Simplify joinMax using strings.Join

* shell: add nil identity guards and wrap gRPC errors

- Add nil check for resp.Identity in policy_attach, policy_detach,
  user_enable, user_disable
- Wrap GetUser errors with user context for better diagnostics
2026-04-07 11:26:57 -07:00
Chris Lu
d123a2768b shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands (#8955)
* shell: add s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.* commands

Add credential, anonymous access, and service account management commands:

Access key commands:
- s3.accesskey.create: add credentials to an existing user
- s3.accesskey.list: list access keys for a user (key ID + status)
- s3.accesskey.delete: remove a specific access key
- s3.accesskey.rotate: atomic create-new + delete-old key rotation

Anonymous access commands:
- s3.anonymous.set: set/remove public access on a bucket
- s3.anonymous.get: show anonymous access for a bucket
- s3.anonymous.list: list all buckets with anonymous access

Service account commands:
- s3.serviceaccount.create: create with optional action subset and expiry
- s3.serviceaccount.list: tabular listing, optionally filtered by parent
- s3.serviceaccount.show: detailed view of a service account
- s3.serviceaccount.delete: remove a service account

These replace the credential and anonymous portions of the monolithic
s3.configure and s3.bucket.access commands.

* shell: address review feedback for s3.accesskey.*, s3.anonymous.*, s3.serviceaccount.*

- Return flag parse errors instead of swallowing them (all commands)
- Add action validation in s3.anonymous.set (Read, Write, List, Tagging, Admin)
- Fix s3.serviceaccount.create output: note to use list for server-assigned ID
  since CreateServiceAccountResponse does not return the ID

* shell: fix bucket matching and action validation in s3.anonymous.*

- Use SplitN instead of HasSuffix for bucket name matching to avoid
  false positives when one bucket name is a suffix of another
- Make action validation case-insensitive with canonical normalization

* shell: fix nil panics, dedup actions, validate service account actions

- Fix nil-pointer panic in getOrCreateAnonymousUser when GetUser returns
  err==nil with nil Identity (status.FromError(nil) returns nil status)
- Add nil Identity guards in s3.anonymous.get and s3.anonymous.list
- Deduplicate action values in s3.anonymous.set (e.g. -access Read,Read)
- Add action validation in s3.serviceaccount.create with case normalization

* shell: dedup actions and reject negative expiry in s3.serviceaccount.create

- Deduplicate -actions values (e.g. Read,read,Read produces one entry)
- Reject negative -expiry values instead of silently treating as no expiration
2026-04-07 11:20:15 -07:00
Chris Lu
0fed72d95a volume.tier.move: fulfill target replication before deleting old replicas (#8950)
* volume.tier.move: fulfill target replication before deleting old replicas

When -toReplication is specified, volume.tier.move now creates all
required replicas on the destination tier before deleting old replicas.
This closes the data-loss window where only one copy existed on the
target tier while awaiting volume.fix.replication.

If replication fulfillment fails, old replicas are preserved and marked
writable so the volume remains accessible.

Also extracts replicateVolumeToServer and configureVolumeReplication
helpers to reduce duplication across volume.tier.move and
volume.fix.replication.

Fixes #8937

* volume.tier.move: always fulfill replication before deleting old replicas

When -toReplication is specified, use that replication setting.
Otherwise, read the volume's existing replication from the super block.
In both cases, all required replicas are created on the destination
tier before old replicas are deleted.

If replication fulfillment fails (e.g. not enough destination nodes),
old replicas are preserved and marked writable so no data is lost.

* volume.tier.move: address review feedback on ensureReplicationFulfilled

- Add 5s delay before re-collecting topology to allow master heartbeat
  propagation after the move
- Add nil guard for targetTierReplicas to prevent panic if the moved
  replica is not yet visible in the topology
- Treat configureVolumeReplication failure as a hard error instead of a
  warning, so the rollback logic preserves old replicas

* volume.tier.move: harden replication config error handling

- Make configureVolumeReplication failure on the primary moved replica a
  hard error that aborts the move, instead of logging and continuing
- Configure replication metadata on all existing target-tier replicas
  (not just newly created ones) when -toReplication is specified
- Deletion of old replicas cannot affect new replicas since the
  locations list only contains pre-move servers (verified, no change)

* volume.tier.move: fix cleanup deleting fulfilled replicas and broken recovery

Fix 1: The cleanup loop now preserves pre-existing target-tier replicas
that ensureReplicationFulfilled counted toward the replication target.
Previously, a mixed-tier volume with an existing replica on the target
tier could have that replica deleted right after being counted as
fulfilled, leaving the volume under-replicated.

ensureReplicationFulfilled now returns a preserveServers set that the
deletion loop checks before removing any old replica.

Fix 2: Failure paths after LiveMoveVolume (which deletes the source
replica) now use restoreSurvivingReplicasWritable instead of
markVolumeReplicasWritable. The old helper stopped on first error, so
attempting to mark the already-deleted source writable would prevent
all surviving replicas from being restored. The new helper skips the
deleted source and continues through all remaining locations, logging
per-replica errors instead of aborting.

* volume.tier.move: mark preserved replicas writable, skip nodes with existing volume

Fix 1: Preserved pre-existing target-tier replicas were left read-only
after the move completed. They were marked read-only at the start
(along with all other replicas) but never restored since the old code
deleted them. Now they are explicitly marked writable before cleanup.

Fix 2: The fulfillment loop could pick a candidate node that already
hosts this volume on a different disk type, causing a VolumeCopy
conflict. Added a guard that skips any node already hosting the volume
(on any disk) before attempting replication.
2026-04-06 14:55:37 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
qzh
4c72512ea2 fix(shell): avoid marking skipped or unplaced volumes as fixed (#8866)
* fix(s3api): fix AWS Signature V2 format and validation

* fix(s3api): Skip space after "AWS" prefix (+1 offset)

* test(s3api): add unit tests for Signature V2 authentication fix

* fix(s3api): simply comparing signatures

* validation for the colon extraction in expectedAuth

* fix(shell): avoid marking skipped or unplaced volumes as fixed

---------

Co-authored-by: chrislu <chris.lu@gmail.com>
Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
2026-04-01 01:20:25 -07:00
Chris Lu
af68449a26 Process .ecj deletions during EC decode and vacuum decoded volume (#8863)
* Process .ecj deletions during EC decode and vacuum decoded volume (#8798)

When decoding EC volumes back to normal volumes, deletions recorded in
the .ecj journal were not being applied before computing the dat file
size or checking for live needles. This caused the decoded volume to
include data for deleted files and could produce false positives in the
all-deleted check.

- Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in
  VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first
- Vacuum the decoded volume after mounting in ec.decode to compact out
  deleted needle data from the .dat file
- Add integration tests for decoding with non-empty .ecj files

* storage: add offline volume compaction helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: compact decoded volumes before deleting shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: address PR review comments

- Fall back to data directory for .ecx when idx directory lacks it
- Make compaction failure non-fatal during EC decode
- Remove misleading "buffer: 10%" from space check error message

* ec: collect .ecj from all shard locations during decode

Each server's .ecj only contains deletions for needles whose data
resides in shards held by that server. Previously, sources with no
new data shards to contribute were skipped entirely, losing their
.ecj deletion entries. Now .ecj is always appended from every shard
location so RebuildEcxFile sees the full set of deletions.

* ec: add integration tests for .ecj collection during decode

TestEcDecodePreservesDeletedNeedles: verifies that needles deleted
via VolumeEcBlobDelete are excluded from the decoded volume.

TestEcDecodeCollectsEcjFromPeer: regression test for the fix in
collectEcShards. Deletes a needle only on a peer server that holds
no new data shards, then verifies the deletion survives decode via
.ecj collection.

* ec: address review nits in decode and tests

- Remove double error wrapping in mountDecodedVolume
- Check VolumeUnmount error in peer ecj test
- Assert 404 specifically for deleted needles, fail on 5xx

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-01 01:15:26 -07:00
Chris Lu
f256002d0b fix ec.balance failing to rebalance when all nodes share all volumes (#8796)
* fix ec.balance failing to rebalance when all nodes share all volumes (#8793)

Two bugs in doBalanceEcRack prevented rebalancing:

1. Sorting by freeEcSlot instead of actual shard count caused incorrect
   empty/full node selection when nodes have different total capacities.

2. The volume-level check skipped any volume already present on the
   target node. When every node has a shard of every volume (common
   with many EC volumes across N nodes with N shards each), no moves
   were possible.

Fix: sort by actual shard count, and use a two-pass approach - first
prefer moving shards of volumes not on the target (best diversity),
then fall back to moving specific shard IDs not yet on the target.

* add test simulating real cluster topology from issue #8793

Uses the actual node addresses and mixed max capacities (80 vs 33)
from the reporter's 14-node cluster to verify ec.balance correctly
rebalances with heterogeneous node sizes.

* fix pass comments to match 0-indexed loop variable
2026-03-27 11:14:10 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Chris Lu
ccc662b90b shell: add s3.bucket.access command for anonymous access policy (#8774)
* shell: add s3.bucket.access command for anonymous access policy (#7738)

Add a new weed shell command to view or change the anonymous access
policy of an S3 bucket without external tools.

Usage:
  s3.bucket.access -name <bucket> -access read,list
  s3.bucket.access -name <bucket> -access none

Supported permissions: read, write, list. The command writes a standard
bucket policy with Principal "*" and warns if no anonymous IAM identity
exists.

* shell: fix anonymous identity hint in s3.bucket.access warning

The anonymous identity doesn't need IAM actions — the bucket policy
controls what anonymous users can do.

* shell: only warn about anonymous identity when write access is set

Read and list operations use AuthWithPublicRead which evaluates bucket
policies directly without requiring the anonymous identity. Only write
operations go through the normal auth flow that needs it.

* shell: rewrite s3.bucket.access to use IAM actions instead of bucket policies

Replace the bucket policy approach with direct IAM identity actions,
matching the s3.configure pattern. The user is auto-created if it does
not exist.

Usage:
  s3.bucket.access -name <bucket> -user anonymous -access Read,List
  s3.bucket.access -name <bucket> -user anonymous -access none
  s3.bucket.access -name <bucket> -user anonymous

Actions are stored as "Action:bucket" on the identity, same as
s3.configure -actions=Read -buckets=my-bucket.

* shell: return flag parse errors instead of swallowing them

* shell: normalize action names case-insensitively in s3.bucket.access

Accept actions in any case (read, READ, Read) and normalize to canonical
form (Read, Write, List, etc.) before storing. This matches the
case-insensitive handling of "none" and avoids confusing rejections.
2026-03-25 23:09:53 -07:00
Chris Lu
7fbdb9b7b7 feat(shell): add volume.tier.compact command to reclaim cloud storage space (#8715)
* feat(shell): add volume.tier.compact command to reclaim cloud storage space

Adds a new shell command that automates compaction of cloud tier volumes.
When files are deleted from remote-tiered volumes, space is not reclaimed
on the cloud storage. This command orchestrates: download from remote,
compact locally, and re-upload to reclaim deleted space.

Closes #8563

* fix: log cleanup errors in compactVolumeOnServer instead of discarding them

Helps operators diagnose leftover temp files (.cpd/.cpx) if cleanup
fails after a compaction or commit failure.

* fix: return aggregate error from loop and use regex for collection filter

- Track and return error count when one or more volumes fail to compact,
  so callers see partial failures instead of always getting nil.
- Use compileCollectionPattern for -collection in -volumeId mode too, so
  regex patterns work consistently with the flag description. Empty
  pattern (no -collection given) matches all collections.
2026-03-20 23:52:12 -07:00
Chris Lu
81369b8a83 improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync

Three main throughput improvements:

1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
   instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
   (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
   overhead (volume assign, gRPC call, needle write) by 3x.

2. Configurable concurrency at every layer:
   - remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
   - remote.cache S3 download concurrency: -downloadConcurrency flag
     (default raised from 1 to 5 per chunk)
   - filer.sync chunk concurrency: -chunkConcurrency flag (default 32)

3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
   downloader was using Concurrency=1, serializing all part downloads
   within each chunk. This alone can 5x per-chunk download speed.

The concurrency values flow through the gRPC request chain:
  shell command → CacheRemoteObjectToLocalClusterRequest →
  FetchAndWriteNeedleRequest → S3 downloader

Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.

Ref #8481

* fix: use full maxMB for chunk size cap and remove loop guard

Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
  unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
  can always grow past maxChunkSize when needed to stay under 1000
  chunks (e.g., extremely large files with small maxMB).

* address review feedback: help text, validation, naming, docs

- Fix help text for -chunkConcurrency and -downloadConcurrency flags
  to say "0 = server default" instead of advertising specific numeric
  defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
  range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
  during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
  ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.

* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs

- Use ceiling division for chunk count check to avoid overcounting
  when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
  at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
  falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
  the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
  default 8") so users see the actual values in -h output.

* fix data race on executionErr and use %w for error wrapping

- Protect concurrent writes to executionErr in remote.cache worker
  goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
  to preserve the error chain for errors.Is/errors.As callers.
2026-03-17 16:49:56 -07:00
Chris Lu
c4d642b8aa fix(ec): gather shards from all disk locations before rebuild (#8633)
* fix(ec): gather shards from all disk locations before rebuild (#8631)

Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:

1. VolumeEcShardsRebuild only looked at a single disk location for shard
   files. On multi-disk servers, the existing local shards could be on one
   disk while copied shards were placed on another, causing the rebuild to
   see fewer shards than actually available.

2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
   len(vs.store.Locations) > 0) that was always true, making the
   FindFreeLocation fallback dead code. This meant copies always went to
   Locations[0] regardless of where existing shards were.

Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
  then gathers shard files from other locations via hard links (or
  symlinks for cross-device) before rebuilding. Gathered files are
  cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
  (explicitly set). Otherwise, it prefers the location that already has
  the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
  error message when not enough shards are found, instead of passing
  through to the opaque reedsolomon "too few shards given" error.

* fix(ec): update test to match skip behavior for unrepairable volumes

The test expected an error for volumes with insufficient shards, but
commit 5acb4578a changed unrepairable volumes to be skipped with a log
message instead of returning an error. Update the test to verify the
skip behavior and log output.

* fix(ec): address PR review comments

- Add comment clarifying DiskId=0 means "not specified" (protobuf default),
  callers must use DiskId >= 1 to target a specific disk.
- Log warnings on cleanup failures for gathered shard links.

* fix(ec): read shard files from other disks directly instead of linking

Replace the hard link / symlink gathering approach with passing
additional search directories into RebuildEcFiles. The rebuild
function now opens shard files directly from whichever disk they
live on, avoiding filesystem link operations and cleanup.

RebuildEcFiles and RebuildEcFilesWithContext gain a variadic
additionalDirs parameter (backward compatible with existing callers).

* fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment

* fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations

- generateMissingEcFiles: two-pass approach — first discover present/missing
  shards and check reconstructability, only then create output files. This
  avoids leaving behind empty truncated shard files when there are too few
  shards to rebuild.

- VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations.
  A location with an .ecx file but no shard files (all shards on other disks)
  is now a valid rebuild candidate instead of being silently skipped.

* fix(ec): select ecx-only location as rebuildLocation when none chosen yet

When rebuildLocation is nil and a location has hasEcx=true but
existingShardCount=0 (all shards on other disks), the condition
0 > 0 was false so it was never promoted to rebuildLocation.
Add rebuildLocation == nil to the predicate so the first location
with an .ecx file is always selected as a candidate.
2026-03-14 20:59:47 -07:00
Chris Lu
5acb4578ab Fix ec.rebuild failing on unrepairable volumes instead of skipping (#8632)
* Fix ec.rebuild failing on unrepairable volumes instead of skipping them

When an EC volume has fewer shards than DataShardsCount, ec.rebuild would
return an error and abort the entire operation. Now it logs a warning and
continues rebuilding the remaining volumes.

Fixes #8630

* Remove duplicate volume ID in unrepairable log message

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-14 16:18:29 -07:00
Chris Lu
f3c5ba3cd6 feat(filer): add lazy directory listing for remote mounts (#8615)
* feat(filer): add lazy directory listing for remote mounts

Directory listings on remote mounts previously only queried the local
filer store. With lazy mounts the listing was empty; with eager mounts
it went stale over time.

Add on-demand directory listing that fetches from remote and caches
results with a 5-minute TTL:

- Add `ListDirectory` to `RemoteStorageClient` interface (delimiter-based,
  single-level listing, separate from recursive `Traverse`)
- Implement in S3, GCS, and Azure backends using each platform's
  hierarchical listing API
- Add `maybeLazyListFromRemote` to filer: before each directory listing,
  check if the directory is under a remote mount with an expired cache,
  fetch from remote, persist entries to the local store, then let existing
  listing logic run on the populated store
- Use singleflight to deduplicate concurrent requests for the same directory
- Skip local-only entries (no RemoteEntry) to avoid overwriting unsynced uploads
- Errors are logged and swallowed (availability over consistency)

* refactor: extract xattr key to constant xattrRemoteListingSyncedAt

* feat: make listing cache TTL configurable per mount via listing_cache_ttl_seconds

Add listing_cache_ttl_seconds field to RemoteStorageLocation protobuf.
When 0 (default), lazy directory listing is disabled for that mount.
When >0, enables on-demand directory listing with the specified TTL.

Expose as -listingCacheTTL flag on remote.mount command.

* refactor: address review feedback for lazy directory listing

- Add context.Context to ListDirectory interface and all implementations
- Capture startTime before remote call for accurate TTL tracking
- Simplify S3 ListDirectory using ListObjectsV2PagesWithContext
- Make maybeLazyListFromRemote return void (errors always swallowed)
- Remove redundant trailing-slash path manipulation in caller
- Update tests to match new signatures

* When an existing entry has Remote != nil, we should merge remote metadata   into it rather than replacing it.

* fix(gcs): wrap ListDirectory iterator error with context

The raw iterator error was returned without bucket/path context,
making it harder to debug. Wrap it consistently with the S3 pattern.

* fix(s3): guard against nil pointer dereference in Traverse and ListDirectory

Some S3-compatible backends may return nil for LastModified, Size, or
ETag fields. Check for nil before dereferencing to prevent panics.

* fix(filer): remove blanket 2-minute timeout from lazy listing context

Individual SDK operations (S3, GCS, Azure) already have per-request
timeouts and retry policies. The blanket timeout could cut off large
directory listings mid-operation even though individual pages were
succeeding.

* fix(filer): preserve trace context in lazy listing with WithoutCancel

Use context.WithoutCancel(ctx) instead of context.Background() so
trace/span values from the incoming request are retained for
distributed tracing, while still decoupling cancellation.

* fix(filer): use Store.FindEntry for internal lookups, add Uid/Gid to files, fix updateDirectoryListingSyncedAt

- Use f.Store.FindEntry instead of f.FindEntry for staleness check and
  child lookups to avoid unnecessary lazy-fetch overhead
- Set OS_UID/OS_GID on new file entries for consistency with directories
- In updateDirectoryListingSyncedAt, use Store.UpdateEntry for existing
  directories instead of CreateEntry to avoid deleteChunksIfNotNew and
  NotifyUpdateEvent side effects

* fix(filer): distinguish not-found from store errors in lazy listing

Previously, any error from Store.FindEntry was treated as "not found,"
which could cause entry recreation/overwrite on transient DB failures.
Now check for filer_pb.ErrNotFound explicitly and skip entries or
bail out on real store errors.

* refactor(filer): use errors.Is for ErrNotFound comparisons
2026-03-13 09:36:54 -07:00
Peter Dodd
0e570d6a8f feat(remote.mount): add -metadataStrategy flag to control metadata caching (#8568)
* feat(remote): add -noSync flag to skip upfront metadata pull on mount

Made-with: Cursor

* refactor(remote): split mount setup from metadata sync

Extract ensureMountDirectory for create/validate; call pullMetadata
directly when sync is needed. Caller controls sync step for -noSync.

Made-with: Cursor

* fix(remote): validate mount root when -noSync so bad bucket/creds fail fast

When -noSync is used, perform a cheap remote check (ListBuckets and
verify bucket exists) instead of skipping all remote I/O. Invalid
buckets or credentials now fail at mount time.

Made-with: Cursor

* test(remote): add TestRemoteMountNoSync for -noSync mount and persisted mapping

Made-with: Cursor

* test(remote): assert no upfront metadata after -noSync mount

After remote.mount -noSync, run fs.ls on the mount dir and assert empty
listing so the test fails if pullMetadata was invoked eagerly.

Made-with: Cursor

* fix(remote): propagate non-ErrNotFound lookup errors in ensureMountDirectory

Return lookupErr immediately for any LookupDirectoryEntry failure that
is not filer_pb.ErrNotFound, so only the not-found case creates the
entry and other lookup failures are reported to the caller.

Made-with: Cursor

* fix(remote): use errors.Is for ErrNotFound in ensureMountDirectory

Replace fragile strings.Contains(lookupErr.Error(), ...) with
errors.Is(lookupErr, filer_pb.ErrNotFound) before calling CreateEntry.

Made-with: Cursor

* fix(remote): use LookupEntry so ErrNotFound is recognised after gRPC

Raw gRPC LookupDirectoryEntry returns a status error, not the sentinel,
so errors.Is(lookupErr, filer_pb.ErrNotFound) was always false. Use
filer_pb.LookupEntry which normalises not-found to ErrNotFound so the
mount directory is created when missing.

Made-with: Cursor

* test(remote): ignore weed shell banner in TestRemoteMountNoSync fs.ls count

Exclude master/filer and prompt lines from entry count so the assertion
checks only actual fs.ls output for empty -noSync mount.

Made-with: Cursor

* fix(remote.mount): use 0755 for mount dir, document bucket-less early return

Made-with: Cursor

* feat(remote.mount): replace -noSync with -metadataStrategy=lazy|eager

- Add -metadataStrategy flag (eager default, lazy skips upfront metadata pull)
- Accept lazy/eager case-insensitively; reject invalid values with clear error
- Rename TestRemoteMountNoSync to TestRemoteMountMetadataStrategyLazy
- Add TestRemoteMountMetadataStrategyEager and TestRemoteMountMetadataStrategyInvalid

Made-with: Cursor

* fix(remote.mount): validate strategy and remote before creating mount directory

Move strategy validation and validateMountRoot (lazy path) before
ensureMountDirectory so that invalid strategies or bad bucket/credentials
fail without leaving orphaned directory entries in the filer.

* refactor(remote.mount): remove unused remote param from ensureMountDirectory

The remote *RemoteStorageLocation parameter was left over from the old
syncMetadata signature. Only remoteConf.Name is used inside the function.

* doc(remote.mount): add TODO for HeadBucket-style validation

validateMountRoot currently lists all buckets to verify one exists.
Note the need for a targeted BucketExists method in the interface.

* refactor(remote.mount): use MetadataStrategy type and constants

Replace raw string comparisons with a MetadataStrategy type and
MetadataStrategyEager/MetadataStrategyLazy constants for clarity
and compile-time safety.

* refactor(remote.mount): rename MetadataStrategy to MetadataCacheStrategy

More precisely describes the purpose: controlling how metadata is
cached from the remote, not metadata handling in general.

* fix(remote.mount): remove validateMountRoot from lazy path

Lazy mount's purpose is to skip remote I/O. Validating via ListBuckets
contradicts that, especially on accounts with many buckets. Invalid
buckets or credentials will surface on first lazy access instead.

* fix(test): handle shell exit 0 in TestRemoteMountMetadataStrategyInvalid

The weed shell process exits with code 0 even when individual commands
fail — errors appear in stdout. Check output instead of requiring a
non-nil error.

* test(remote.mount): remove metadataStrategy shell integration tests

These tests only verify string output from a shell process that always
exits 0 — they cannot meaningfully validate eager vs lazy behavior
without a real remote backend.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-12 15:21:07 -07:00
Copilot
013362d2d3 fix(shell): show planned size in fs.mergeVolumes log to clarify size limit check (#8553)
The log message was comparing against the planned size of the destination
volume (including volumes already planned to merge into it) but only
displaying the raw volume size, making the output confusing when the
displayed sizes clearly didn't add up to exceed the limit.
2026-03-11 13:56:13 -07:00
Chris Lu
b799650357 fix(shell): set LastLocalSyncTsNs in remote.copy.local so remote.uncache works (#8604)
remote.uncache checks LastLocalSyncTsNs to determine if a file has been
synced to remote. remote.copy.local was not setting this field, leaving
it at 0, which caused uncache to skip all files uploaded via
remote.copy.local.

Fixes #8602
2026-03-11 12:55:45 -07:00
Chris Lu
e1e5b4a8a6 add admin script worker (#8491)
* admin: add plugin lock coordination

* shell: allow bypassing lock checks

* plugin worker: add admin script handler

* mini: include admin_script in plugin defaults

* admin script UI: drop name and enlarge text

* admin script: add default script

* admin_script: make run interval configurable

* plugin: gate other jobs during admin_script runs

* plugin: use last completed admin_script run

* admin: backfill plugin config defaults

* templ

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* comparable to default version

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* default to run

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* format

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: respect pre-set noLock for fix.replication

* shell: add force no-lock mode for admin scripts

* volume balance worker already exists

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* admin: expose scheduler status JSON

* shell: add sleep command

* shell: restrict sleep syntax

* Revert "shell: respect pre-set noLock for fix.replication"

This reverts commit 2b14e8b826.

* templ

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* fix import

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* less logs

Co-Authored-By: Copilot <223556219+Copilot@users.noreply.github.com>

* Reduce master client logs on canceled contexts

* Update mini default job type count

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-03 15:10:40 -08:00
Chris Lu
2dd3944819 Respect -minFreeSpace during ec.decode (#8467)
* shell: add ec.decode ignoreMinFreeSpace flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: respect minFreeSpace in ec.decode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: rename ec.decode minFreeSpace flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: error when ec.decode has no shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: select ec.decode target with zero shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* shell: adjust free counts across ec.decode

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* unused

* Update weed/shell/command_ec_decode.go

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2026-02-27 23:54:30 -08:00
Chris Lu
9b6fc49946 Chart createBuckets config #8368: Add TTL, Object Lock, and Versioning support (#8375)
* Chart createBuckets config #8368: Add TTL, Object Lock, and Versioning support

* Update weed/shell/command_s3_bucket_versioning.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* address comments

* address comments

* go fmt

* fix failures are still treated like “bucket not found”

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-26 11:56:10 -08:00
Chris Lu
a3cb7fa8cc go fmt 2026-02-25 10:25:23 -08:00
Chris Lu
b565a0cc86 Adds volume.merge command with deduplication and disk-based backend (#8441)
* Enhance volume.merge command with deduplication and disk-based backend

* Fix copyVolume function call with correct argument order and missing bool parameter

* Revert "Fix copyVolume function call with correct argument order and missing bool parameter"

This reverts commit 7b4a190643.

* Fix critical issues: per-replica writable tracking, tail goroutine cancellation via done channel, and debug logging for allocation failures

* Optimize memory usage with watermark approach for duplicate detection

* Fix critical issues: swap copyVolume arguments, increase idle timeout, remove file double-close, use glog for logging

* Replace temporary file with in-memory buffer for needle blob serialization

* test(volume.merge): Add comprehensive unit and integration tests

Add 7 unit tests covering:
- Ordering by timestamp
- Cross-stream duplicate deduplication
- Empty stream handling
- Complex multi-stream deduplication
- Single stream passthrough
- Large needle ID support
- LastModified fallback when timestamp unavailable

Add 2 integration validation tests:
- TestMergeWorkflowValidation: Documents 9-stage merge workflow
- TestMergeEdgeCaseHandling: Validates 10 edge case handling

All tests passing (9/9)

* fix(volume.merge): Use time window for deduplication to handle clock skew

The same needle ID can have different timestamps on different servers due to
clock skew and replication lag. Needles with the same ID within a 5-second
time window are now treated as duplicates (same write with timestamp variance).

Key changes:
- Add mergeDeduplicationWindowNs constant (5 seconds)
- Replace exact timestamp matching with time window comparison
- Use windowInitialized flag to properly detect window transitions
- Add TestMergeNeedleStreamsTimeWindowDeduplication test

This ensures that replicated writes with slight timestamp differences are
properly deduplicated during merge, while separate updates to the same file
ID (outside the window) are preserved.

All tests passing (10/10)

* test: Add volume.merge integration tests with 5 comprehensive test cases

* test: integration tests for volume.merge command

* Fix integration tests: use TripleVolumeCluster for volume.merge testing

- Created new TripleVolumeCluster framework (cluster_triple.go) with 3 volume servers
- Rebuilt weed binary with volume.merge command compiled in
- Updated all 5 integration tests to use TripleVolumeCluster instead of DualVolumeCluster
- Tests now properly allocate volumes on 2 servers and let merge allocate on 3rd
- All 5 integration tests now pass:
  - TestVolumeMergeBasic
  - TestVolumeMergeReadonly
  - TestVolumeMergeRestore
  - TestVolumeMergeTailNeedles
  - TestVolumeMergeDivergentReplicas

* Refactor test framework: use parameterized server count instead of hardcoded

- Renamed TripleVolumeCluster to MultiVolumeCluster with serverCount parameter
- Replaced hardcoded volumePort0/1/2 with slices for flexible server count
- Updated StartTripleVolumeCluster as backward-compatible wrapper calling StartMultiVolumeCluster(t, profile, 3)
- Made directory creation, port allocation, and server startup loop-based
- Updated accessor methods (VolumeAdminAddress, VolumeGRPCAddress, etc.) to support any server count
- All 5 integration tests continue to pass with new parameterized cluster framework
- Enables future testing with 2, 4, 5+ volume servers by calling StartMultiVolumeCluster directly

* Consolidate cluster frameworks: StartDualVolumeCluster now uses MultiVolumeCluster

- Made DualVolumeCluster a type alias for MultiVolumeCluster
- Updated StartDualVolumeCluster to call StartMultiVolumeCluster(t, profile, 2)
- Removed duplicate code from cluster_dual.go (now just 17 lines)
- All existing tests using StartDualVolumeCluster continue to work without changes
- Backward compatible: existing code continues to use the old function signatures
- Added wrapper functions in cluster_multi.go for StartTripleVolumeCluster
- Enables unified cluster management across all test suites

* Address PR review comments: improve error handling and clean up code

- Replace parse error swallow with proper error return
- Log cleanup and restoration errors instead of silently discarding them
- Remove unused offset field from memoryBackendFile struct
- Fix WriteAt buffer truncation bug to preserve trailing bytes
- All unit tests passing (10/10)
- Code compiles successfully

* Fix PR review findings: test improvements and code quality

- Add timeout to runWeedShell to prevent hanging
- Add server 1 readonly status verification in tests
- Assert merge fails when replicas writable (not just log output)
- Replace sleep with polling for writable restoration check
- Fix WriteAt stale data snapshot bug in memoryBackendFile
- Fix startVolume error logging to show current server log
- Fix volumePubPorts double assignment in port allocation
- Rename test to reflect behavior: DoesNotDeduplicateAcrossWindows
- Fix misleading dedup window comment

Unit tests: 10/10 passing
Binary: Compiles successfully

* Fix test assumption: merge command marks volumes readonly automatically

TestVolumeMergeReadonly was expecting merge to fail on writable volumes, but the
merge command is designed to mark volumes readonly as part of its operation. Fixed
test to verify merge succeeds on writable volumes and properly restores writable
state afterward. Removed redundant Test 2 code that duplicated the new behavior.

* fmt

* Fix deduplication logic to correctly handle same-stream vs cross-stream duplicates

The dedup map previously used only NeedleId as key, causing same-stream
overwrites to be incorrectly skipped as duplicates. Changed to track which
stream first processed each needle ID in the current window:

- Cross-stream duplicates (same ID from different streams, within window) are skipped
- Same-stream duplicates (overwrites from same stream) are kept
- Map now stores: needleId -> streamIndex of first occurrence in window

Added TestMergeNeedleStreamsSameStreamDuplicates to verify same-stream
overwrites are preserved while cross-stream duplicates are skipped.

All unit tests passing (11/11)
Binary compiles successfully
2026-02-25 10:12:09 -08:00
Chris Lu
da4edb5fe6 Fix live volume move tail timestamp (#8440)
* Improve move tail timestamp

* Add move tail timestamp integration test

* Simulate traffic during move
2026-02-24 20:07:26 -08:00
Chris Lu
cd6832249b Fix volume.fsck crashing on EC volumes and add multi-volume vacuum support (#8406)
* helm: refine openshift-values.yaml to remove hardcoded UIDs

Remove hardcoded runAsUser, runAsGroup, and fsGroup from the
openshift-values.yaml example. This allows OpenShift's admission
controller to automatically assign a valid UID from the namespace's
allocated range, avoiding "forbidden" errors when UID 1000 is
outside the permissible range.

Updates #8381, #8390.

* helm: fix volume.logs and add consistent security context comments

* Update README.md

* fix volume.fsck crashing on EC volumes and add multi-volume vacuum support

* address comments
2026-02-22 22:07:15 -08:00
Konstantin Lebedev
01b3125815 [shell]: volume balance capacity by min volume density (#8026)
volume balance by min volume density and active volumes
2026-02-19 13:30:59 -08:00
Chris Lu
f44e25b422 fix(iam): ensure access key status is persisted and defaulted to Active (#8341)
* Fix master leader election startup issue

Fixes #error-log-leader-not-selected-yet

* not useful test

* fix(iam): ensure access key status is persisted and defaulted to Active

* make pb

* update tests

* using constants
2026-02-13 20:28:41 -08:00
Lisandro Pin
fbe7dd32c2 Implement full scrubbing for regular volumes (#8254)
Implement full scrubbing for regular volumes.
2026-02-13 15:47:29 -08:00
Lisandro Pin
221bd237c4 Fix file stat collection metric bug for the cluster.status command. (#8302)
When the `--files` flag is present, `cluster.status` will scrape file metrics
from volume servers to provide detailed stats on those. The progress indicator
was not being updated properly though, so the command would complete before
it read 100%.
2026-02-11 13:34:20 -08:00