1106 Commits

Author SHA1 Message Date
Chris Lu 746ed82662 remote.meta.sync: sync directories and remove files deleted from remote (#10184)
* remote.meta.sync: materialize directory entries, including empty ones

Pull metadata by walking the remote tree one directory level at a time
with a delimiter, so subdirectories, including empty ones, are listed as
their own entries and created locally. The previous flat listing only
returned files, so empty remote directories never appeared locally and
non-empty ones only existed as filer-synthesized parents.

* remote.meta.sync: remove local metadata for entries deleted from remote

After reconciling each directory, drop local entries whose remote source
is gone: files are deleted outright, and a directory removed from the
remote is descended into so its remote-backed children are cleaned while
local-only entries are kept. remote.meta.sync exposes -delete (default
on) and remote.mount.buckets reconciles the same way; a plain
remote.mount stays additive.

* remote.meta.sync: reconcile type swaps and prune emptied directories

- when the remote swaps an entry's type (file <-> directory), drop the
  stale local entry and recreate it with the right type; local-only
  entries are left alone
- mark synced directories remote-backed and clean a directory removed
  from the remote locally, deleting it once it holds no local-only
  entries, instead of re-listing the missing remote path
- treat a differing remote size or mtime, not only a newer mtime, as a
  change worth pulling
2026-07-01 12:14:19 -07:00
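The change-detection and delete rules in the entry above can be sketched roughly as follows. This is a minimal illustration, not the actual remote.meta.sync code: the `entry` type and the `changed` and `stale` helpers are hypothetical names, and directory descent is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical, simplified stand-in for a filer entry.
type entry struct {
	Size         int64
	Mtime        int64 // seconds since epoch
	RemoteBacked bool  // false for local-only entries
}

// changed reports whether the remote copy should be pulled: any difference in
// size or mtime counts, not only a newer remote mtime.
func changed(local, remote entry) bool {
	return local.Size != remote.Size || local.Mtime != remote.Mtime
}

// stale returns the names of local remote-backed entries that the remote
// listing no longer contains. Local-only entries are never returned.
func stale(local, remote map[string]entry) []string {
	var names []string
	for name, e := range local {
		if _, ok := remote[name]; !ok && e.RemoteBacked {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	local := map[string]entry{
		"a.txt":    {Size: 10, Mtime: 200, RemoteBacked: true},
		"gone.txt": {Size: 5, Mtime: 100, RemoteBacked: true},
		"notes.md": {Size: 7, Mtime: 150}, // local-only, must survive
	}
	remote := map[string]entry{"a.txt": {Size: 10, Mtime: 180}}
	fmt.Println(changed(local["a.txt"], remote["a.txt"]), stale(local, remote))
}
```

Comparing for inequality rather than "remote is newer" also catches a remote object that was replaced by an older copy.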
Chris Lu 05b4b5bf56 ec: expose force_deleted_needles_check in ScrubEcVolume RPC and shell (#10176)
* ec: expose force_deleted_needles_check in ScrubEcVolume RPC and shell

FULL EC scrubs can opt into strict deleted-needle verification via the
-forceDeletedNeedlesCheck shell flag, off by default since it can report
false positives when EC indexes disagree. Rejected for non-FULL modes.

The Rust volume server parses the new field and ignores it: its FULL
scrub verifies shards via RS parity, not per-needle reads.

* volume: require admin auth for ScrubEcVolume

ScrubEcVolume ran unauthenticated while its sibling ScrubVolume, and the
rest of the mutating volume handlers, gate on checkGrpcAdminAuth. Close
the gap so an EC scrub can't be triggered anonymously.

* shell: reject ec.scrub -forceDeletedNeedlesCheck outside full mode

Fail in the client before fanning out to every volume server, instead of
erroring halfway through once the servers reject the request.
2026-06-30 23:20:50 -07:00
Chris Lu b872d5e683 balance: extract the bytes-aware density metric to a shared package (#10174)
* balance: extract the bytes-aware density metric to weed/topology/balancer

The shell's volume.balance ranks servers by a bytes-aware density (used volume
equivalents over free capacity). Move that math into the shared balancer package
(VolumeDensity / DensityRatio / DensityNextRatio) so the maintenance worker can
adopt the same metric next. Shell behavior is unchanged.

* balance: rank a server with no free capacity as the fullest

DensityRatio/DensityNextRatio divided by capacity, so a server past its slot
limit (negative capacity) returned a negative ratio and sorted as the emptiest
under ascending consumers — the opposite of reality. Treat any non-positive
capacity (full, or overfull mid-run after receiving volumes) as the fullest
(+Inf) so it is a move source, never a target. Covered by negative-capacity and
ordering tests.
2026-06-30 21:26:57 -07:00
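The non-positive-capacity rule from the entry above can be illustrated with a small sketch. The names `densityRatio`, `server`, and `sortByDensity` are hypothetical simplifications; the real DensityRatio / DensityNextRatio signatures are not shown in the log.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// densityRatio is a simplified stand-in: used volume equivalents over
// capacity. A non-positive capacity ranks as the fullest (+Inf) instead of
// producing a negative ratio that would sort as the emptiest.
func densityRatio(used, capacity float64) float64 {
	if capacity <= 0 {
		return math.Inf(1)
	}
	return used / capacity
}

// server is a hypothetical record for this sketch.
type server struct {
	Name           string
	Used, Capacity float64
}

// sortByDensity orders servers from emptiest to fullest.
func sortByDensity(servers []server) {
	sort.SliceStable(servers, func(i, j int) bool {
		return densityRatio(servers[i].Used, servers[i].Capacity) <
			densityRatio(servers[j].Used, servers[j].Capacity)
	})
}

func main() {
	servers := []server{{"over", 12, -2}, {"busy", 8, 2}, {"idle", 1, 9}}
	sortByDensity(servers)
	for _, s := range servers {
		fmt.Println(s.Name, densityRatio(s.Used, s.Capacity))
	}
}
```

With the guard in place, an overfull server always lands at the end of an ascending sort, so it can only be picked as a move source.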
Chris Lu d02ee6d5df balance: share replica-placement logic between shell and worker (#10169)
The replica-placement rule (data-center/rack/same-node limits plus host
anti-affinity) existed three times: the shell's satisfyReplicaPlacement/isGoodMove
used by volume.balance, fix.replication, and tier.move, and a line-for-line port
in the maintenance balance worker. Move the canonical logic into
weed/topology/balancer on a shared Location type; the shell and worker keep thin
adapters that convert their own location representation and call it. Behavior is
unchanged (the shared IsGoodMove keeps the shell's reject-move-to-self guard, and
all four replica test suites pass).
2026-06-30 20:02:23 -07:00
Chris Lu bea1357d38 ec: skip physically near-full disks when placing EC shards (#10167)
EC placement scored destinations purely by free EC shard slots (derived from
maxVolumeCount) and shard counts, blind to real disk fullness — the same defect
as volume balancing. A disk that is physically full but still shows free EC slots
kept being chosen, and EC shard bytes are captured by statfs free space yet not
by any slot accounting, so the slot math is exactly the metric that can't see EC
fullness.

Treat a disk at/above 90% physical usage as having zero free EC slots at
snapshot-build time, so every existing freeSlots>0 placement predicate excludes
it. Applied in all three snapshot builders (shell countFreeShardSlots, the shared
ecbalancer FromActiveTopology, and the worker ec_balance buildBalancerTopology)
via the shared balancer.DiskTooFullAfter gate. Servers not reporting disk bytes
fall back to slot-only behavior. ec.rebuild recovery is left ungated so shard
recovery can still complete onto fuller disks.
2026-06-30 20:01:55 -07:00
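A rough sketch of the snapshot-time rule from the entry above: a disk at or above the threshold reports zero free EC slots, so existing `freeSlots > 0` checks exclude it. The `disk` type and `effectiveFreeSlots` helper are hypothetical; only the 90% threshold and the slot-only fallback come from the commit message.

```go
package main

import "fmt"

// maxDiskUsagePercent is the threshold named in the commit message.
const maxDiskUsagePercent = 90

// disk is a hypothetical snapshot record. TotalBytes == 0 means the server
// did not report disk bytes.
type disk struct {
	FreeSlots             int
	TotalBytes, FreeBytes uint64
}

// effectiveFreeSlots returns 0 for a disk at or above the usage threshold.
// Disks without byte reports keep their slot-only count. The integer math
// assumes TotalBytes*100 fits in a uint64.
func effectiveFreeSlots(d disk) int {
	if d.TotalBytes == 0 {
		return d.FreeSlots
	}
	var used uint64
	if d.FreeBytes < d.TotalBytes {
		used = d.TotalBytes - d.FreeBytes
	}
	if used*100 >= d.TotalBytes*maxDiskUsagePercent {
		return 0
	}
	return d.FreeSlots
}

func main() {
	fmt.Println(effectiveFreeSlots(disk{FreeSlots: 4, TotalBytes: 100, FreeBytes: 10}))
	fmt.Println(effectiveFreeSlots(disk{FreeSlots: 4, TotalBytes: 100, FreeBytes: 50}))
}
```

Zeroing the slot count at build time means no placement predicate has to learn about disk bytes.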
Chris Lu 77bf2a3ab0 volume.balance: gate on real physical disk usage (fixes #10160) (#10162)
* shell: add volume.balance -byDiskUsage to balance by actual data

The default balancer ranks servers by slot density, dividing used volumes by
MaxVolumeCount. When MaxVolumeCount is configured higher than the disk can hold,
a physically near-full server looks nearly empty and gets picked as the move
target, so balancing drains less-full servers onto an already-full one.

-byDiskUsage ranks servers by the actual data they hold (sum of volume sizes)
instead, so the fullest-by-data server is treated as full and balancing drains
it. It assumes comparable disk sizes per disk type and still respects each
server's free volume slots. Default behavior is unchanged.

* plumb physical disk usage into topology, gate volume.balance on it

Volume servers now report each disk's filesystem total/free bytes in the
heartbeat, and the master stores them in DiskInfo. volume.balance uses them to
skip any move target whose disk is already near full (-maxDiskUsagePercent,
default 90), so an over-configured maxVolumeCount can no longer make a
physically full server look empty and get drained onto. The gate judges each
server against its own disk, so heterogeneous disk sizes are fine; servers that
do not report bytes fall back to slot-only behavior.

Rust seaweed-volume mirrors the heartbeat reporting.

* admin: report real physical disk capacity when volume servers provide it

The dashboard estimated server capacity as maxVolumeCount * volumeSizeLimit,
which overstates it when maxVolumeCount is set higher than the disk holds.
Prefer the filesystem capacity now reported per disk, falling back to the
estimate for servers that do not report it.

* worker: gate automatic balance on physical disk fullness too

The maintenance balance worker selects the least slot-utilized server as the
move destination, so an over-configured maxVolumeCount makes a physically full
server look empty and get drained onto — the same defect as the shell command.
Now that DiskInfo carries real disk bytes, skip any destination whose disk is
at/above 90% used (per server, against its own disk); a full server can still be
a source. When every candidate destination is full, create no tasks. Servers
that do not report disk bytes are not gated.

* balance: share the physical-disk-fullness gate between shell and worker

The shell volume.balance command and the maintenance balance worker each grew
their own copy of the disk-fullness gate (targetDiskTooFull / destinationDiskTooFull)
and a maxDiskUsagePercent=90 constant. Pull both into weed/topology/balancer
(DiskTooFullAfter + DefaultMaxDiskUsagePercent) so the policy has one home and the
two balancers can't drift.

* balance: harden the physical-disk gate

Guard against a nil DiskInfo in the byte/slot lookups. Let a zero disk-capacity
report clear previously stored bytes (0 means "not reported" for bytes, unlike
maxVolumeCount), so a server that stops reporting falls back to slot-only instead
of trusting stale capacity. In the worker, charge each planned move's bytes to
its destination within a detection cycle so the gate sees a target fill up rather
than only its heartbeat-time free space. Note the per-location capacity summing
assumes one location per filesystem (the used ratio the gate relies on stays
correct regardless; absolute capacity can over-report).
2026-06-30 19:31:12 -07:00
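The "charge each planned move to its destination" idea from the entry above might look roughly like this. It is a sketch under assumptions: `target`, `tooFullAfter`, and `plan` are hypothetical names, and counting the candidate move's own bytes in the check is an assumption suggested by the DiskTooFullAfter name rather than something the log states.

```go
package main

import "fmt"

// target is a hypothetical view of a move destination.
// TotalBytes == 0 means the server did not report disk bytes.
type target struct {
	TotalBytes, UsedBytes uint64
	planned               uint64 // bytes already assigned in this cycle
}

// tooFullAfter reports whether the disk would be at or above maxPercent once
// the pending moves and this move land. Nil or unreported targets are never
// gated, which mirrors the slot-only fallback.
func tooFullAfter(dst *target, moveBytes, maxPercent uint64) bool {
	if dst == nil || dst.TotalBytes == 0 {
		return false
	}
	after := dst.UsedBytes + dst.planned + moveBytes
	return after*100 >= dst.TotalBytes*maxPercent
}

// plan accepts moves until the gate trips, charging each accepted move to the
// destination so later checks see it filling up. It returns the accepted count.
func plan(dst *target, moves []uint64, maxPercent uint64) int {
	if dst == nil {
		return 0
	}
	accepted := 0
	for _, m := range moves {
		if tooFullAfter(dst, m, maxPercent) {
			continue
		}
		dst.planned += m
		accepted++
	}
	return accepted
}

func main() {
	dst := &target{TotalBytes: 100, UsedBytes: 70}
	fmt.Println(plan(dst, []uint64{10, 10, 10}, 90), dst.planned)
}
```

Without the `planned` counter, every move in the cycle would be judged against the same heartbeat-time usage and the destination could be overfilled.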
Chris Lu a9c0ed91b5 fix(topology): keep physical disk 0 distinct in SplitByPhysicalDisk (#10161)
* fix(topology): keep physical disk 0 distinct in SplitByPhysicalDisk

DiskId 0 doubles as the first physical disk (Locations[0]) and the
protobuf "unset" default. SplitByPhysicalDisk folded every DiskId-0
record onto the aggregate DiskId whenever that was non-zero, so on a
multi-disk node the first disk's volumes merged into whichever disk
held volumes[0]: the node reported one fewer disk, the sibling showed
~2x volumes, and per-disk max was smeared across the survivors. This
surfaced as cluster.status and volume.list undercounting disks.

Only treat 0 as unset when no record carries a non-zero DiskId; with a
mix, 0 is a real disk and keeps its own entry.

* fix(admin): resolve physical disk 0 in active-topology indexes

rebuildIndexes re-derived each volume/EC record's physical disk id with
the same "DiskId 0 means unset" heuristic SplitByPhysicalDisk used, so
the two agreed only by sharing the bug. Now that SplitByPhysicalDisk
keeps disk 0 distinct, the duplicated heuristic would fold disk-0 records
onto a sibling while at.disks kept them on disk 0; GetVolumeLocations and
GetECShardLocations then matched no record and silently dropped every
volume and EC shard on the first disk, starving balance and EC tasks.

Build the indexes from the same SplitByPhysicalDisk reconstruction that
builds at.disks, so the keys always resolve. One source of truth instead
of a parallel normalize.

* fix(ec): allow physical disk 0 as preferred EC shard target

pickBestDiskOnNode gated its result on bestDiskId != 0, but 0 is both a
valid physical disk and the uint32 zero value, so a best-scoring disk 0
was discarded and the non-matching fallback returned instead. Gate on
bestScore.

* test(admin): cover EC-shard index resolution for physical disk 0

rebuildIndexes builds ecShardIndex the same way as volumeIndex; pin the EC
path too so a shard on disk 0 keeps resolving via GetECShardLocations.
2026-06-30 15:35:27 -07:00
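The pickBestDiskOnNode fix in the entry above comes down to not using 0 as a "nothing found" sentinel. A minimal sketch of that pattern, with the hypothetical names `diskScore` and `pickBestDisk` and an assumed non-negative score:

```go
package main

import "fmt"

// diskScore is a hypothetical candidate record.
type diskScore struct {
	DiskId uint32
	Score  float64
}

// pickBestDisk returns the highest-scoring disk and whether one was found.
// The result is gated on the score, not on DiskId != 0, because 0 is both a
// valid physical disk id and the uint32 zero value.
func pickBestDisk(candidates []diskScore) (uint32, bool) {
	var best uint32
	bestScore := -1.0 // assumes real scores are non-negative
	for _, c := range candidates {
		if c.Score > bestScore {
			best, bestScore = c.DiskId, c.Score
		}
	}
	return best, bestScore >= 0
}

func main() {
	id, ok := pickBestDisk([]diskScore{{DiskId: 0, Score: 9}, {DiskId: 1, Score: 3}})
	fmt.Println(id, ok)
}
```

Returning an explicit boolean keeps the "no candidate" case distinguishable even when the winning id is 0.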
Chris Lu a653a7f72a fix(shell): honor explicit fs.mergeVolumes from/to direction (#10159)
* fix(shell): honor explicit fs.mergeVolumes from/to direction

mergeVolumes only ever merged a smaller volume into a larger one. When the
user named both -fromVolumeId and -toVolumeId with the source larger than the
target, the planner produced an empty plan and the command printed just
"max volume size: N MB" and moved nothing.

Build the requested pair directly when both ids are given, instead of routing
through the size-descending heuristic. Read-only, empty, and wrong-collection
endpoints are rejected with a clear error rather than a silent no-op.

* fix(shell): allow fs.mergeVolumes into an empty target volume

Merging chunks into an empty volume is valid, e.g. consolidating data into a
freshly created or recently vacuumed volume. Only reject an empty source, which
has nothing to move.

* fix(shell): reject self-map in directed mergeVolumes planner

createMergePlan with from == to returned a {vid: vid} self-merge when called
directly. Guard it in the planner so it is correct independent of the Do
entrypoint.
2026-06-30 13:28:53 -07:00
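The directed planning rules from the entry above (explicit pair, no self-map, empty target allowed, empty source rejected) can be sketched as below. The `volume` type and `directedPlan` function are hypothetical; the real createMergePlan signature is not shown in the log, and size-limit checks are omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// volume is a hypothetical, reduced view of a volume.
type volume struct {
	Size       uint64
	ReadOnly   bool
	Collection string
}

// directedPlan builds the requested from->to pair directly instead of going
// through a size-based heuristic, and fails loudly on invalid endpoints.
func directedPlan(vols map[uint32]volume, from, to uint32, collection string) (map[uint32]uint32, error) {
	if from == to {
		return nil, errors.New("source and target are the same volume")
	}
	src, ok := vols[from]
	if !ok {
		return nil, fmt.Errorf("source volume %d not found", from)
	}
	dst, ok := vols[to]
	if !ok {
		return nil, fmt.Errorf("target volume %d not found", to)
	}
	switch {
	case src.Size == 0:
		return nil, fmt.Errorf("source volume %d is empty", from)
	case src.ReadOnly || dst.ReadOnly:
		return nil, errors.New("read-only volumes cannot be merged")
	case src.Collection != collection || dst.Collection != collection:
		return nil, fmt.Errorf("both volumes must belong to collection %q", collection)
	}
	return map[uint32]uint32{from: to}, nil
}

func main() {
	vols := map[uint32]volume{1: {Size: 900}, 2: {Size: 100}, 3: {}}
	plan, err := directedPlan(vols, 1, 2, "")
	fmt.Println(plan, err)
}
```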
7y-9 b55a608ae0 feat: add collection pattern to delete empty volumes (#10129)
* feat: add collection pattern to delete empty volumes

Co-authored-by: Codex <noreply@openai.com>

* shell: match collection pattern with wildcard matcher

Use wildcard.MatchesWildcard in the shared collection-pattern helper,
matching command_volume_fix_replication's matchCollectionPattern. The
flag only advertises '*' and '?', which is exactly what the matcher
supports.

---------

Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-29 12:43:19 -07:00
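For readers unfamiliar with `*` / `?` collection patterns like the one in the entry above, here is a small self-contained matcher. It is a stand-in written for illustration, not the wildcard.MatchesWildcard implementation; `matchWildcard` and `filterCollections` are hypothetical names, and matching is byte-wise.

```go
package main

import "fmt"

// matchWildcard reports whether s matches pattern, where '*' matches any run
// of bytes (including none) and '?' matches exactly one byte.
func matchWildcard(pattern, s string) bool {
	p, i := 0, 0
	star, mark := -1, 0 // last '*' position and the input index it covers up to
	for i < len(s) {
		switch {
		case p < len(pattern) && pattern[p] == '*':
			star, mark = p, i
			p++
		case p < len(pattern) && (pattern[p] == '?' || pattern[p] == s[i]):
			p++
			i++
		case star >= 0:
			// Backtrack: let the last '*' absorb one more byte.
			mark++
			p, i = star+1, mark
		default:
			return false
		}
	}
	for p < len(pattern) && pattern[p] == '*' {
		p++
	}
	return p == len(pattern)
}

// filterCollections keeps the names that match the pattern.
func filterCollections(pattern string, names []string) []string {
	var out []string
	for _, n := range names {
		if matchWildcard(pattern, n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	fmt.Println(filterCollections("logs-*", []string{"logs-2026", "images", "logs-"}))
}
```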
7y-9 1e42dd77ca fix: avoid duplicate volume.list parent headers (#10126)
Co-authored-by: Codex <noreply@openai.com>
2026-06-29 11:31:45 -07:00
qzhello 378f9a64ff fix: apply collectionPattern during detection in volume.fix.replication (#10115)
* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.

Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers

* fix: apply collectionPattern during detection in volume.fix.replication

* use existing wildcard.MatchesWildcard for collection matching

It returns a plain bool, so drop the up-front filepath.Match validation
and the path/filepath import that only existed to handle its error.

* trim verbose comments to terse one-liners

* drop redundant per-path collection guards

Detection already filters by replicas[0].info.Collection. The repair guard
re-checked pickOneReplicaToCopyFrom's collection (a different replica), so a
mixed-collection volume could pass detection yet be skipped in repair without
decrementing the counter, spinning the -apply loop. deleteOneVolume keeps its
collectionIsMismatch safety.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-26 00:48:29 -07:00
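The -racks/-nodes fix mentioned in the entry above replaces substring matching with set membership. A minimal sketch of that approach follows; splitCSVSet is named in the log, but this body, the `selected` helper, and the "empty filter selects everything" rule are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSVSet parses "a,b,c" into a set, trimming spaces and skipping empty
// items. An empty input yields an empty (non-nil) set.
func splitCSVSet(csv string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, part := range strings.Split(csv, ",") {
		if part = strings.TrimSpace(part); part != "" {
			set[part] = struct{}{}
		}
	}
	return set
}

// selected reports whether id passes the filter. An empty set means no filter
// was given, so everything is selected (an assumption of this sketch).
func selected(set map[string]struct{}, id string) bool {
	if len(set) == 0 {
		return true
	}
	_, ok := set[id]
	return ok
}

func main() {
	racks := splitCSVSet("rack10, rack2")
	// strings.Contains("rack10", "rack1") is true, which was the old bug.
	fmt.Println(selected(racks, "rack1"), selected(racks, "rack10"))
}
```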
Chris Lu 2efc0e1656 ec: recover EC shards whose .ecx index lives only on a peer server (#10108)
* ec: recover EC shards whose .ecx index lives only on a peer server

A volume server that boots with EC shard files on disk but no .ecx index
on any local disk cannot mount the shards, so the master never learns
about them. ec.rebuild works off master-registered shards, so it sees the
volume as short and gives up even though the shard data is intact.

Add an operator-triggered recovery: VolumeEcShardsMount gains a
recover_missing_index flag that makes the volume server fetch the missing
.ecx (plus .ecj/.vif) from a peer holding it and mount the on-disk shards.
ec.rebuild runs this across the cluster before planning, so orphaned
shards register and the rebuild sees the true shard set.

.ecx is an immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal that differs across holders, so the
recovered node adopts the source peer's deletion view, like a balanced or
rebuilt shard does.

* ec: mirror missing-index recovery into the Rust volume server

Port the #10104 recovery to seaweed-volume so the Rust volume server
self-heals the same layout: EC shards on disk with the .ecx index only on
a peer. Adds collect_ec_volumes_missing_index / mount_recovered_ec_shards
to the store, recover_missing_ec_indexes (master LookupEcVolume + peer
CopyFile fetch + mount) to the server, and the recover_missing_index flag
on VolumeEcShardsMount.

.ecx is the immutable encode-time index, identical on every holder. .ecj
is a per-holder deletion journal, so the recovered node adopts the source
peer's deletion view, matching the Go path.
2026-06-25 10:38:14 -07:00
Chris Lu 95427b5573 security: add BearerPrefix constant for Authorization headers (#10101)
Introduce security.BearerPrefix ("Bearer ", RFC 6750) and use it
everywhere an "Authorization: Bearer <token>" header is constructed,
replacing the scattered "BEARER "/"Bearer " string literals. SeaweedFS
matches the scheme case-insensitively when parsing (security.GetJwt), so
behavior is unchanged; this removes the magic string and settles the
casing on the standard form. The parser's upper-case comparison stays as
is on purpose.
2026-06-24 19:36:42 -07:00
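A small sketch of the pattern in the entry above: one constant for building the header, and scheme matching that ignores case when parsing. The names `bearerPrefix`, `authHeaderValue`, and `parseBearer` are hypothetical, and `strings.EqualFold` is used here in place of the upper-case comparison the log says the real parser keeps.

```go
package main

import (
	"fmt"
	"strings"
)

// bearerPrefix is the RFC 6750 scheme followed by a single space.
const bearerPrefix = "Bearer "

// authHeaderValue builds the value for an Authorization header.
func authHeaderValue(token string) string {
	return bearerPrefix + token
}

// parseBearer extracts the token from an Authorization header value,
// matching the scheme case-insensitively.
func parseBearer(value string) (string, bool) {
	if len(value) < len(bearerPrefix) ||
		!strings.EqualFold(value[:len(bearerPrefix)], bearerPrefix) {
		return "", false
	}
	return value[len(bearerPrefix):], true
}

func main() {
	token, ok := parseBearer("BEARER abc123")
	fmt.Println(token, ok, authHeaderValue("abc123"))
}
```

Because parsing ignores case, switching every writer to the standard casing does not change which requests are accepted.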
Lisandro Pin 30f2dd5040 Weed shell ec.rebuild: Allow targeting rebuild to specific volume IDs. (#10087)
2026-06-24 08:40:29 -07:00
7y-9 ddd11e44f9 feat: support marking volumes by collection (#9585)
* feat: add collection.mark shell command

Add collection.mark to mark all existing normal volume replicas in a collection as readonly or writable. The command runs in preview mode by default and requires -apply to execute changes. It reuses existing volume mark RPCs, supports default collection aliases, skips EC shards, and adds unit tests for option parsing and target collection logic.

* Revert "feat: add collection.mark shell command"

This reverts commit 50c2bbf94c.

* feat: support marking volumes by collection

Add a -collection option to volume.mark so operators can mark every normal volume replica in a collection using existing topology data and volume mark RPCs.

The change keeps the single-volume path unchanged and adds tests for collection target selection, EC shard exclusion, and argument validation.

Co-authored-by: Codex <noreply@openai.com>

* volume.mark: reuse eachDataNode for collection traversal

* volume.mark: continue past per-volume failures and report progress

Collection marking aborted on the first failed RPC, leaving the
collection half-marked with no record of which volumes succeeded.
Mark every reachable volume, print per-volume progress to the writer,
and return an aggregated error naming the failures.

* volume.mark: let -collection _default target the unnamed collection

Other volume commands use the _default sentinel to match volumes with
no named collection; volume.mark could not reach them at all. Map
_default to the empty collection name in the filter.

---------

Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com>
Co-authored-by: Codex <noreply@openai.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-23 11:27:43 -07:00
msementsov 70d9dd5afe volume.balance: add -volumesPerExec to cap moves per run
Limit the number of volume moves performed in one command execution; re-run to continue. 0 = unlimited.
2026-06-23 10:48:33 -07:00
qzhello 9de9dbaa83 fix(shell): exclude failed EC shard copies from rebuild recoverability gate (#10043)
* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.

Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers

* fix(shell): exclude failed EC shard copies from rebuild recoverability gate

prepareDataToRecover incremented the remote-shard counter before the copy
RPC, so in apply mode a failed VolumeEcShardsCopy was still counted toward
the DataShardsCount recoverability gate. The gate could then pass with
fewer real shards than required, deferring the failure to the deeper
generateMissingShards/reconstruct step and reporting an inflated shard
count in the "not enough shards" error.

Count the remote shard only after a successful copy (apply mode) or when
planning (dry-run), and rename wouldCopy to recoverableRemoteShards for
clarity. Add a regression test covering an apply-mode copy failure.

* fix(shell): clean up copied EC shards when the recoverability gate fails

A runtime copy failure can trip the gate after earlier copies already
succeeded, stranding those working shards on the rebuilder. Return the
copied shard ids on the error path and run the cleanup defer even when
recovery fails, so the temp shards get deleted.

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-06-22 11:23:23 -07:00
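The counting fix in the entry above can be sketched as: a remote shard counts toward the gate only after its copy succeeds in apply mode, or unconditionally when planning. `countRecoverable`, `checkGate`, and `errCopy` are hypothetical names; the real prepareDataToRecover does considerably more.

```go
package main

import (
	"errors"
	"fmt"
)

// errCopy simulates a failed shard copy RPC in this sketch.
var errCopy = errors.New("copy failed")

// countRecoverable returns how many remote shards may count toward the
// recoverability gate, plus the ids actually copied (for later cleanup).
// In dry-run mode (apply == false) nothing is copied and every shard counts.
func countRecoverable(shardIds []uint32, apply bool, copyShard func(uint32) error) (int, []uint32) {
	counted := 0
	var copied []uint32
	for _, id := range shardIds {
		if apply {
			if err := copyShard(id); err != nil {
				continue // a failed copy must not count
			}
			copied = append(copied, id)
		}
		counted++
	}
	return counted, copied
}

// checkGate fails when fewer shards than required are available.
func checkGate(local, remote, required int) error {
	if local+remote < required {
		return fmt.Errorf("not enough shards: have %d, need %d", local+remote, required)
	}
	return nil
}

func main() {
	failThird := func(id uint32) error {
		if id == 3 {
			return errCopy
		}
		return nil
	}
	n, copied := countRecoverable([]uint32{1, 2, 3}, true, failThird)
	fmt.Println(n, copied, checkGate(8, n, 10))
}
```

Returning the copied ids even on the failure path is what lets the caller delete the temporary shards when the gate trips.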
qzhello 4f9393889c feat(shell): Support batched EC encode and multi-volume selection in ec.encode (#10030)
* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names

* fix(shell): use exact match for volume.balance -racks/-nodes filter

The old strings.Contains-based filter quietly included any id that was a
substring of the user-supplied flag value (e.g. -racks=rack10 also matched
rack1). Replace it with an exact-match set parsed from the comma-separated
flag value, and add regression tests for both -racks and -nodes paths.

Also fix a small typo in the "remote storage" error returned by
maybeMoveOneVolume.

* refactor(shell): drop nil sentinel in splitCSVSet, use len() in callers

* feat(shell): support batched EC encode and multi-volume selection

Add -volumeIds (comma-separated) and -batchSize flags to ec.encode.

When -batchSize > 0, volumes are processed in independent batches, each
committed separately: encode -> rebalance -> verify -> delete originals.
This bounds the working set and lets source volumes be reclaimed without
waiting for the entire set to finish, at the cost of per-batch rebalancing.
Because each batch deletes its originals, a failure in a later batch is
unrecoverable for already-completed batches.

To let the single-volume, multi-volume, and collection paths share one
per-batch routine, the re-balance scope is now always derived from the
volumes actually selected for encoding (collectCollectionsForVolumeIds),
rather than every collection matching the -collection regex. Practical
effect: with -collection, a collection that matches the pattern but
contributes no encodable volumes is no longer re-balanced as a side effect.
The -volumeId path is unchanged; -batchSize=0 (default) preserves the
original single-pass flow.

The per-batch routine reuses the existing assertEncodableRegularVolumes
guard, doEcEncode skipped-node handling, and verifyEcShardsBeforeDelete
retry loop. The capacity pre-flight check takes the already-fetched
topology instead of issuing another VolumeList to the master per batch.

Also clarify the -collection flag description to note it accepts a regex
pattern, matching the existing command help.

-volumeId and -volumeIds are mutually exclusive; ids in -volumeIds are
validated and de-duplicated.
2026-06-22 01:22:20 -07:00
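The id handling and batching described in the entry above (validate, de-duplicate, then split by -batchSize, with 0 meaning a single pass) could look roughly like this. `parseVolumeIds` and `batches` are hypothetical helpers, and skipping blank items is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVolumeIds validates a comma-separated id list and removes duplicates
// while keeping first-seen order.
func parseVolumeIds(csv string) ([]uint32, error) {
	seen := make(map[uint32]bool)
	var ids []uint32
	for _, field := range strings.Split(csv, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		n, err := strconv.ParseUint(field, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("invalid volume id %q: %w", field, err)
		}
		if id := uint32(n); !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids, nil
}

// batches splits ids into groups of at most size. A size of 0 or less keeps
// everything in one batch, matching the single-pass default.
func batches(ids []uint32, size int) [][]uint32 {
	if len(ids) == 0 {
		return nil
	}
	if size <= 0 {
		return [][]uint32{ids}
	}
	var out [][]uint32
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		out = append(out, ids[start:end])
	}
	return out
}

func main() {
	ids, err := parseVolumeIds("3,1,3, 2")
	fmt.Println(ids, err, batches(ids, 2))
}
```

Each inner slice would then go through its own encode, rebalance, verify, and delete cycle.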
msementsov 60ecdd7a2f Logs typos (#10018)
2026-06-19 09:09:01 -07:00
7y-9 bc827704d5 fix(shell): return fs.verify topology errors (#9982)
* fix(shell): return fs.verify topology errors

Problem: fs.verify can silently ignore master topology lookup failures before verifying files.

Root cause: commandFsVerify.Do returned parseErr after collectVolumeIds failed, but parseErr is nil once path parsing succeeds.

Fix: Return the actual collectVolumeIds error so VolumeList and master client failures stop the command.

Co-authored-by: Codex <noreply@openai.com>

* remove the tests

---------

Co-authored-by: Codex <noreply@openai.com>
2026-06-16 15:45:29 -07:00
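The root cause in the entry above is a common Go slip: returning an earlier error variable that is already known to be nil. A reduced reproduction follows, where `parsePath`, `collectVolumeIds`, `verifyBuggy`, and `verifyFixed` are hypothetical stand-ins for the real command code.

```go
package main

import (
	"errors"
	"fmt"
)

// errTopology simulates a master lookup failure.
var errTopology = errors.New("master unreachable")

// parsePath is a stand-in for argument parsing.
func parsePath(args []string) (string, error) {
	if len(args) == 0 {
		return "", errors.New("missing path")
	}
	return args[0], nil
}

// collectVolumeIds is a stand-in that always fails in this sketch.
func collectVolumeIds() ([]uint32, error) {
	return nil, errTopology
}

// verifyBuggy mirrors the reported bug: parseErr is nil once parsing has
// succeeded, so the topology failure is silently dropped.
func verifyBuggy(args []string) error {
	_, parseErr := parsePath(args)
	if parseErr != nil {
		return parseErr
	}
	if _, err := collectVolumeIds(); err != nil {
		return parseErr // always nil here
	}
	return nil
}

// verifyFixed returns the error that actually occurred.
func verifyFixed(args []string) error {
	if _, err := parsePath(args); err != nil {
		return err
	}
	if _, err := collectVolumeIds(); err != nil {
		return fmt.Errorf("collect volume ids: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(verifyBuggy([]string{"/"}), verifyFixed([]string{"/"}))
}
```

Scoping each error to its own `if` statement, as in `verifyFixed`, makes this class of mistake harder to write.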
Chris Lu b13463880c s3tables: scope management authorization to the caller's identity (#9961)
* s3tables: resolve account-less identities to a distinct principal

Static identities with no account block default to the shared admin
account, so getAccountID returned "admin" for every such user and the
permission checks treated them all as the admin principal. Only keep the
admin account when the identity actually carries an admin action;
otherwise fall back to the unique identity name.

* s3tables: limit the open-by-default fallback to anonymous access

The legacy permission path allowed any request that no policy explicitly
denied whenever default-allow was on, which is the zero-config default.
That let an authenticated identity without table permissions reach table
resources owned by others. Restrict the fallback to requests with no
identity or the anonymous identity; authenticated callers must pass an
explicit action or policy check. Zero-config and anonymous access are
unchanged.

* s3tables: drop the no-op ListTableBuckets account gate

The top-level check passed the principal as its own owner, so it always
allowed. Per-bucket filtering in the loop is the real authority; remove
the dead gate and the now-unused locals.

* s3tables: derive the Iceberg catalog's default-allow from auth state

The Iceberg catalog reuses the S3 Tables Manager, which hardcoded
default-allow on. Authenticated callers were enforced only because the
identity struct happens to propagate into the handler; if it were ever
dropped, a secured catalog would fall open. Mirror the S3 port and set
the Manager's default-allow from the authenticator, so an authenticated
caller is enforced regardless. Shell and admin keep their own trusted
Manager. Regression test covers the struct, name-only, and admin paths.

* s3tables: drop redundant ACTION_ADMIN string conversion

ACTION_ADMIN is an untyped string constant, so the conversion is a no-op.

* s3tables: enforce name-only authenticated callers, add trusted bypass

defaultAllowFor treated a request with no identity object as anonymous,
but the Manager path forwards only the identity name (not the struct).
A name-only authenticated caller could therefore be misclassified as
anonymous and allowed under the open default. Treat a server-set identity
name as authenticated too, and add an explicit trusted flag for the local
shell/admin tooling that legitimately bypasses authorization.

* s3tables: trim verbose comments
2026-06-14 13:55:36 -07:00
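The open-by-default rule from the entry above can be condensed into a sketch. defaultAllowFor is named in the log, but this signature, the `request` type, and the literal `"anonymous"` identity name are assumptions made for illustration.

```go
package main

import "fmt"

// anonymousName is an assumed identity name for unauthenticated callers.
const anonymousName = "anonymous"

// request is a hypothetical reduced view of an incoming call.
type request struct {
	IdentityName string // set by the server after authentication; "" if none
	Trusted      bool   // local shell/admin tooling that bypasses authorization
}

// defaultAllowFor decides whether the open-by-default fallback applies.
// Only trusted tooling and anonymous callers may use it; an authenticated
// caller, even one known by name only, must pass an explicit check.
func defaultAllowFor(r request, defaultAllow bool) bool {
	if r.Trusted {
		return true
	}
	if !defaultAllow {
		return false
	}
	return r.IdentityName == "" || r.IdentityName == anonymousName
}

func main() {
	fmt.Println(defaultAllowFor(request{}, true))
	fmt.Println(defaultAllowFor(request{IdentityName: "alice"}, true))
}
```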
Chris Lu 7e608c877a refactor(ec_balance): make the balance planner per-volume ratio-capable (#9960)
* refactor(ec_balance): make the balance planner per-volume ratio-capable

Thread a per-volume EC ratio through the balance planner: Plan resolves each
volume's data/parity from a new Options.VolumeRatio (falling back to the
collection Ratio, then the build default, when it reports 0), and keys the
global phase's ratio maps by volume instead of collection. The shell and
worker balance paths build the per-volume lookup from each shard's heartbeat
via the new ecbalancer.VolumeShardRatio.

In OSS this is behavior-preserving: VolumeShardRatio returns 0 because the
per-volume data_shards/parity_shards heartbeat fields are an enterprise
feature, so every volume falls back to the collection ratio -- the existing
standard-scheme behavior. The refactor keeps the shared planner in sync with
the enterprise fork, which overrides VolumeShardRatio to classify and spread
a mixed-ratio collection by each volume's own data/parity split.

* perf(ec_balance): hoist the collection ratio out of the per-volume loop

The collection ratio is constant for every volume in a collection, so
resolve it once per collection instead of per volume; a custom Ratio func
may do map lookups or locking. Addresses a review comment.
2026-06-14 11:33:31 -07:00
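The fallback chain in the entry above (per-volume ratio, then collection ratio, then build default when a lookup reports 0) can be sketched as follows. The `ratio` and `options` types and the `resolve` method are hypothetical, and the 10+4 default is an assumption about the standard scheme rather than something the log states.

```go
package main

import "fmt"

// ratio is a data/parity shard split. The zero value means "not reported".
type ratio struct{ Data, Parity int }

// buildDefault is an assumed default scheme for this sketch.
var buildDefault = ratio{Data: 10, Parity: 4}

// options loosely mirrors the planner options named in the commit message.
type options struct {
	VolumeRatio func(vid uint32) ratio        // may be nil
	Ratio       func(collection string) ratio // may be nil
}

// resolve picks the first non-zero ratio: per-volume, then collection, then
// the build default.
func (o options) resolve(vid uint32, collection string) ratio {
	if o.VolumeRatio != nil {
		if r := o.VolumeRatio(vid); r.Data > 0 {
			return r
		}
	}
	if o.Ratio != nil {
		if r := o.Ratio(collection); r.Data > 0 {
			return r
		}
	}
	return buildDefault
}

func main() {
	o := options{VolumeRatio: func(vid uint32) ratio {
		if vid == 7 {
			return ratio{Data: 6, Parity: 3}
		}
		return ratio{} // reports 0, so the caller falls back
	}}
	fmt.Println(o.resolve(7, "c"), o.resolve(8, "c"))
}
```

A caller looping over many volumes in one collection can resolve the collection ratio once outside the loop, which is the hoisting the second commit in the entry describes.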
Chris Lu 4fb3e22a01 fix(tiering): never delete a shared remote object while replicas still reference it (#9942)
* tiering: stop a shared remote object being deleted while replicas still point at it

A remote-tiered volume's .dat content lives only in one cloud object that all
N replica .vif files point at. Deleting that object while destroying any one
replica, or before a downloaded replica is durable, bricks the survivors.

- volume.tier.move cleanup now deletes old replicas with keepRemoteData=true so
  surviving replicas keep the shared object. Document why the alreadyPlaced
  anchor needs no replica sync (same-object replicas are byte-identical).
- VolumeTierMoveDatFromRemote now fsyncs the downloaded .dat, fsyncs the
  containing directory, trims the .vif (fsynced) and swaps to the local DiskFile
  BEFORE deleting the remote object, on both the keep-remote and delete paths.
  Only the final DeleteFile is gated by keep_remote_dat_file, so a keep-remote
  download leaves the replica served from local disk rather than the shared
  object, and a crash before delete merely leaks the object.
- volume.tier.download keeps the shared object for every replica except the
  last, which deletes it.
- s3 and rclone download paths fsync the .dat before close.

* storage: swap the volume data backend under the data lock

The tier-download swap closed v.DataBackend and assigned the new local DiskFile
without holding dataFileAccessLock, racing concurrent reads/writes (use of a
closed file / nil deref). Add an exported Volume.SwapDataBackend that performs
the close-and-replace under the lock, and call it from the tier download.

* server: skip directory fsync on Windows in the tier download path

os.Open(dir).Sync() is unsupported on Windows and returns an error, which would
fail VolumeTierMoveDatFromRemote entirely there. Skip the directory fsync on
Windows, matching how the storage-side helper tolerates the unsupported case.

* shell: make multi-replica tier.download resilient to already-local replicas

If a multi-replica download is interrupted and retried, a replica made local
in the prior attempt returns "already on local disk", which aborted the whole
command and left the remaining remote replicas dangling. Treat that case as a
skip-and-continue so a retry completes the rest.

* server: assert downloaded .dat content, not just length, in the tier test

A length-only check passes even if the bytes are corrupted; compare the full
content of the local .dat against the original.
2026-06-13 20:09:00 -07:00
Chris Lu c2591b4395 fix(replication): verify-before-destroy in VolumeCopy, check.disk, and over-replication trim (#9943)
* volume: verify before destroy in VolumeCopy and replication repair

Four data-safety fixes around copy/repair paths that could destroy or
resurrect data before verifying the source or survivors.

(a) VolumeCopy no longer deletes a pre-existing local replica up front.
The delete is deferred until ReadVolumeFileStatus on the source succeeds,
so a transient source outage (or a retry after one) can no longer wipe a
healthy destination replica. Gated on source readability only; size/count
comparisons are intentionally not used because they invert legitimately
after divergent vacuum/compaction. Mirrored in the Rust volume server.

(b) volume.check.disk no longer resurrects vacuumed-deleted needles. A
key present-and-live on the source but entirely absent on the target is
ambiguous: it may be a genuine missing write, or a needle deleted on the
target and then vacuumed (its index entry and any tombstone are gone). An
individual needle AppendAtNs has no monotonic relation to a vacuum
watermark, so the old cutoff heuristic could not tell them apart. Without
positive proof the absence is a missing write, the safe default is to NOT
push it back. Tradeoff: a real missing write may go unrepaired until a
tombstone-aware path exists, but deleted data is never resurrected.

(c) Over-replication trim no longer resurrects needles or removes the
wrong replica. The pre-delete sync now runs read-only (divergence check
only) instead of writing the doomed replica's needles into the survivor.
pickOneReplicaToDelete only ever removes the smallest of multiple healthy
writable replicas; it refuses the trim when doing so would leave only
read-only/integrity-flagged survivors, since file_count>0 alone cannot
prove the survivor's .dat is readable.

(d) Incomplete-volume (.note) cleanup keeps the shared .vif when an .ecx
for the same vid coexists on the disk, so removing an interrupted regular
copy cannot strip a coexisting EC volume's info file. VolumeCopy now
surfaces .note write/remove errors instead of ignoring them. In the Rust
volume server (where a persisting note is actually reachable) the .note
check moves below the empty-stub sweep and EC validation, keeps the .vif
on EC coexistence, and the mount path fails when a .note still persists.

* shell: scope the over-replication writable-survivor guard to the trim path only

The writable-survivor guard (never trim down to a read-only survivor) lived
inside the shared pickOneReplicaToDelete, so it also gated the misplaced-volume
relocation via pickOneMisplacedVolume -- a misplaced read-only volume (e.g. a
full one) would silently stop being rebalanced. Extract pickSmallestReplica
for the relocation path (which deletes-and-recreates and must act on read-only
replicas), and keep the writable-survivor guard only in pickOneReplicaToDelete
used by the over-replication trim.

* seaweed-volume: recompute keep_vif after invalid-EC cleanup in the .note path

keep_vif used the pre-validation ecx_exists snapshot, so when the EC-validation
step above removed the invalid .ecx/shards, the .note cleanup still preserved a
now-orphaned .vif. Re-check .ecx existence at cleanup time, matching the Go
hasEcxFile re-check.

* shell: keep placement when picking an over-replication victim to delete

The trim picked the smallest writable replica without regard to placement, so
it could delete the only replica in a required failure domain (e.g. with "100"
and replicas dc1 + two in dc2, deleting dc1 leaves both survivors in dc2).
Prefer a writable replica whose removal still satisfies placement, falling back
to the smallest writable only when none does.
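
Illustrative sketch only: the preference order above, reduced to data
centers. The replica type, pickVictim and minDCs are hypothetical, and the
writable-survivor guard from the earlier change is left out for brevity.

```go
package main

// replica is a hypothetical, simplified view of one volume replica.
type replica struct {
	dc       string
	size     uint64
	writable bool
}

// distinctDCsWithout counts the data centers still covered once the replica
// at index skip is removed.
func distinctDCsWithout(replicas []replica, skip int) int {
	seen := make(map[string]struct{})
	for i, r := range replicas {
		if i != skip {
			seen[r.dc] = struct{}{}
		}
	}
	return len(seen)
}

// pickVictim returns the index of the replica to trim, or -1 when no replica
// is writable. It prefers the smallest writable replica whose removal keeps
// at least minDCs data centers covered, and otherwise falls back to the
// smallest writable replica.
func pickVictim(replicas []replica, minDCs int) int {
	best, fallback := -1, -1
	for i, r := range replicas {
		if !r.writable {
			continue
		}
		if fallback < 0 || r.size < replicas[fallback].size {
			fallback = i
		}
		if distinctDCsWithout(replicas, i) < minDCs {
			continue
		}
		if best < 0 || r.size < replicas[best].size {
			best = i
		}
	}
	if best >= 0 {
		return best
	}
	return fallback
}
```

With replicas in dc1, dc2 and dc2 and minDCs = 2, the dc1 replica is never
chosen even when it is the smallest, because removing it would leave a
single data center.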
2026-06-13 20:05:33 -07:00
Chris Lu 3718301599 shell: stop ec.encode/ec.rebuild from destroying live EC shards (no crash needed) (#9939)
* shell: stop ec.encode/ec.rebuild from destroying live EC shards

Three operator-triggered shell paths could destroy data with no crash:

ec.encode -volumeId on an already-EC volume tore down its shards before
failing. The volume-id path never checked the id was a regular volume:
the collection lookup scans only VolumeInfos (so an EC-only id maps to
""), and volumeLocations succeeds via the EC-location fallback, so
clearPreexistingEcShards full-teardown-deleted every shard cluster-wide
before doEcEncode failed. An EC volume has no .dat, so this is its only
copy. Add assertEncodableRegularVolumes: each requested id must be a
regular volume in the topology snapshot; an EC-only or unknown id is
refused before any teardown. A volume present as both a regular .dat and
stale orphan shards (a failed-encode retry) still passes. This closes
the operator-rerun/script-retry path; a worker racing the snapshot is a
fencing problem handled separately.

ec.rebuild dry-run (the default, without -apply) still issued real
VolumeEcShardsDelete RPCs: prepareDataToRecover appended every
would-copy shard to copiedShardIds even though the copy was skipped, and
the cleanup defer deleted that set unconditionally. Now a dry-run copies
nothing and records nothing to delete (a separate would-copy counter
drives the recoverability check so the dry-run still reports its plan),
and the cleanup runs only under -apply.

ec.rebuild could also self-destruct a live shard: localShardsInfo was
overwritten per disk instead of unioned, so a shard the rebuilder holds
on a non-last disk looked remote, got copied onto itself (in-place
O_TRUNC) and then node-wide deleted. Union local shards across all
disks, and never copy/delete a shard whose only listed holder is the
rebuilder itself.
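
Illustrative sketch only: the overwrite-versus-union difference, assuming a
per-disk bitmask with one bit per shard id. The shardBits type and the
function names are hypothetical.

```go
package main

// shardBits is a hypothetical bitmask with one bit per EC shard id.
type shardBits uint32

// has reports whether shard id is present in the mask.
func (b shardBits) has(id uint) bool { return b&(1<<id) != 0 }

// unionLocalShards ORs together the shards held on every disk of one node.
// Assigning instead of OR-ing keeps only the last disk's shards, which is the
// failure mode described above.
func unionLocalShards(perDisk []shardBits) shardBits {
	var all shardBits
	for _, bits := range perDisk {
		all |= bits
	}
	return all
}
```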

* shell: address ec destructive-guards review comments

- countLocalShards: union shards across all of the rebuilder's disks so
  slot accounting matches what prepareDataToRecover treats as local;
  first-match counting overstated slotsNeeded on multi-disk rebuilders
- VolumeEcShardsCopy: resolve SourceDataNode via
  pb.NewServerAddressFromDataNode instead of the raw node id, which may
  not be a dialable host:port
- assertEncodableRegularVolumes: skip nil DiskInfo map entries, matching
  the other topology walks in this file; rename ecOnly to hasEcShards
  since the map marks any volume with shards, not only shard-only ones
2026-06-12 22:30:17 -07:00
Chris Lu 34f9b91d69 fix(storage): never let an empty .dat delete healthy distributed EC shards (#9930)
* fix(storage): never let an empty .dat delete healthy distributed EC shards

A leftover empty .dat stub (a phantom from the pre-fix loader; zero
needles) next to a distributed EC volume's local shards made startup
classify the volume as an interrupted local encode: validateEcVolume
requires >= dataShards local shards when a .dat is present, fails with
the 1-2 shards a distributed volume keeps per disk, and the cleanup
deletes those shards -- the only copies of that part of the volume.
Repeated across restart waves this destroys enough shards cluster-wide
to make the volume unrecoverable.

Go:
- loadExistingVolume: hoist the empty-stub sweep above the EC presence
  checks. Previously the .vif-next-to-.ecx guard returned before the
  sweep ever ran, so exactly the dangerous layout (stub + .ecx + local
  shards) kept its stub and then lost its shards in loadAllEcShards.
- validateEcVolume / checkDatFileExists: treat a .dat <= a superblock
  (zero needles) as absent. An empty .dat cannot be the encode source,
  so it must never gate shard deletion; this also covers stubs without
  a .vif, which the sweep cannot prove are EC leftovers.

Rust mirror (seaweed-volume): the same gate in validate_ec_volume and
check_dat_file_exists (the Rust sweep already ran before validation);
the volume-load skip keeps a plain existence check so fresh,
needle-less volumes still load.

Regression tests in Go and Rust reproduce the production layout (a
zero-byte .dat beside .ecx/.ecj and two shards of a 10+4 volume, with
and without a .vif) and fail without the fix with the shards deleted.
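
Illustrative sketch only: the "empty .dat counts as absent" rule as a pure
predicate. The superBlockSize value of 8 bytes is an assumption made for the
example, and datCountsAsPresent is a hypothetical name.

```go
package main

// superBlockSize is an assumed header size for this sketch; the real constant
// lives in the storage code.
const superBlockSize = 8

// datCountsAsPresent reports whether a .dat should gate EC shard validation.
// A missing file, or one no larger than a bare superblock (zero needles), is
// treated as absent.
func datCountsAsPresent(exists bool, size int64) bool {
	return exists && size > superBlockSize
}
```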

* fix(ec): gate source volume deletion on a recoverable shard set

After EC encode, the shell command and the (plugin) worker task refused
to delete the source volume unless every shard was present, and aborted
otherwise -- leaving the source .dat next to live shards, exactly the
mixed state the startup cleanup mishandles.

Replace the full-set requirement with a recoverability gate shared by
both callers (RequireRecoverableShardSet): deleting a non-empty source
.dat requires at least dataShards distinct shards cluster-wide. Below
that the source is kept and the encode fails as before. A degraded but
recoverable set (>= dataShards, < total) now proceeds with a warning
instead of aborting: the missing shards can be rebuilt from the
survivors, while keeping the source would preserve the dangerous mixed
state. Empty stub replicas are still swept unguarded (OnlyEmpty) -- an
empty .dat has nothing to lose.

dataShards/totalShards stay parameters so enterprise custom EC ratios
share the helper verbatim.
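
Illustrative sketch only: the recoverability gate as described above. The
function name and signature are hypothetical and do not reproduce
RequireRecoverableShardSet.

```go
package main

import "fmt"

// requireRecoverable counts distinct, in-range shard ids seen cluster-wide.
// It returns an error when fewer than dataShards are present (the source must
// be kept), and degraded=true when the set is recoverable but incomplete.
func requireRecoverable(shardIDs []int, dataShards, totalShards int) (degraded bool, err error) {
	seen := make(map[int]struct{})
	for _, id := range shardIDs {
		if id < 0 || id >= totalShards {
			continue // ignore ids outside the configured ratio
		}
		seen[id] = struct{}{}
	}
	if len(seen) < dataShards {
		return false, fmt.Errorf("only %d distinct shards, need %d", len(seen), dataShards)
	}
	return len(seen) < totalShards, nil
}
```

For a 10+4 volume, ten distinct shards pass with degraded=true, nine fail,
and duplicates of the same shard id on several nodes count once.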

* test(ec): use recoverable shard verification gate
2026-06-11 20:26:20 -07:00
Chris Lu 42030381ae shell: volume.tier.move can move volumes between data centers (#9925)
* shell: volume.tier.move can move volumes between data centers

-fromDataCenter scopes volume selection to volumes with a replica in
that data center. -toDataCenter constrains move destinations and
replication fulfillment. With identical disk types both flags are
required, moving full volumes between data centers on the same tier.

* shell: assert node identity in data center filter test

* shell: tier move resumes when the volume is already on the target

A replica already on the target tier and data center, typically left by
an interrupted earlier run, anchors the move: skip the copy and only
complete replication fulfillment and old replica cleanup. Previously
such volumes hit the no-destination path and the stale source replicas
were never removed.
2026-06-11 10:46:34 -07:00
Chris Lu 79ac279fe1 fix(ec): don't mix EC shards from different encode runs (#9880)
* feat(ec): add encode_ts_ns to EC shard metadata and the shard read RPC

EcShardConfig and VolumeEcShardReadRequest gain an int64 encode_ts_ns
(encode time in unix nanos). It rides in .vif and the read request so a
read can be scoped to the encode run that produced the index.

* fix(ec): stamp each encode and reject cross-run shard reads

Generate stamps EncodeTsNs into the volume's .vif. Reads carry it to the
shard's owning volume (resolved together via FindEcVolumeWithShard, so a
multi-disk server validates the disk that actually serves the bytes) and
reject a shard from a different encode run, recovering from parity. A
zero on either side (pre-upgrade volume) skips the guard.

* fix(ec): stamp the encode identity on the worker-generated .vif

The worker-local encode path now writes EncodeTsNs (and the resolved EC
ratio) into the .vif, so the read guard is not silently off for volumes
encoded by the maintenance worker.

* fix(ec): wipe stale EC artifacts before re-encoding

VolumeEcShardsGenerate evicts any in-memory EcVolume for the volume and
removes its on-disk shard/index/sidecar files before writing fresh ones,
so a retried encode never builds on a partial prior run and the unlink
frees the inodes instead of leaving open fds serving old bytes.

* fix(ec): unmount EC shards across all disks

UnmountEcShards walked only the first disk holding the shard, leaving a
duplicate copy mounted on a sibling disk (split-disk reconciled volumes)
still serving and heartbeating. Traverse every disk and emit one
deletion delta per disk.

* fix(ec): delete orphan shards without a local .ecx

deleteEcShardIdsForEachLocation gated shard-file removal on a local .ecx,
so it could not clean an orphan .ecNN left by a failed copy on a disk
with no index. Delete the requested shard files unconditionally; the
index-file (.ecx/.ecj/.vif) routing stays gated as before.

* fix(ec): clear stale EC shards cluster-wide before re-encoding

ec.encode unmounts and deletes EC shards for the target volumes on every
node before regenerating: fatal for the shards the topology reports
(mounted leftovers), best-effort for the rest (a sweep that catches
unmounted failed-copy orphans). A down node is a no-op.

* fix(ec): don't nil EC fds on close so reads can't race eviction

A reader resolves an EcVolume/shard under the lock then reads after it is
released, so an eviction that nils ecxFile/ecdFile would race that read
and panic. Close the fds without nilling the fields: the field is now
write-once (no data race) and a concurrent read hits a closed fd, getting
a clean error that the caller recovers from parity.

* fix(ec): wipe stale EC artifacts on every disk and surface failures

The pre-encode wipe only deleted beside the source volume, so a stale
shard on a sibling disk survived and could be mounted against the new
index at reconcile. Sweep every disk. Removal also ignored os.Remove
errors, reporting a failed cleanup as success and letting a stale shard
join the next generation; surface the first real failure (treating
already-gone as success) from removeStaleEcArtifacts and the shard delete.

* fix(ec): log when a local shard is skipped for a different encode run

The cross-run guard returned errShardNotLocal, indistinguishable in logs
from a genuinely-absent shard. Add a V(1) line naming both EncodeTsNs so
operators can tell "wrong encode generation" from "shard not here".

* fix(ec): surface metadata removal failures in the shard delete path

deleteEcShardIdsForEachLocation still dropped os.Remove errors on the
.ecx/.ecj/.vif/sidecar cleanup. A surviving stale .ecx is the orphan-index
condition this path prevents, so route those through removeFileIfExists and
return the first real failure instead of reporting cleanup as success.
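
Illustrative sketch only: "already gone is success, anything else is a real
failure". The helper names removeIfExists and removeAll are hypothetical and
are not the committed removeFileIfExists.

```go
package main

import (
	"errors"
	"io/fs"
	"os"
)

// removeIfExists deletes path and treats a file that is already gone as
// success, so only genuine failures (permission, IO) are reported.
func removeIfExists(path string) error {
	err := os.Remove(path)
	if err == nil || errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

// removeAll tries every path and returns the first real failure, if any.
func removeAll(paths []string) error {
	var first error
	for _, p := range paths {
		if err := removeIfExists(p); err != nil && first == nil {
			first = err
		}
	}
	return first
}
```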

* fix(ec): fail orphan cleanup when a reachable node's delete fails

The pre-encode orphan sweep swallowed every error for unreported (node,
volume) pairs. That is only safe for an unreachable node, which cannot
receive this encode's new generation. A reachable node whose delete
genuinely failed (permission/IO) keeps an orphan shard that a later copy
re-stamps with the new run's volume-level .vif identity, so the read guard
would accept stale data. Surface those; stay best-effort only for
unreachable nodes (gRPC Unavailable / no status).

* fix(ec): guard ecjFile under its lock in the EC delete path

EcVolume.Close nils ecjFile under ecjFileAccessLock; a delete that resolved
its .ecx lookup before a concurrent eviction (the generate-time
UnloadEcVolume) could then reach the journal append with a nil fd. Bail
with a clear "volume closed" error under the lock instead.

* fix(ec): reject an unstamped shard when the caller has an encode identity

The read guard required both identities nonzero, so a current (stamped)
caller accepted a holder with identity 0 and could be served a stale
pre-upgrade shard. Reject when the caller is stamped and the holder
differs (including unstamped); stay lenient only when the caller itself
has no identity (pre-upgrade reader). A skipped shard recovers from parity.
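
Illustrative sketch only: the acceptance rule above as a truth function over
the two encode_ts_ns values, with 0 meaning unstamped. The name acceptShard
is hypothetical.

```go
package main

// acceptShard decides whether a reader may use a shard, given the caller's
// and the holder's encode_ts_ns (0 means unstamped).
func acceptShard(callerTs, holderTs int64) bool {
	if callerTs == 0 {
		return true // pre-upgrade reader: stay lenient
	}
	// A stamped caller rejects any different holder, including an unstamped one.
	return callerTs == holderTs
}
```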

* fix(ec): full-teardown delete so cluster cleanup wipes a whole generation

The pre-encode cluster sweep deleted only the listed canonical shards on
remote nodes, leaving index/sidecar (and, on builds with versioned
generations, those too) behind. Add a full_teardown flag to
VolumeEcShardsDelete that evicts the volume and wipes every EC artifact for
it on every disk via removeStaleEcArtifacts; the shell and worker pre-encode
cleanup paths set it. Other delete callers (balance/decode/repair) are
unchanged.

* fix(ec): take ecjFileAccessLock before the nil-check in Sync and Close

Sync and Close read ev.ecjFile before acquiring ecjFileAccessLock while
Close nils it under the lock, a data race on the field. Take the lock
first, then nil-check inside, in both.

* fix(ec): acknowledge full_teardown so a pre-upgrade server can't fake success

An old volume server silently ignores full_teardown and returns success
for an ordinary delete, so the caller wrongly believes the generation was
wiped and copies a fresh gen-0 onto an unwiped node. Echo full_teardown_done
in the response; the worker destination cleanup fails when it is absent, and
the shell cluster sweep fails for a reported (mounted) leftover while staying
best-effort for an unreported node. encode_ts_ns stays an accepted transient
(an old server just skips the new read guard, no regression).

* fix(ec): fail the pre-encode sweep for any reachable node that can't ack teardown

A reachable pre-upgrade server ignores full_teardown and returns success
without wiping an orphan, which a later copy then folds into the new
generation. Treat a missing full_teardown_done ack as fatal for every
reachable node (best-effort only for a gRPC-unreachable one), not just for
topology-reported pairs.

* fix(ec): return the served shard identity and validate it client-side

The encode identity was only enforced server-side, so a pre-upgrade server
ignored the request field and served bytes unchecked. Echo the served
shard's EncodeTsNs on every read response chunk and have the client reject a
mismatch (including 0 from an old server), so the guard holds regardless of
server version; a rejected read recovers from parity.

* fix(ec): reject a short/empty remote shard read instead of serving zeros

doReadRemoteEcShardInterval accepted an immediate EOF or a short stream and
returned success with a partly zero-filled, unvalidated buffer (the server
stamps the identity only on chunks that carry bytes). A non-deleted interval
must arrive whole: require n == len(buf), exempting the is_deleted
short-circuit (n=0), matching readLocalEcShardInterval's local check. A short
read now fails so the caller recovers from parity.
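
Illustrative sketch only: the n == len(buf) requirement, modelling the
stream as a slice of chunks. assembleInterval is a hypothetical name, and the
is_deleted short-circuit is left out.

```go
package main

import "fmt"

// assembleInterval copies streamed chunks into buf and fails unless exactly
// len(buf) bytes arrived, so a short or empty stream is never returned as a
// partly zero-filled buffer.
func assembleInterval(buf []byte, chunks [][]byte) error {
	n := 0
	for _, c := range chunks {
		if n+len(c) > len(buf) {
			return fmt.Errorf("shard read overflow: more than %d bytes", len(buf))
		}
		n += copy(buf[n:], c)
	}
	if n != len(buf) {
		return fmt.Errorf("short shard read: got %d of %d bytes", n, len(buf))
	}
	return nil
}
```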

* test(ec): fake volume server echoes the full_teardown acknowledgement

The worker now fails a teardown delete that isn't acknowledged (so a
pre-upgrade server can't silently skip the wipe). The fake server's no-op
VolumeEcShardsDelete returned an empty response, which the worker read as a
skipped teardown and aborted the encode. Echo full_teardown_done.

* feat(ec): mirror the encode-run identity guard + full_teardown into the Rust volume server

The Go volume server stamps an encode-run identity (encode_ts_ns) into the .vif
and rejects a read served from a shard of a different run; full_teardown wipes a
whole generation and acknowledges it. The Rust volume server had none of it.
Mirror the shared logic: load encode_ts_ns from the .vif onto the EcVolume,
stamp it on every read response, and reject a request/response mismatch on both
the server and the distributed-read client (recovering from parity); handle
full_teardown by evicting the volume and wiping every EC artifact on each disk,
echoing full_teardown_done so the caller can detect a server that ignored it.

* fix(ec): remove a stale .vif on full teardown of a shard-only node

A shard copy installs shards + .ecx before .vif, so an interrupted copy after a
teardown could mount the new files under the previous run's identity / version /
shard ratio / dat_file_size carried by the surviving .vif. Remove .vif during
full teardown, gated on .idx absence so a source-volume holder keeps its live
.vif. In Rust this lives in a teardown-only helper so the reconcile / load-
fallback paths (which share the base removal) still preserve .vif.

* fix(ec): treat a missing teardown ack as fatal, not as an unreachable node

isNodeUnreachable returned true for any non-gRPC-status error, so a reachable
pre-upgrade server's missing full_teardown_done ack (a plain error) was
classified unreachable and the unreported pair was silently skipped. Classify
only a real codes.Unavailable as unreachable, and wrap the missing ack in a
sentinel the sweep treats as fatal regardless. A genuinely down node still
surfaces as Unavailable from the RPC and stays best-effort.

* fix(ec): reject a short shard read in the local EC needle reader

read_ec_shard_needle ignored the byte count from shard.read_at and appended the
whole pre-sized buffer, so a truncated shard's zero-filled tail passed the later
length check and parsed as garbage. Require n == buf.len() per interval, erroring
on a short read like the local interval reader already does.

* fix(ec): probe reachability before skipping a node that returns Unavailable

The pre-encode sweep skipped any node whose teardown delete returned
codes.Unavailable, but a reachable volume server in maintenance mode also
returns that code for the maintenance-gated delete, so its stale EC files were
left behind on a node that can still receive the new generation. Confirm with a
non-maintenance-gated empty-target Ping: skip only when the node fails the probe
too (genuinely unreachable).

* fix(ec): use try_exists for the teardown .vif .idx guard

The teardown-only .vif removal gated on Path::exists(), which returns false on a
permission/IO stat error, so a stat failure on a present .idx would read as a
shard-only node and delete the live source volume's .vif. Gate on
try_exists() == Ok(false) instead, preserving the sidecar on any stat error.

* fix(ec): only skip a sweep node when a Ping confirms it is transport-down

The pre-encode sweep skipped a node whenever its teardown delete and a liveness
Ping both failed, but it treated ANY Ping error as down — an application-level
Internal/ResourceExhausted, or Unimplemented from a pre-Ping server, left a
reachable node's stale generation in place. Classify the Ping tri-state and skip
only when it transport-fails with codes.Unavailable; a reachable or inconclusive
node stays fatal.

* fix(ec): exclude sweep-skipped nodes from the encode's rebalance

The pre-encode sweep skips a genuinely-down node best-effort, but the rebalance
then recollected the current topology — a node that recovered between the two
could become a copy target and receive the new generation while still holding
its stale, never-cleared shards. Have the sweep return the skipped set and
exclude those nodes from the rebalance for this encode, so a node we could not
clean cannot receive the new generation. Standalone ec.balance is unaffected.

* fix(ec): re-sweep recovered nodes before generation so they aren't stranded

A node skipped as down by the pre-encode sweep is excluded from the rebalance,
but it can recover and become the generation host — mounting all shards locally,
then being excluded from distribution. Union-only verification accepts all
shards on one node and deletes the originals: a single point of failure. Re-sweep
the skipped nodes just before generation; one whose teardown now succeeds leaves
the skipped set and rebalances normally, while a node still down stays skipped.

* fix(ec): abort the encode if a selected source is still skipped after re-sweep

The re-sweep un-skips a recovered node, but the source was selected before it and
a node can stay down through the re-sweep then recover just in time to be the
generation host — mounting all shards locally while still excluded from the
rebalance, which union-only verification accepts before deleting the originals.
Abort the encode when a selected source remains skipped after the re-sweep.

* fix(ec): batch delete returns retriable 503 when a volume became EC mid-batch

If a volume is not EC at the batch-delete classification but is encoded to EC and
its .dat deleted before the regular-volume mutation, the mutation returns an exact
"not found" that the filer chunk-GC treats as completed, dropping the delete.
Recheck EC presence under the mutation lock and return a retriable 503 with the
"try again" token so the filer requeues it onto the EC path.

* fix(ec): recheck EC state before the regular batch-delete mutation

ec.encode mounts EC shards (copied from the .dat) before deleting the originals,
so a volume can be EC while its .dat still exists. The batch delete only rechecked
EC after a NotFound, so a successful regular-volume delete in that window wrote a
tombstone to the soon-removed .dat — the delete was lost and the needle resurrected
from the pre-tombstone shards. Recheck has_ec_volume under the write lock before
delete_volume_needle and return a retriable 503 so the filer requeues onto the EC path.

* fix(volume): make the metrics push test independent of test order

test_push_metrics_once asserted the pushed body contains the request-counter
family without ever touching the counter — a CounterVec with no children emits
nothing, so the assertion only held when another test had already created a
labelset in the shared registry. Create one in the test itself.
2026-06-10 22:31:18 -07:00
Lisandro Pin 6b4d20a6f3 volume.scrub and ec.scrub shell commands: make the display of scrub details optional. (#9911)
On volumes failing scrubs, the detail output can get very verbose, which makes
reading results difficult. Most users won't care about this information to
begin with - just whether or not volumes pass scrub tests.

This MR gates the display of scrub result details behind a `--details` flag.
2026-06-10 13:29:07 -07:00
7y-9 d569dd686f fix(shell): move files into existing destination directories (#9887)
* fix(shell): move files into existing destination directories

Problem: fs.mv /src/file /dst/dir treats an existing destination directory as a destination file path, so it renames the source to /dst/dir instead of moving it into /dst/dir/file.

Root cause: commandFsMv builds the destination LookupDirectoryEntryRequest with Directory and Name swapped, so the destination directory lookup misses.

Fix: Populate LookupDirectoryEntryRequest with Directory=destinationDir and Name=destinationName before deciding whether the destination is a directory.

Reproduction: env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1

Validation: gofmt -w weed/shell/command_fs_mv.go weed/shell/command_fs_mv_test.go; git diff --check; git diff --cached --check; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -run TestFsMvMovesIntoExistingDestinationDirectory -count=1; env GOCACHE=/private/tmp/seaweedfs-go-cache go test ./weed/shell -count=1

* Update weed/shell/command_fs_mv_test.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-06-08 23:42:13 -07:00
Chris Lu 8cc10460b4 fix(remote): correct content and permissions when syncing/caching remote objects (#9879)
* fix(remote): reject short reads when caching remote objects

A short read from the remote (stale listing size, truncated or flaky
response) was silently zero-padded: the S3 and Azure clients pre-size
the buffer and discard the downloaded byte count, and the chunk is
recorded with the requested size. The cached file then matched the
expected size but its tail was zero bytes, and the entry was marked cached
so it never re-fetched.

Check the byte count against the requested size in both clients, and
add a backend-agnostic guard in FetchAndWriteNeedle. The cache now
fails loudly and the entry stays remote-only for a later retry.

* fix(remote): match S3 default modes when syncing remote metadata

Remote object listings carry no POSIX mode, so synced entries were
created with a hardcoded 0644. Against a SeaweedFS remote, whose S3
layer writes objects as 0660 and auto-creates directories as 0771
(0660|0111), the mounted copy ended up 0644/0755 and the permissions
visibly diverged from the source.

Default to the S3 modes instead (files 0660, directories 0771). The
filer derives parent-dir modes from the child as fileMode|0111, so
fixing the file default also brings the directories into line.
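
Illustrative sketch only: the fileMode|0111 derivation quoted above, showing
why the file default decides the directory mode. parentDirMode is a
hypothetical name.

```go
package main

import "os"

// parentDirMode derives a directory mode from a child's file mode by adding
// the execute (search) bits for user, group and other.
func parentDirMode(fileMode os.FileMode) os.FileMode {
	return fileMode | 0111
}
```

A 0660 file default gives 0771 directories, while the old 0644 default gave
0755.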

Directory mtimes still reflect sync time: S3 listings don't enumerate
directories, so the remote's directory timestamps aren't available.
2026-06-08 13:55:53 -07:00
Chris Lu f0d2a0d417 Treat co-located volume servers as one fault domain when balancing and allocating (#9854)
* admin/topology: carry the volume server address on DiskInfo

The planning DiskInfo exposed only the node id, which can be an opaque label rather than ip:port. Record the address too so callers can resolve the physical machine a disk sits on.

* ec.balance: spread a volume's shards across machines, not just nodes

Volume servers sharing a host are one fault domain, but the within-rack spread treated them as independent nodes, so one box could end up holding more shards of a volume than EC can afford to lose. Add a machine (host) tier between rack and node: the within-rack pass spreads each volume across machines, and the global load phase no longer re-concentrates a volume onto a machine it already sits on. Host defaults to the node id, so clusters with one server per host are unchanged.

* ec placement: prefer machines holding fewer of a volume's shards

EC allocation and repair picked the least-loaded node in a rack with no regard for which physical machine it sits on, so a volume's shards could pile onto several servers of one box. Rank candidate nodes by their machine's shard count first, then the node's own. The machine is derived from the volume server address carried on DiskInfo, falling back to the node id, matching how the balancer resolves it.

* volume.balance: don't move a replica onto a machine already holding one

isGoodMove only rejected a move onto the same data node, so two replicas could land on two volume servers of one box and a single machine failure would lose both. Reject a target whose host already holds another replica of the volume. Best-effort: balancing simply skips and tries the next target.

* volume allocation: spread same-rack replicas across machines

PickNodesByWeight filled the same-rack replica picks by weight alone, so replicas could co-locate on one box. Prefer candidates on not-yet-used hosts, falling back when too few distinct machines exist. Data-center and rack tiers have no host, so their ordering is unchanged.

* ec.balance: harden machine spread against re-concentration and capped machines

Two cases where the machine-aware spread could still leave a volume badly placed:

- The global load phase could move a shard of a volume onto a machine that
  already held it, raising that machine's count and undoing the within-rack
  spread (a 4/4/3/3 layout could become 3/5/3/3, past parity for 10+4). Limit
  the load-only fallback to same-machine moves, which leave a machine's count
  unchanged; cross-machine concentration is no longer allowed for load alone.

- The within-rack spread chose a destination machine by free slots alone, so if
  that machine's only nodes were already at the SameRackCount cap it skipped the
  move instead of trying another machine. Require a machine to have a node that
  can actually take the shard before selecting it.

* reduce comments across the machine-affinity change

Trim narration down to the non-obvious why; one terse line where a block was overkill.

* ec.balance: gate machine spread on fault-tolerance feasibility

Spreading a volume evenly across machines only helps when there are enough that
each can stay within EC's parity tolerance (numMachines >= ceil(total/parity)).
With fewer -- or wildly unequal -- machines it can't make a machine loss
survivable anyway, and forcing it fights capacity: e.g. a cluster of 12 volume
servers on one host and 2 on another would have half of every volume crammed onto
the 2-server box. So spread across machines only when it's achievable; otherwise
fall back to per-node spread and let capacity/global balancing decide.

The global load phase applies the same test: it protects a volume's machine spread
(no cross-machine move that raises a machine's count past the source's) only where
that spread is achievable, so heterogeneous clusters still level by fullness.
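
Illustrative sketch only: the feasibility test numMachines >=
ceil(total/parity) from above, with integer ceiling division. The function
name is hypothetical.

```go
package main

// machineSpreadFeasible reports whether shards can be spread so that no
// machine holds more than parityShards of them.
func machineSpreadFeasible(numMachines, shards, parityShards int) bool {
	if parityShards <= 0 {
		return false
	}
	need := (shards + parityShards - 1) / parityShards // ceil(shards/parity)
	return numMachines >= need
}
```

For a 10+4 volume, 14 shards need at least 4 machines; 3 machines cannot
keep every machine within parity.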

* ec.balance worker: group servers by host when planning

The worker built its planner topology without recording each server's host, so
automated ec.balance treated ports on one machine as independent nodes and could
concentrate a volume's shards on one physical box. Set the host from the volume
server address, matching the shell path.

* volume.balance worker: don't move a replica onto a machine holding one

The worker compared only node ids, and the replica map dropped the server address,
so it could move replicas onto different ports of one machine. Carry the host on
ReplicaLocation (from the server address) and reject a target whose host already
holds another replica of the volume. Best-effort, matching the shell.

* ec.balance: judge machine-spread feasibility by the rack's shards

The within-rack and global feasibility checks compared the whole volume's shard
count against a rack's machine count, so a rack holding only part of a volume after
cross-rack spreading -- e.g. 7 of a 10+4 volume across 2 machines -- was wrongly
judged infeasible and fell back to node spread, which could pile 6 shards onto one
host, past parity. Gate on the rack's own shard count of the volume instead.
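
The count test described above (numMachines >= ceil(total/parity), judged against the rack's own shards of the volume) can be sketched as follows. This is a minimal illustration, not the balancer's code: `machineSpreadFeasible` and `ceilDiv` are hypothetical names, and the sketch ignores the "wildly unequal machines" consideration.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for non-negative a and positive b.
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

// machineSpreadFeasible reports whether an even spread over numMachines can
// keep every machine at or below parityShards shards of the volume.
// shards is the number of the volume's shards held by the rack being judged.
// Hypothetical helper for illustration only.
func machineSpreadFeasible(numMachines, shards, parityShards int) bool {
	if numMachines <= 0 || parityShards <= 0 {
		return false
	}
	return numMachines >= ceilDiv(shards, parityShards)
}

func main() {
	// A whole 10+4 volume on 2 machines: needs ceil(14/4) = 4 machines.
	fmt.Println(machineSpreadFeasible(2, 14, 4)) // false
	// Only 7 shards of that volume in a rack with 2 machines: ceil(7/4) = 2.
	fmt.Println(machineSpreadFeasible(2, 7, 4)) // true
}
```

The second call is the case from the message: comparing the whole volume's 14 shards against the rack's 2 machines reports infeasible, while comparing the rack's own 7 shards does not.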

* ec.balance: spread a volume's shards across machines by combined count

EC recovers from any loss within parity regardless of shard type, so what bounds a
machine's exposure is its total shards of the volume, not data and parity
separately. Spreading the two independently let each type's remainder land on the
same machine -- ceil(d/M)+ceil(p/M) can exceed ceil(total/M), e.g. a 5/3 split where
4/4 was achievable, past parity. Balance the combined count in one pass; disk-level
data/parity anti-affinity stays in pickBestDiskOnNode.
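
A small worked example of the arithmetic above. The 5 data + 3 parity shards on 2 machines are assumed numbers chosen to reproduce the "5/3 where 4/4 was achievable" case; the function names are hypothetical.

```go
package main

import "fmt"

// ceilDiv returns ceil(a/b) for non-negative a and positive b.
func ceilDiv(a, b int) int {
	return (a + b - 1) / b
}

// worstSeparate is the largest per-machine total that can result when data
// and parity shards are each spread evenly, but independently of each other,
// so both remainders may land on the same machine.
func worstSeparate(data, parity, machines int) int {
	return ceilDiv(data, machines) + ceilDiv(parity, machines)
}

// worstCombined is the largest per-machine total when the combined shard
// count is spread evenly in one pass.
func worstCombined(data, parity, machines int) int {
	return ceilDiv(data+parity, machines)
}

func main() {
	// Assumed layout: 8 shards of one volume (5 data, 3 parity) on 2 machines.
	fmt.Println(worstSeparate(5, 3, 2)) // 5: ceil(5/2) + ceil(3/2) = 3 + 2
	fmt.Println(worstCombined(5, 3, 2)) // 4: ceil(8/2)
}
```

With 4 parity shards, the separate spread can leave 5 shards on one machine (past parity), while the combined spread caps each machine at 4.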

* ec.balance: don't let the imbalance threshold skip an over-parity machine

The within-rack spread gated on relative skew ((max-min)/avg > threshold), so a
worker threshold of 0.5 skipped an exactly-50%-skewed layout like 5/4/3 for a 10+4
volume, leaving 5 shards -- past parity -- on one machine. The even cap
(ceil(shards/groups)) is the real bound and the move loop already sheds only what
exceeds it, so drop the threshold gate from the within-rack phase (machine and node):
a balanced rack stays a no-op while any over-cap machine is always fixed.
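
The difference between the two gates can be seen on the 5/4/3 layout from the message. The sketch below is illustrative only; `skewExceeds` and `overEvenCap` are hypothetical names, not the balancer's functions.

```go
package main

import "fmt"

// skewExceeds applies the relative-skew gate: (max-min)/avg > threshold.
func skewExceeds(counts []int, threshold float64) bool {
	if len(counts) == 0 {
		return false
	}
	lo, hi, sum := counts[0], counts[0], 0
	for _, c := range counts {
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
		sum += c
	}
	if sum == 0 {
		return false
	}
	avg := float64(sum) / float64(len(counts))
	return float64(hi-lo)/avg > threshold
}

// overEvenCap returns the indexes of groups holding more than the even cap
// ceil(total/groups).
func overEvenCap(counts []int) []int {
	if len(counts) == 0 {
		return nil
	}
	sum := 0
	for _, c := range counts {
		sum += c
	}
	limit := (sum + len(counts) - 1) / len(counts)
	var over []int
	for i, c := range counts {
		if c > limit {
			over = append(over, i)
		}
	}
	return over
}

func main() {
	layout := []int{5, 4, 3} // shards of one volume per machine
	fmt.Println(skewExceeds(layout, 0.5)) // false: (5-3)/4 is exactly 0.5
	fmt.Println(overEvenCap(layout))      // [0]: machine 0 holds 5 > ceil(12/3)
}
```

The skew gate treats the layout as balanced enough, while the even cap still flags the machine holding 5 shards.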

* ec.balance: keep the imbalance threshold for the node fallback

Dropping the threshold from the whole within-rack phase made the node fallback too
eager: it runs only when machine fault tolerance is unachievable, so it is cosmetic
load distribution that should defer to the global utilization phase. Without the
gate it would, for a one-server-per-host 6/4 split at threshold 0.5, schedule a count
move that worsens utilization balance. Restore the threshold there; machine spreading
keeps bypassing it, since that bound is durability, not cosmetic skew.
2026-06-07 14:14:45 -07:00
Chris Lu 6e16994615 s3: make lifecycle TTL fast path per-bucket opt-in (#9825)
Stamping an Expiration.Days rule as a volume TTL at write time bakes an
irreversible TTL into the object: removing or lengthening the rule later
can't un-expire it, unlike worker-driven expiration. The metadata-only
delete it enables also skips per-chunk DeleteFile, so dead bytes linger in
a not-yet-expired TTL volume with no deleted-byte accounting until the
whole volume ages out.

Gate the resolver on a per-bucket flag, off by default; toggle with the
s3.bucket.lifecycle.fastpath shell command. Default writes take the worker
path: real deletes that honor current policy and let vacuum reclaim space.
2026-06-06 11:20:15 -07:00
Chris Lu ca81c0c525 fix(ec): pass per-volume data-shard count to the parity-shard split (#9781)
* fix(ec): pass per-volume data-shard count to the parity-shard split

ShardsInfo.DeleteParityShards/MinusParityShards looped ids 10..13, assuming
the fixed 10+4 layout. For a non-default ratio this splits data vs parity
wrong — a wide ratio (12+4, 16+6) drops real data ids >= 10, which breaks
ec.decode. They now take a dataShards argument (<= 0 falls back to
DataShardsCount) and clear ids dataShards..MaxShardCount. ec.decode threads
the data-shard count from collectEcNodeShardsInfo to both split call sites,
and admin LogicalSize passes DataShardsCount.

Also: EC cleanup now sets an explicit per-disk storage impact
(-len(ShardIds)) instead of falling back to the TotalShardsCount constant,
so freed-capacity accounting matches the shards actually removed.

OSS is always 10+4, so behavior is unchanged here; this keeps the split
ratio-correct and the API aligned with the enterprise per-volume override.
Adds parity-split ratio tests.

* ec: clear parity shards in one locked pass

Address review: DeleteParityShards looped si.Delete, taking the lock once per
id. shards is sorted by Id and shardBits is a bitmap, so mask off the high
bits and truncate the sorted slice at the first parity id (binary search) under
a single lock. Preserves the dataShards<=0 -> DataShardsCount default.
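
A minimal sketch of the single-pass split described above, assuming a uint32 bitmap and a sorted id slice. `shardSet`, `newShardSet`, `dropParity`, and `defaultDataShards` are hypothetical stand-ins, not the real ShardsInfo API.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// defaultDataShards stands in for the DataShardsCount fallback (10 in 10+4).
const defaultDataShards = 10

// shardSet is a simplified stand-in: a bitmap plus a slice sorted by id.
type shardSet struct {
	mu   sync.Mutex
	bits uint32 // bit i set means shard id i is present
	ids  []int  // sorted ascending
}

// newShardSet builds a set holding ids 0..n-1 (n must be at most 32 here).
func newShardSet(n int) *shardSet {
	s := &shardSet{}
	for id := 0; id < n; id++ {
		s.bits |= 1 << uint(id)
		s.ids = append(s.ids, id)
	}
	return s
}

// dropParity removes every id >= dataShards under one lock: mask off the
// high bits, then truncate the sorted slice at the first parity id.
func (s *shardSet) dropParity(dataShards int) {
	if dataShards <= 0 {
		dataShards = defaultDataShards
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.bits &= (uint32(1) << uint(dataShards)) - 1
	s.ids = s.ids[:sort.SearchInts(s.ids, dataShards)]
}

func main() {
	wide := newShardSet(16) // a 12+4 layout
	wide.dropParity(12)
	fmt.Println(len(wide.ids), wide.bits == 0xFFF) // 12 true
}
```

With a hardcoded 10..13 range, the same 12+4 set would lose data ids 10 and 11 and keep parity ids 14 and 15.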
2026-06-01 19:25:15 -07:00
Chris Lu 2386fa550a grpc: don't tear down the shared master connection on a caller's own timeout (#9775)
A Canceled/DeadlineExceeded from the caller's per-request context was
treated like a dead channel: it closed the shared cached ClientConn and
cancelled every other in-flight RPC on it with "the client connection is
closing". Under a burst of concurrent chunk assigns (e.g. a large S3
multipart upload) one slow assign hitting its 10s attempt timeout could
poison the connection for all the rest, cascading into a flood of 500s.

Thread the caller's context into shouldInvalidateConnection and only
invalidate on Canceled/DeadlineExceeded while that context is still live,
which isolates the genuine stale-channel signal (a peer restart behind a
k8s Service VIP). To carry the context, add a ctx parameter to the
existing WithGrpcClient, WithMasterClient, and WithMasterServerClient; the
master assign and volume-lookup paths pass their per-attempt context and
every other caller passes context.Background().
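
The decision described above can be sketched with the standard library's context sentinels. This is an assumption-laden illustration: the real check inspects gRPC errors, and `isStaleChannelSignal` and `demo` are hypothetical names that model only the Canceled/DeadlineExceeded case.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// isStaleChannelSignal reports whether a Canceled/DeadlineExceeded error
// should be blamed on the shared connection rather than on the caller.
// If the caller's own context has already ended, the error is the caller's
// timeout or cancellation and says nothing about the channel.
func isStaleChannelSignal(ctx context.Context, err error) bool {
	if !errors.Is(err, context.Canceled) && !errors.Is(err, context.DeadlineExceeded) {
		return false // other error kinds are outside this sketch
	}
	return ctx.Err() == nil
}

// demo exercises both cases and returns the two decisions.
func demo() (ownTimeout, stalePeer bool) {
	// The caller's context has ended: do not tear down the shared connection.
	ended, cancel := context.WithCancel(context.Background())
	cancel()
	ownTimeout = isStaleChannelSignal(ended, context.Canceled)

	// The caller's context is still live, yet the RPC reports a deadline
	// error: treat it as the stale-channel signal.
	stalePeer = isStaleChannelSignal(context.Background(), context.DeadlineExceeded)
	return ownTimeout, stalePeer
}

func main() {
	fmt.Println(demo()) // false true
}
```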
2026-06-01 15:11:02 -07:00
Chris Lu 8c60408bfb s3: auto-enforce bucket quota read-only both ways (#9774)
* s3: auto-enforce bucket quota read-only both ways

Quota read-only only ever flipped when an admin re-ran
s3.bucket.quota.enforce, so a bucket that went over quota stayed
read-only forever even after usage dropped back under.

Fold enforcement into the per-minute, leader-locked bucket-size loop
the s3 gateway already runs for metrics: it now flips each bucket's
read-only flag to match its quota in both directions, rewriting
filer.conf only when a flag actually changes. The set/clear decision
lives in one shared FilerConf.ApplyBucketQuotaReadOnly helper so the
shell command and the gateway can't drift.

* only manage read-only when a quota is set, never clobber manual locks

* trim comments
2026-06-01 13:11:18 -07:00
Chris Lu f9ee49b03e shell: volume.fsck must not skip the system-log subtree (#9764)
shell: only skip system-log subtree in fs.meta.save, not fsck/verify

The SystemLogDir skip lived in the shared BFS traversal, so volume.fsck
built its in-use set without the /topic/.system/log chunks and flagged
every referenced log needle as orphan. -reallyDeleteFromVolume would then
delete live log data and leave dangling filer entries. Gate the skip
behind a flag that only fs.meta.save sets.
2026-06-01 09:54:22 -07:00
Chris Lu 9658f309d2 EC bitrot detection: per-shard checksum sidecars (#9761)
* ec: add EC bitrot checksum protobuf

EcBitrotProtection/EcShardChecksums/ChecksumAlgorithm sidecar messages,
copy_ecsum_file and unsafe_ignore_sidecar fields, and a CHECKSUM scrub mode.

* ec: bitrot checksum sidecar format, validation, and per-volume load

Per-shard CRC32C block checksums in an optional <base>.ecsum sidecar with a
self-integrity header; validation, rolling builder, backfill primitive, and
EcVolume load on mount + removal on destroy.
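
As a rough illustration of per-block CRC32C checksums over a shard, using Go's standard hash/crc32 package. The sidecar's actual header and layout are not modeled here, and `blockChecksums` and `firstCorruptBlock` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is the CRC32C polynomial table.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// blockChecksums returns one CRC32C per blockSize-byte block of shard;
// the last block may be shorter.
func blockChecksums(shard []byte, blockSize int) []uint32 {
	if blockSize <= 0 {
		return nil
	}
	var sums []uint32
	for off := 0; off < len(shard); off += blockSize {
		end := off + blockSize
		if end > len(shard) {
			end = len(shard)
		}
		sums = append(sums, crc32.Checksum(shard[off:end], castagnoli))
	}
	return sums
}

// firstCorruptBlock recomputes the checksums and returns the index of the
// first block that disagrees with want, or -1 when everything matches.
func firstCorruptBlock(shard []byte, blockSize int, want []uint32) int {
	got := blockChecksums(shard, blockSize)
	for i := range got {
		if i >= len(want) || got[i] != want[i] {
			return i
		}
	}
	if len(want) > len(got) {
		return len(got) // the shard is shorter than the recorded checksums
	}
	return -1
}

func main() {
	data := []byte("0123456789")
	want := blockChecksums(data, 4) // blocks: "0123", "4567", "89"
	data[5] ^= 0xFF                 // flip bits inside the second block
	fmt.Println(firstCorruptBlock(data, 4, want)) // 1
}
```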

* ec: capture per-shard checksums at encode; verify-and-exclude on rebuild

WriteEcFilesWithContext returns the protection computed inline during encoding.
generateMissingEcFiles verifies present inputs against the sidecar, excludes
corrupt ones, regenerates in place, and re-verifies; fail-closed unless
unsafe_ignore_sidecar, removing all generated outputs on failure.

* ec: read-only checksum scrub with Reed-Solomon arbiter

ChecksumScrub verifies each local shard against the sidecar and reconstructs
flagged shards from the clean shards so stale-sidecar false positives are not
reported. Wired to the gRPC CHECKSUM mode and ec.scrub -mode checksum.

* ec: server-side bitrot sidecar write, copy, cleanup, and opportunistic backfill

Write .ecsum at fresh encode; propagate it with copy_ecsum_file (tolerant);
remove it on full delete and decode; rebuild honors unsafe_ignore_sidecar and
opportunistically backfills a sidecar when all shards are reachable.

* ec: volume server bitrot config flags

-ec.bitrotChecksum (default on) and -ec.bitrotBlockSizeMB (default 16).

* fix(ec_bitrot): bound -ec.bitrotBlockSizeMB before the int64 multiply

Validate the MiB value is in [1, 1024] before multiplying by 1 MiB, so a huge
flag value cannot overflow int64 and slip past the power-of-two check, and a
block size cannot collapse a sidecar to a few oversized blocks.

* fix(ec_bitrot): distribute the .ecsum sidecar from the worker encode path

The worker EC encode wrote the generation-0 sidecar locally but never added it
to shardFiles, so DistributeEcShards never shipped it and the distributed
holders came up unprotected. Append it to shardFiles and map the ecsum shard
type to its extension in the sender so it travels with the shards.

* fix(ec_bitrot): remove orphaned sidecars when the generation is gone

Gate sidecar removal on existingShardCount==0 alone rather than also requiring a
stray .ecx. A sidecar whose shards have all been deleted is orphaned and must be
removed even when no .ecx remains, or it leaks. .ecx/.ecj/.vif removal stays
gated on hasEcxFile as before.

* fix(ec_bitrot): do not fold checksum blocks scanned into TotalFiles

ChecksumScrub's first return is blocks scanned, not files. Discard it so the
scrub response's TotalFiles (a needle/file count) is not inflated by the block
count for CHECKSUM mode.

* test(ec_bitrot): clean up generated .ecsum sidecars in removeGeneratedFiles

* fix(ec_bitrot): reject an oversized sidecar payload before the uint32 cast

The header stores payload_len as a uint32; bound the payload before the
conversion so a pathological manifest cannot truncate the length field and
corrupt the sidecar. A real manifest is a few KB, so this never trips.

* fix(ec_bitrot): cap -ec.bitrotBlockSizeMB at 64 MiB

The block size becomes the per-shard scratch buffer the scrub/backfill path
allocates, so an over-large value (e.g. 1 GiB) is a memory hazard per concurrent
scrub worker. Lower the upper bound from 1024 to 64 MiB.
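
A hedged sketch of the validation order described in these fixes: range-check the MiB value first, then the power-of-two check, and only then multiply. `blockSizeBytes` is a hypothetical name and the error wording is illustrative.

```go
package main

import "fmt"

const (
	mib            = int64(1) << 20
	maxBlockSizeMB = 64 // upper bound after the cap was lowered from 1024
)

// blockSizeBytes validates a block size given in MiB and converts it to
// bytes. The range check runs before the multiply, so a huge value can
// neither overflow int64 nor reach the power-of-two check.
func blockSizeBytes(sizeMB int64) (int64, error) {
	if sizeMB < 1 || sizeMB > maxBlockSizeMB {
		return 0, fmt.Errorf("block size %d MiB is outside [1, %d]", sizeMB, maxBlockSizeMB)
	}
	if sizeMB&(sizeMB-1) != 0 {
		return 0, fmt.Errorf("block size %d MiB is not a power of two", sizeMB)
	}
	return sizeMB * mib, nil
}

func main() {
	fmt.Println(blockSizeBytes(16)) // 16777216 <nil>
	_, err := blockSizeBytes(1 << 62)
	fmt.Println(err != nil) // true: rejected before any multiplication
}
```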

* fix(ec_bitrot): add -ecUnsafeIgnoreSidecar to weed tool fix -ecx

The -ecx recovery path reconstructs missing shards via RebuildEcFilesWithContext,
which fails closed on a malformed/stale .ecsum. Without an override flag an
operator could not complete the rebuild without manually deleting the sidecar.
Expose -ecUnsafeIgnoreSidecar (default false) and thread it through.

* fix(ec_bitrot): bound sidecar payload with a direct int constant; drop readFull

Guard len(payload) against a plain int constant (1 GiB) before the allocation
instead of a uint64 MaxUint32 compare, so the allocation-size value is provably
bounded (clears the CodeQL overflow alert) and the math import is no longer
needed. Inline os.File.ReadAt with io.EOF handling in verifyShardFileBlocks and
remove the now-redundant readFull helper (os.File.ReadAt fills the slice or
errors).

* test(ec_bitrot): use slices.Contains instead of a hand-rolled containsU32

* refactor(ec): fold the EcFiles WithContext variants into the base functions

RebuildEcFiles now takes the *ECContext directly (nil => derive from .vif as
before) and WriteEcFiles takes it too (nil => default), removing the parallel
RebuildEcFilesWithContext / WriteEcFilesWithContext names. Callers that had an
explicit context drop the WithContext suffix; the default-context callers pass
nil. No behavior change.

* refactor(ec): pass BackgroundECContext instead of nil to Write/RebuildEcFiles

Add a non-nil BackgroundECContext placeholder (analogous to context.Background())
and have callers with no specific layout pass it instead of a nil *ECContext.
WriteEcFiles resolves a zero/background context to the default ratio and
RebuildEcFiles resolves it from the .vif, so behavior is unchanged.

* fix(ec_bitrot): make BackgroundECContext a func; RebuildEcFiles fails closed on bad .vif

- BackgroundECContext is now a function returning a fresh *ECContext, so callers
  cannot mutate a shared singleton or race on it (and it mirrors context.Background,
  which is also a function).
- RebuildEcFiles now propagates the MaybeLoadVolumeInfo error: a present-but-
  unreadable .vif fails closed instead of silently rebuilding with the default
  ratio (which would corrupt a custom-ratio volume). Pass an explicit ctx to override.
2026-05-31 18:52:44 -07:00
Chris Lu fdfeb4063c shell: warn in volume.list when a volume id spans collections (#9759)
* shell: warn in volume.list when a volume id spans collections

A reused volume id, the result of the master handing out an id already
used by another collection (for example after losing its max-volume-id
counter on restart), makes collection.delete destroy the wrong
collection's data and makes any bare-id lookup, move, or vacuum
ambiguous. volume.list now scans the full topology and warns on ids
present in more than one collection so the clash is visible before any
destructive operation.

* volume.list: track duplicate ids lazily, sort with slices.Sort

Allocate the per-id collection set only on the first cross-collection clash
instead of one set per volume, so allocations scale with duplicates rather
than the volume count.
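
One way to track duplicates lazily, sketched under assumptions: `volumeRef` and `crossCollectionIDs` are hypothetical, and the real command walks the master topology rather than a flat slice.

```go
package main

import (
	"fmt"
	"slices"
)

// volumeRef is a simplified (id, collection) pair for illustration.
type volumeRef struct {
	ID         uint32
	Collection string
}

// crossCollectionIDs returns, for each volume id seen in more than one
// collection, the sorted list of those collections. A per-id set is
// allocated only when the first clash for that id is found.
func crossCollectionIDs(vols []volumeRef) map[uint32][]string {
	first := make(map[uint32]string, len(vols))
	var clashes map[uint32]map[string]struct{} // nil until a clash appears
	for _, v := range vols {
		c, seen := first[v.ID]
		if !seen {
			first[v.ID] = v.Collection
			continue
		}
		if c == v.Collection {
			continue // another replica in the same collection
		}
		if clashes == nil {
			clashes = make(map[uint32]map[string]struct{})
		}
		set := clashes[v.ID]
		if set == nil {
			set = map[string]struct{}{c: {}}
			clashes[v.ID] = set
		}
		set[v.Collection] = struct{}{}
	}
	out := make(map[uint32][]string, len(clashes))
	for id, set := range clashes {
		names := make([]string, 0, len(set))
		for name := range set {
			names = append(names, name)
		}
		slices.Sort(names)
		out[id] = names
	}
	return out
}

func main() {
	vols := []volumeRef{{1, "a"}, {2, "a"}, {1, "b"}, {1, "a"}, {3, ""}}
	fmt.Println(crossCollectionIDs(vols)) // map[1:[a b]]
}
```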
2026-05-31 11:52:39 -07:00
Chris Lu 5955972fe6 fix(shell): verify volume.merge output before overwriting replicas (#9731)
* fix(shell): verify volume.merge output before overwriting replicas

volume.merge overwrote every replica with the merged copy without checking it
was complete. Read back the merged copy and refuse to overwrite unless it holds
at least as many live needles as the most complete source replica, leaving the
originals intact on a short or empty merge.

* fix(shell): keep merged volume until all replicas are rebuilt

On a copy failure partway through the overwrite loop, the temporary merged copy
was deleted along with the half-rebuilt replicas. Stop deleting it until every
replica has been rebuilt; on failure the verified copy is kept so the merge can
be re-run to completion.

* refactor(shell): reuse readVolumeStatus in ensureVolumeReadonly

* fix(shell): guard against nil volume status response
2026-05-28 19:29:25 -07:00
Chris Lu 24e664d651 fix(shell): don't halt volume.fsck purge on a stuck read-only volume (#9714)
* fix(shell): don't halt volume.fsck purge on a stuck read-only volume

A failed VolumeMarkWritable on one volume aborted the entire fsck purge
run; per-volume errors now log and continue so remaining volumes still
get purged.

* fix(shell): unify volume.fsck per-volume skip logging at the caller

Return the mark-writable error from purgeOneVolume instead of logging
in two places — the caller already prints "skip purging volume N: %v"
and defers still fire on the error return.

* fix(shell): collect volume.fsck purge-skipped volumes and report at end

Track volume IDs whose purge was skipped (mark-writable failure or
other per-volume errors) and print a sorted summary so operators don't
have to scrape the run log to find them. Deletes for those volumes are
already skipped; this just makes them explicit.
2026-05-27 17:49:35 -07:00
Chris Lu d4e39b499b EC placement: shared replica-placement resolver, snapshot + Place core, capacity fixes, tiering (#9621)
* Add shared super_block.ResolveReplicaPlacement; use it in ec_balance

* Add ecbalancer.FromActiveTopology snapshot constructor for EC encode/repair

* Add ecbalancer.Place greenfield/repair placement core (strict + durability-first)

* topology: add GetEffectiveAvailableEcShardSlots; FromActiveTopology uses shard-granular free slots

GetDisksWithEffectiveCapacity flattens reserved shard slots into volume slots via
integer truncation, so an in-flight EC task reserving a non-multiple-of-
DataShardsCount number of shards was lost from the snapshot and freeSlots was
over-reported. GetEffectiveAvailableEcShardSlots subtracts the full reservation
impact at shard granularity.

* ecbalancer.Place: reject nodes without a free disk of the requested type

FromActiveTopology keeps all disk types in the snapshot, so an SSD-only request
could be routed to a node with only HDD capacity (pickBestDiskOnNode then returns
disk 0 on the wrong tier). Filter rack/node selection to those with a free disk
of the requested type.

* ecbalancer.Place: enforce ReplicaPlacement DiffDataCenterCount (per-DC shard cap)

* ecbalancer: enforce DiffDataCenterCount in balance (cross-DC phase + cross-rack DC cap)

Adds a cross-DC corrective phase that drains data centers holding more than
DiffDataCenterCount shards of a volume, and a per-DC cap on cross-rack move
targets. Both are no-ops when DiffDataCenterCount is unset, so balance output is
unchanged for non-DC placements.

* topology: ratio-aware EC shard slots and provisional empty-disk slot

GetEffectiveAvailableEcShardSlots now takes the target collection's data-shard
count, so a 4+2 volume's larger shards are not over-counted at 10 per volume slot;
and it keeps the one provisional slot for freshly started empty servers that
report max=0, matching getEffectiveAvailableCapacityUnsafe. FromActiveTopology
threads the ratio through.

* ecbalancer.Place: explicit disk-type filter signal (fix HDD vs any ambiguity)

HardDriveType normalizes to "", which collided with "" meaning any disk. Add
Constraints.FilterDiskType and normalize both sides so a hdd request matches disks
reported as "" and never leaks to SSD, while filter=false still means any.

* ecbalancer: add clearShardAccounting for repair snapshot reconciliation

Clears one disk's copy of a shard from per-domain accounting and recomputes the
node-level union (preserving a kept copy on another disk of the same node), without
crediting capacity. Repair uses it to drop to-be-deleted copies before placing
missing shards.

* ecbalancer: don't cap cross-DC target racks when DiffRackCount is unset

len(racks)+1 wrongly limited each target rack (3 in a 2-rack cluster), so draining
a DC could stop short of the DiffDataCenterCount cap. Use MaxShardCount+1 as the
effectively-unlimited default.

* topology/ecbalancer: ratio-correct EC capacity accounting

Reservation shard slots (default ShardsPerVolumeSlot units) are now converted to
the target ratio before subtracting, and existing EC shards are charged by size
(targetDataShards/shardDataShards) so a 2+1 shard isn't counted as one 10+4 slot.
Per-shard ratio lookup is behind shardDataShards (OSS uses the standard ratio).

* ecbalancer.Place: candidate tiering and eligible-rack caps

Adds a per-disk eligibility/preference abstraction so Place supports:
- preferred-tag whole-plan retry (try disks carrying the earliest tags first,
  widen to all only if a tier cannot place every shard; reports
  SpilledOutsidePreferredTags),
- soft disk-type spill via DiskTypePolicy (Any/Prefer/Require): Prefer fills the
  preferred type then spills, reporting SpilledToOtherDiskType; Require filters,
- even per-rack caps that divide by racks holding an eligible disk, so a tiered
  cluster (e.g. SSDs in 2 of 4 racks) isn't capped impossibly low.
Disk tags carried via Node.AddDiskTags + FromActiveTopology.

* ecbalancer: export ClearShardAccounting for repair snapshot reconciliation

* ecbalancer: address review feedback (ratio rounding, bitmap walk, same-DC moves)

- topology/ecbalancer: round shard-reservation and existing-shard footprint up
  when converting to target-ratio shard slots, so a sub-slot reservation is not
  truncated to zero and free capacity is not overstated for low-data-shard
  layouts (targetDataShards < ds).
- erasure_coding: add ShardBits.All iterator and use it across the balancer,
  cross-DC phase, and placement scoring instead of scanning 0..MaxShardCount and
  probing Has on every id.
- ecbalancer: allow same-DC cross-rack moves when a DC already sits at its
  DiffDataCenterCount cap; a same-DC move leaves the DC total unchanged. Add a
  regression test that fails without the guard.
- ecbalancer cross-DC phase: pick targets via the eligible-aware
  pickNodeInRackEligible/pickBestDiskEligible helpers so the disk-type filter is
  honored and a 0 disk id is not mistaken for a valid selection.

* ecbalancer: test ecShardSlotsOnDisk fractional round-up

Cover the mixed-ratio path (targetDataShards < existing data shards) so a
shard's fractional footprint is never floored to zero and free capacity is not
overstated. Exercises the round-up via the targetDataShards parameter; OSS uses
the standard ratio at runtime while the enterprise build hits it with real
per-volume ratios.
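
The round-up can be illustrated as follows, assuming a shard of a volume with shardDataShards data shards occupies targetDataShards/shardDataShards target-ratio slots. `ecShardSlots` is a hypothetical helper, and whether the real code rounds per shard or per disk total is not stated in the message.

```go
package main

import "fmt"

// ecShardSlots converts a number of existing shards into target-ratio shard
// slots, rounding up so a fractional footprint is never floored to zero.
func ecShardSlots(shards, shardDataShards, targetDataShards int) int {
	if shards <= 0 {
		return 0
	}
	if shardDataShards <= 0 || targetDataShards <= 0 {
		return shards // no ratio information: charge one slot per shard
	}
	return (shards*targetDataShards + shardDataShards - 1) / shardDataShards
}

func main() {
	// One shard of a 10+4 volume measured in 4+2 slots: 4/10 rounds up to 1.
	fmt.Println(ecShardSlots(1, 10, 4)) // 1 (integer truncation would give 0)
	// One shard of a 2+1 volume measured in 10+4 slots: 10/2 = 5.
	fmt.Println(ecShardSlots(1, 2, 10)) // 5
}
```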

* ecbalancer: assert node B rack in TestFromActiveTopology

* ecbalancer: split Destination into separate DataCenter and bare Rack

Replace the composite "dc:rack" Rack field on Destination with separate
DataCenter and bare Rack values, matching topology.DiskInfo and the worker-task
convention. Callers (and tests) read the data center directly instead of parsing
the composite with strings.SplitN.

* shell ec.balance: use utilization-based global balancing (parity with worker)

The shell's global rebalance phase balanced by raw shard count; switch it to
fractional fullness (shards/capacity), as the worker already does. On uniform
capacity the two agree; on heterogeneous capacity it fills nodes proportionally
instead of driving small-capacity nodes toward full.

Updates the heterogeneous-capacity regression test to assert even fullness
(~equal shards/capacity per node) rather than even shard count.

* ecbalancer: bounded-proportional per-DC shard spread

DiffDataCenterCount was enforced only as a ceiling (drain-to-cap), which could
leave a within-cap-but-lopsided DC distribution under a loose cap (e.g. 10/4 of 14
with cap=10). Now the cross-DC phase, the cross-rack DC guard, and Place all target
boundedMaxPerDC = min(DiffDataCenterCount, max(ceil(total/numDCs), parityShards)):
shards spread proportionally across DCs, but no tighter than the durability floor
(once each DC holds <= parityShards a DC loss is recoverable, so further spreading
only adds cross-DC/WAN traffic). No-op when DiffDataCenterCount is 0; identical to
before when the cap is the binding constraint.

* ecbalancer: drop DiffDataCenterCount enforcement for EC placement

The 1-byte volume ReplicaPlacement packs xyz into x*100+y*10+z<=255, so the DC
digit can only be 0-2 -- far too small to be a meaningful per-DC EC shard cap (a
cap of 1-2 would demand 7-14 DCs for a 10+4 volume). It's volume replica-placement,
not an EC spec. Removes the cross-DC balance phase, the DC guard in the cross-rack
phase, and the per-DC cap in Place (and the just-added bounded-proportional logic);
EC relies on the RP-independent rack/node even spread instead. Rack/node caps
(DiffRackCount/SameRackCount) are unchanged. Per-domain EC caps are left for a real
EC placement spec.

* ecbalancer: enforce per-disk durability cap; symmetric reserve/release

Place now refuses to put more than parityShards shards of a volume on a single
disk (pickBestDiskEligible skips a disk once it holds parityShards of the volume,
a hard cap not relaxed even in durability-first). Previously Place assigned by
free capacity, so a skewed near-full cluster could pile >parityShards onto one
disk -> losing it loses the volume; only distinct-disk count was checked. This
covers encode and repair (both route through Place); the caller skips/leaves the
volume rather than minting an unrecoverable layout.

Also makes reserveShard decrement freeSlots unconditionally, symmetric with
releaseShard's unconditional increment (the old guarded decrement could credit a
phantom slot on release if a shard were ever reserved onto a full disk).

* ecbalancer: add Topology.ReleaseVolumeShards (clear + credit) for greenfield encode

Releases all of a volume's shards from the snapshot and credits the freed disk
capacity, so a greenfield encode can plan as if stale EC shards from a prior failed
attempt are gone. Safe to credit because the encode task deletes stale shards
(cleanupStaleEcShards) before distributing the new ones. Distinct from
ClearShardAccounting (repair), which does not credit.

* ecbalancer: ReleaseVolumeShards credits node freeSlots, not just disks

releaseShard only increments per-disk freeSlots, but rack capacity is summed from
node freeSlots (buildRacks) and node freeSlots gates node eligibility. Crediting
only disks left a node/rack looking full after releasing stale shards, so a
greenfield encode still couldn't use the freed capacity. Now credits the node by
the total disk-slots freed.

* ecbalancer: correct PlacementMode docs (encode uses durability-first)

PlaceStrict was labeled '(encode)' but encode uses PlaceDurabilityFirst. Clarify
that durability-first is used by both encode and repair, reports relaxations in
PlaceResult.Relaxed, and never relaxes the per-disk durability cap.

* ecbalancer: treat SameRackCount as a direct per-node shard cap

The 3rd ReplicaPlacement digit now caps shards per node at exactly the digit
value, matching how DiffRackCount (2nd digit) caps per rack, instead of allowing
digit+1 per node. This makes the per-rack and per-node caps consistent and
matches the documented "digits cap EC shards per rack and per node" semantics;
e.g. 011 now means at most one shard per rack and one per node.
2026-05-22 20:22:09 -07:00
Chris Lu cd15ae1395 fix(ec): bring ec.encode worker and EC/volume helpers to parity with shell (#9599)
* refactor(volume): extract replica sync/select into shared volume_replica package

Move the volume replica reconciliation helpers (status, union builder,
SyncAndSelectBestReplica, ReadNeedleMeta) out of the shell into a new
weed/storage/volume_replica package so both the shell (ec.encode, volume.tier.move,
volume.check.disk) and the EC encode worker can reuse them. No behavior change.

* fix(ec): bring ec.encode worker to parity with the shell

- Sync replicas and encode the most-complete one (via the shared
  volume_replica.SyncAndSelectBestReplica) instead of a possibly-stale replica,
  marking all replicas readonly first. Prevents silent data loss when a stale
  replica is encoded and the originals deleted.
- Skip remote/tiered volumes in detection (shell ec.encode excludes them).
- Min-node safety gate: refuse to encode when cluster nodes < parity shards.
- Align default thresholds with the shell (fullness 0.95, quiet 1h).

* fix(vacuum): plugin path honors min_volume_age_seconds override

deriveVacuumConfig hard-coded MinVolumeAgeSeconds=0, dropping any configured
value. Read it from worker config (default 0, matching the shell/master vacuum
which has no age gate) so an explicit override is honored.

* address review feedback

- config.go: align GetConfigSpec schema defaults (quiet_for_seconds=3600,
  fullness_ratio=0.95) with the runtime defaults so UI/bootstrap flows match the
  shell (coderabbitai).
- ec_task.go: roll back readonly when markReplicasReadonly fails partway, so
  already-marked replicas don't stay readonly (coderabbitai).
- volume_replica: pass the caller's replica statuses into buildUnionReplica instead
  of re-fetching them, and skip the per-needle ReadNeedleMeta RPC when the source
  replica is read-only (gemini-code-assist).

* test(plugin_workers/ec): make fixtures eligible under the new defaults

The default EC encode thresholds were raised to match the shell (fullness 0.95,
quiet 1h), but the plugin-worker integration fixtures still used 90%-full /
10-minute-old volumes, so detection found no eligible volumes and the tests failed
in CI. Bump the eligible fixtures to 96% full and 2h old.
2026-05-21 02:16:28 -07:00
Chris Lu 391f543ff2 fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes

The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.

Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.

* fix(ec): attribute EC balance source disk per shard and reject same-node moves

On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.

Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.

* fix(ec): spread EC balance moves across destination disks

Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
  - fewer total EC shards on the disk,
  - far fewer shards of the same volume on the disk (spread a volume's shards
    across disks for fault tolerance), and
  - data/parity anti-affinity (a data shard avoids disks holding the volume's
    parity shards and vice versa).

Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.

* fix(ec): bring EC balance worker to parity with shell ec.balance

The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:

- Cross-rack and within-rack now run a two-pass balance: data shards spread
  first, then parity shards spread while avoiding racks/nodes that already hold
  the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
  and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
  constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
  keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
  10+4 here), replacing the previously hardcoded constant at the call sites.

Selection is deterministic (sorted keys) to keep behavior reproducible.

* refactor(ec): extract shared ecbalancer package for shell and worker

The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.

- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
  within-rack data/parity two-pass with anti-affinity, global per-rack balance,
  and diversity-aware disk selection) over a caller-built Topology snapshot and
  returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
  task proposals; the shell builds it from its EcNode model and executes Moves
  via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
  each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
  (worker balances by fractional fullness; shell by raw count) and
  GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
  in one pass).

The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.

* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor

Address regressions and gaps from the ecbalancer extraction:

- Shell ec.balance honors -maxParallelization again: planned moves run phase by
  phase (preserving cross-phase dependencies) with bounded concurrency within a
  phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
  updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
  instead of combined totals, so a data/parity skew is corrected even when the
  per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
  custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
  move, so limited-capacity targets are no longer over-planned.

* fix(ec): make EC dedup keeper deterministic and capacity-aware

When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
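
A minimal sketch of the keeper rule described above, under stated assumptions: the `holder` type and `pickKeeper` function are hypothetical names, not the ecbalancer API, and "tie-break on node id" is assumed here to mean the smallest id wins.

```go
package main

import "fmt"

// holder is a hypothetical record of one node holding a copy of a shard.
type holder struct {
	NodeID    string
	FreeSlots int
}

// pickKeeper returns the index of the copy to keep: the node with the most
// free slots, with ties broken by the smallest node id (an assumption).
// It returns -1 for an empty slice.
func pickKeeper(hs []holder) int {
	best := -1
	for i, h := range hs {
		if best < 0 ||
			h.FreeSlots > hs[best].FreeSlots ||
			(h.FreeSlots == hs[best].FreeSlots && h.NodeID < hs[best].NodeID) {
			best = i
		}
	}
	return best
}

func main() {
	copies := []holder{{"node-b", 2}, {"node-a", 5}, {"node-c", 5}}
	// Every copy except the keeper would be deleted as a duplicate.
	fmt.Println(copies[pickKeeper(copies)].NodeID)
}
```

Because the comparison never depends on input order, the same topology always yields the same keeper.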

* fix(ec): restore global volume-diversity and per-volume move serialization

Two more behaviors lost in the ecbalancer refactor:

- Global rack balancing again prefers moving a shard of a volume the destination
  does not hold at all before adding another shard of an already-present volume
  (two-pass, mirroring the old balanceEcRack), keeping each volume's shards
  spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
  while still running different volumes in parallel, so concurrent moves of the
  same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.

* fix(ec): key EC balance shards by (collection, volume id)

A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
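
To illustrate the keying change, here is a small sketch. The `volumeKey` and `nodeShards` types are hypothetical and only stand in for the ecbalancer's internal model; the point is that a struct key keeps equal volume ids in different collections apart.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id.
type volumeKey struct {
	Collection string
	VolumeID   uint32
}

// nodeShards maps each volume to a bitmap of the shard ids a node holds.
type nodeShards map[volumeKey]uint32

// add records that the node holds shardID of the given volume.
func (n nodeShards) add(collection string, vid uint32, shardID uint) {
	n[volumeKey{collection, vid}] |= 1 << shardID
}

func main() {
	shards := nodeShards{}
	shards.add("photos", 7, 0)
	shards.add("logs", 7, 0) // same vid, different collection
	// With a vid-only key these two would merge into one entry and
	// shard 0 would look duplicated.
	fmt.Println(len(shards))
}
```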

* fix(ec): credit freed capacity from dedup before later balance phases

Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).

* test(ec): add multi-disk EC balance integration test

Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.

* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes

collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
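
The overwrite-versus-union difference can be sketched as below. The `shardInfo` type and `unionByServer` function are hypothetical stand-ins for the per-disk EcShardInfo records and for collectEcNodeShardsInfo, whose real signatures are not shown here.

```go
package main

import "fmt"

// shardInfo is a hypothetical per-disk record: bit i of Bits is set when
// shard i of the volume is present on that disk.
type shardInfo struct {
	Server string
	DiskID uint32
	Bits   uint32
}

// unionByServer ORs the per-disk bitmaps so each server ends up with the
// full set of shards it holds across all of its disks.
func unionByServer(infos []shardInfo) map[string]uint32 {
	out := make(map[string]uint32)
	for _, in := range infos {
		// Plain assignment here would keep only the last disk's shards.
		out[in.Server] |= in.Bits
	}
	return out
}

func main() {
	infos := []shardInfo{
		{"s1", 0, 0b0000011}, // shards 0,1 on disk 0
		{"s1", 1, 0b0001100}, // shards 2,3 on disk 1
		{"s2", 0, 0b1110000}, // shards 4,5,6
	}
	got := unionByServer(infos)
	fmt.Printf("%07b %07b\n", got["s1"], got["s2"])
}
```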

Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.

* test(ec): end-to-end multi-disk EC balance shard-loss regression

Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.

* fix(ec): bring EC balance worker/plugin path to parity with shell

- Per-volume serialization and phase order: key the plugin proposal dedupe by
  (collection, volume) instead of (volume, shard, source), so the scheduler runs
  only one of a volume's moves at a time (within a run and against in-flight jobs).
  Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars.
  Because the planner emits a volume's moves in phase order, they now also
  execute in order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
  a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
  instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
  read it into the worker config, mirroring ec.balance -shardReplicaPlacement.

* test(ec): rename worker in-process test (not a real integration test)

The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.

* ci(ec): remove redundant ec-integration workflow

ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.

* fix(ec): worker falls back to master default replication for EC balance

For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.

The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.

* fix(ec): plugin path populates master default replication too

The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.

* docs(ec): empty shard replica placement uses master default, not even spread

The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
2026-05-20 23:31:21 -07:00
Chris Lu 4385b86bf1 fix(shell): volumeServer.evacuate no longer panics on a nil volume (#9587)
adjustAfterMove now removes the moved volume from the source disk's
VolumeInfos in place: it swaps the entry with the last one and nils the
tail. evacuateNormalVolumes ranges directly over that same slice, so the
nil tail slot is later read as a nil *VolumeInformationMessage and the
move attempt panics on vol.DiskType.

Iterate over a snapshot of the slice so in-place removals during a move
cannot leave nil holes in the loop.
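
A reduced sketch of the fix, with hypothetical types: `volume` stands in for *VolumeInformationMessage, and `removeSwap` and `evacuate` are simplified illustrations rather than the shell's actual functions.

```go
package main

import "fmt"

// volume is a hypothetical stand-in for *VolumeInformationMessage.
type volume struct {
	ID       uint32
	DiskType string
}

// removeSwap removes index i by swapping in the last element and clearing
// the tail slot, then returns the shortened slice.
func removeSwap(vols []*volume, i int) []*volume {
	last := len(vols) - 1
	vols[i] = vols[last]
	vols[last] = nil
	return vols[:last]
}

// evacuate "moves" every volume off the disk. It ranges over a copy of the
// slice, so in-place removals cannot expose nil slots to the loop.
func evacuate(disk *[]*volume) []uint32 {
	snapshot := append([]*volume(nil), *disk...)
	var moved []uint32
	for _, v := range snapshot {
		_ = v.DiskType // safe: snapshot entries are never nil
		for i, cur := range *disk {
			if cur == v {
				*disk = removeSwap(*disk, i)
				break
			}
		}
		moved = append(moved, v.ID)
	}
	return moved
}

func main() {
	disk := []*volume{{1, "hdd"}, {2, "hdd"}, {3, "hdd"}}
	fmt.Println(evacuate(&disk), len(disk))
}
```

Ranging over `*disk` directly in the outer loop would read the slot that `removeSwap` set to nil, which is the panic described above.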
2026-05-20 10:27:00 -07:00
Chris Lu e332b97d52 fix(shell): volume.balance no longer drains all volumes onto one server (#9579)
* fix(shell): volume.balance no longer drains all volumes onto one server

The density-based capacity function reads per-disk VolumeInfos sizes, but
adjustAfterMove only updated VolumeCount and the selectedVolumes map. The
planner re-read a stale topology after every move, so the source node's
density never dropped and it kept moving volumes until that node was empty.

Move the volume's size accounting between disks after each planned move so the
density recomputes and the loop converges to an even distribution.

* refactor(shell): O(1) volume removal and direct disk lookup in adjustAfterMove

removeVolumeInfo swaps with the last element instead of shifting, and the disk
is fetched by key rather than ranging the DiskInfos map.
2026-05-20 01:39:23 -07:00
Chris Lu 41b6ad002b fix(volume.list): show one entry per physical disk on multi-disk nodes (#9541)
* fix(volume.list): show one entry per physical disk on multi-disk nodes

DataNodeInfo.DiskInfos is keyed by disk type, so several same-type
physical disks on one node collapse to a single map entry at the master.
volume.list iterated that map directly and reported one "Disk hdd ...
id:0" line per node, hiding the per-disk volume and shard layout. EC
operators on multi-disk volume servers had no way to verify which
physical disk a shard landed on.

Lift the per-physical-disk split into a DiskInfo.SplitByPhysicalDisk()
method on the proto type so consumers outside admin/topology can use
it. Apply it in writeDataNodeInfo so the verbose Disk block shows one
entry per physical disk, ordered by DiskId. Capacity counters are
split evenly across reconstructed disks since the wire format doesn't
carry per-disk capacity yet.

This is a display-only change. ActiveTopology already did the split on
its own and is now updated to call the shared helper.

* fix(volume.list): preserve totals, count active/remote exactly, dedupe header

Address review feedback on the per-physical-disk split:

- share() truncated remainders, so reconstructed per-disk counters could
  sum to less than the original aggregate (10 split 3 ways gave 3+3+3 = 9).
  Distribute the remainder to the lowest disk ids so MaxVolumeCount and
  FreeVolumeCount sum exactly back to the node totals.
- ActiveVolumeCount and RemoteVolumeCount are derivable per disk from
  the VolumeInfos already grouped by DiskId, so count them exactly
  (ReadOnly=false and RemoteStorageName!="" respectively) instead of
  approximating with an even split.
- writeDataNodeInfo's per-disk callback fired the DataNode header on
  every iteration after the split, so a node with 6 physical disks
  emitted 6 DataNode headers. Guard the callback with headerPrinted so
  the header still appears at most once per node.
- Sort split disks deterministically using explicit DiskId comparison
  to avoid int overflow risk on 32-bit systems.
- Tighten the volume.list test substring to "id:N\n" so unrelated
  tokens like "ec volume id:101" don't accidentally match the id:1
  needle, and assert the rack callback fires once.
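
The remainder handling in the first bullet can be sketched like this. The signature of `share` below is an assumption made for illustration (the commit names the helper but not its parameters), and a non-negative total is assumed.

```go
package main

import "fmt"

// share splits total across n disks. Every disk gets total/n, and the
// first total%n disks (the lowest ids) get one extra, so the parts always
// sum exactly to total. Assumes total >= 0.
func share(total int64, n int) []int64 {
	if n <= 0 {
		return nil
	}
	parts := make([]int64, n)
	base, rem := total/int64(n), total%int64(n)
	for i := range parts {
		parts[i] = base
		if int64(i) < rem {
			parts[i]++
		}
	}
	return parts
}

func main() {
	fmt.Println(share(10, 3)) // [4 3 3]
}
```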
2026-05-18 14:43:44 -07:00
Chris Lu 37e6263efe fix(shell): attach admin JWT for filer IAM gRPC calls (#9536)
When jwt.filer_signing.key is set, the filer's IamGrpcServer requires
a Bearer token on every IAM RPC. The shell's s3.* IAM commands dialed
without that header and failed with Unauthenticated. Route them through
a small helper that mints a token from the same key, loaded via viper from
security.toml, and appends it as outgoing metadata, matching the credential
grpc_store pattern.
2026-05-18 13:42:32 -07:00
Chris Lu 3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.
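
A minimal sketch of a parametrized full-set check. `hasFullShardSet` is a hypothetical name; the real RequireFullShardSet signature is not shown in this log, and a 32-bit bitmap per server is assumed.

```go
package main

import "fmt"

// hasFullShardSet reports whether the union of the per-server bitmaps
// contains every shard id in [0, totalShards). Bit i set means shard i
// was observed on that server.
func hasFullShardSet(perServer []uint32, totalShards int) bool {
	if totalShards <= 0 || totalShards > 32 {
		return false
	}
	var union uint32
	for _, bits := range perServer {
		union |= bits
	}
	for i := 0; i < totalShards; i++ {
		if union&(1<<uint(i)) == 0 {
			return false // shard i is missing everywhere
		}
	}
	return true
}

func main() {
	// Shards 0-6 on one server, 7-13 on another: a complete 10+4 set.
	fmt.Println(hasFullShardSet([]uint32{0x007F, 0x3F80}, 14))
	// Shard 13 missing: the source volume must not be deleted.
	fmt.Println(hasFullShardSet([]uint32{0x007F, 0x1F80}, 14))
}
```

Passing the total as an argument lets the same check serve any data+parity ratio instead of a fixed 14.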

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeded, the previous code immediately
unmounted and deleted every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu 79859fc21d feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd (#9468)
* feat(s3/versioning): grep-able heal logs + scan-anomaly diagnostics + audit cmd

Three diagnostic additions on top of #9460, all aimed at making the next
production incident faster to triage than the one we just spent hours on.

1. [versioning-heal] grep prefix on every heal-related log line, with a
   small fixed event vocabulary (produced / surfaced / healed / enqueue /
   drain / retry / gave_up / anomaly / clear_failed / heal_persist_failed
   / teardown_failed / queue_full). One grep gives operators a single
   event stream across the produce-to-drain lifecycle.

2. Escalate the "scanned N>0 entries but no valid latest" case in
   updateLatestVersionAfterDeletion from V(1) Infof to a Warning that
   names the orphan entries it saw. This is the listing-after-rm
   inconsistency signature that pinned down 259064a8's failure — it
   should not be invisible at default log levels.

3. New weed shell command `s3.versions.audit -prefix <path> [-v] [-heal]`
   that walks .versions/ directories under a prefix and reports the
   stranded population. With -heal it clears the latest-version pointer
   in place on stranded directories so subsequent reads return a clean
   NoSuchKey instead of replaying the 10-retry self-heal loop.

* fix(s3/versioning): audit pagination, exclusive categories, ctx-aware retry

Address PR review:

1. s3.versions.audit walked only the first 1024-entry page of each
   .versions/ directory, false-positiving "stranded" on large dirs.
   Loop until the page returns < 1024 entries, advancing startName.

2. clean and orphan-only categories double-counted when a directory
   had no pointer and at least one orphan: incremented both. Make them
   mutually exclusive so report totals sum to versionsDirs.

3. retryFilerOp's worst-case ~6.3s backoff was a bare time.Sleep,
   non-interruptible by ctx. A server shutdown / client disconnect
   would wait out the budget per in-flight delete. Thread ctx through
   deleteSpecificObjectVersion -> repointLatestBeforeDeletion /
   updateLatestVersionAfterDeletion -> retryFilerOp; backoff now uses
   a select{<-ctx.Done(), <-timer.C}. HTTP handlers pass r.Context();
   gRPC lifecycle handlers pass the stream ctx.

   New test pins the behavior: cancelling ctx mid-backoff returns
   ctx.Err() in <500ms instead of blocking ~6.3s.

* fix(s3/versioning): clearStale outcome + escape grep-able log fields

Two coderabbit follow-ups:

1. Successful pointer clear should suppress `produced`.
   updateLatestVersionAfterDeletion's transient-rm fallback called
   clearStaleLatestVersionPointer best-effort, then unconditionally
   returned retryErr. The caller (deleteSpecificObjectVersion) saw the
   error and emitted `event=produced` + enqueued the reconciler, even
   though clearStaleLatestVersionPointer had just driven the pointer to
   consistency and the next reader would get NoSuchKey via the
   clean-miss path. Make clearStaleLatestVersionPointer return cleared
   bool; on success the caller returns nil so neither produced nor the
   reconciler enqueue fires. Concurrent-writer aborts, re-scan errors,
   and CAS mismatches still report false so genuinely stranded state
   keeps surfacing.

2. Escape user-controlled fields in heal log lines.
   versioningHealInfof / Warningf / Errorf interpolated raw bucket /
   key / filename / err text into a single-space-separated line. An S3
   key (or error string from gRPC) containing whitespace, newlines, or
   `event=...` could split one event into multiple tokens and spoof
   fake fields downstream. Sanitize each arg in the helper: safe
   values pass through; anything with whitespace, quotes, control
   chars, or backslashes is replaced with its strconv.Quote form. No
   caller changes — the format strings remain unchanged.

Tests pin both behaviors: sanitization table covers the field
boundary cases; an end-to-end shape test confirms a key containing
`event=spoof` stays inside a single quoted token.
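
The sanitization rule in item 2 can be sketched as follows. `sanitizeField` is a hypothetical name, and treating both single and double quotes as unsafe is an assumption; the commit only says "quotes".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// sanitizeField returns s unchanged when it is safe to embed in a
// space-separated log line, and its strconv.Quote form otherwise.
func sanitizeField(s string) string {
	unsafe := strings.IndexFunc(s, func(r rune) bool {
		return unicode.IsSpace(r) || unicode.IsControl(r) ||
			r == '"' || r == '\'' || r == '\\'
	}) >= 0
	if unsafe {
		return strconv.Quote(s)
	}
	return s
}

func main() {
	key := "a.txt event=spoof"
	// The quoted key stays one token, so the embedded event= cannot be
	// mistaken for a real field by a downstream parser.
	fmt.Printf("[versioning-heal] event=produced key=%s\n", sanitizeField(key))
}
```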
2026-05-13 10:48:58 -07:00