1908 Commits

Author SHA1 Message Date
Chris Lu
85ca3cb757 filer: warm-up + fail-closed cooling for POSIX locks on owner (re)start (#9673)
After a (re)start the owner defers would-be grants for posixLockWarmup
while mounts re-assert, trusting only locally-visible conflicts, so it
does not double-grant from empty state; a deferred grant is a retry for
SetLkw and EAGAIN for non-blocking SetLk, never a spurious grant. Cooling
now fail-closes: if the previous owner is unreachable during a ring
change, defer rather than risk a double-grant. readyAt is atomic so the
handler reads it without locking.
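
A minimal sketch of the warm-up gate described above, assuming a hypothetical warmupGate type and a nanosecond clock passed in by the caller (neither name is from the commit). It only illustrates reading readyAt atomically and deferring rather than granting while the warm-up window is open:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// decision is what the owner does with a request that has no locally visible conflict.
type decision int

const (
	grant      decision = iota // warm-up is over: safe to grant
	deferGrant                 // warm-up still running: retry for SetLkw, EAGAIN for SetLk
)

// warmupGate is a hypothetical stand-in for the owner's warm-up state.
type warmupGate struct {
	readyAt atomic.Int64 // unix nanoseconds after which grants are allowed
}

func newWarmupGate(startNs int64, warmup time.Duration) *warmupGate {
	g := &warmupGate{}
	g.readyAt.Store(startNs + int64(warmup))
	return g
}

// decide reads readyAt with an atomic load, so the handler takes no mutex.
func (g *warmupGate) decide(nowNs int64) decision {
	if nowNs < g.readyAt.Load() {
		return deferGrant
	}
	return grant
}

func main() {
	g := newWarmupGate(0, 20*time.Second) // 20s is an example value only
	fmt.Println(g.decide(int64(5*time.Second)) == deferGrant)
	fmt.Println(g.decide(int64(25*time.Second)) == grant)
}
```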
2026-05-25 13:14:05 -07:00
Chris Lu
a3c0baa9b0 filer: cooling-off dual-read for POSIX locks during ring changes (#9672)
While the ring changed within the last snapshot interval, a fresh owner
asks the key's previous owner (LockRing.PriorOwner) whether it still
holds a conflicting lock before granting TRY_LOCK or answering GET_LK, so
it does not double-grant before re-assertion rebuilds its local state.
The probe is marked cooling_probe so the previous owner answers from
local state without recursing. PriorOwner uses the snapshot's prebuilt
ring rather than rebuilding a hash ring per call.
2026-05-25 12:34:15 -07:00
Chris Lu
f8caaa4464 mount,filer: re-assert POSIX locks via keepalive (ownership migration + restart) (#9668)
* mount: renew POSIX lock leases via keepalive

The mount tracks the inode keys it holds locks on and a background loop
renews its session lease (KEEP_ALIVE) with each key's owner filer every
5s, within the filer's 15s TTL. A live mount is never reaped; a dead one
stops renewing and owners reclaim its locks. Tracking is a superset:
holds are added on grant and dropped only on owner release, so a still
held lock is never under-renewed.

* mount,filer: re-assert held POSIX locks via keepalive

The owner filer holds POSIX advisory locks as in-memory soft state, so a key's
owner change (ring rebalance) or an owner restart lost or stranded them: the new
or restarted owner was blind to existing holders and would double-grant.

Make the keepalive carry the mount's held lock ranges per key. The mount mirrors
its own granted locks (posixOwn), and each tick re-asserts them to the key's
current owner, which rebuilds that session's locks from the assertion, self-healing
after a takeover or restart. The owner arbitrates re-asserted locks
against other sessions so it never double-grants; a lock that lost a migration
race is reported, not forced. A bare keepalive (no ranges) still just renews.
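
One way the arbitration could look, as a sketch only. The heldLock type, the conflict rule (overlapping ranges from different sessions where at least one is exclusive) and the reassert function are illustrative assumptions, and real POSIX lock owners are finer-grained than a session id:

```go
package main

import "fmt"

// heldLock is a hypothetical record of one byte-range lock.
type heldLock struct {
	Session    string
	Start, End uint64 // inclusive byte range
	Exclusive  bool
}

// conflicts reports whether two locks from different sessions cannot coexist.
func conflicts(a, b heldLock) bool {
	if a.Session == b.Session {
		return false
	}
	overlap := a.Start <= b.End && b.Start <= a.End
	return overlap && (a.Exclusive || b.Exclusive)
}

// reassert rebuilds one session's locks from its keepalive assertion.
// Ranges that conflict with another session are returned in lost, not forced.
func reassert(table []heldLock, session string, asserted []heldLock) (next, lost []heldLock) {
	for _, l := range table {
		if l.Session != session {
			next = append(next, l) // keep every other session's locks
		}
	}
	others := len(next)
	for _, a := range asserted {
		a.Session = session
		ok := true
		for _, o := range next[:others] {
			if conflicts(a, o) {
				ok = false
				break
			}
		}
		if ok {
			next = append(next, a)
		} else {
			lost = append(lost, a)
		}
	}
	return next, lost
}

func main() {
	table := []heldLock{{Session: "B", Start: 0, End: 99, Exclusive: true}}
	next, lost := reassert(table, "A", []heldLock{
		{Start: 50, End: 60, Exclusive: true},   // overlaps B: reported as lost
		{Start: 200, End: 300, Exclusive: true}, // free range: rebuilt
	})
	fmt.Println(len(next), len(lost))
}
```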
2026-05-25 01:02:45 -07:00
Chris Lu
c97b69f8a4 filer: session lease + reaping for POSIX locks (#9666)
* filer: session lease + reaping for POSIX locks

A mount renews its session lease by keepalive (new KEEP_ALIVE op); the
owner filer records last-seen per session and a background sweeper reaps
the locks of leased sessions that stop renewing — a dead or partitioned
mount. Only sessions that have renewed are leased, so this is inert until
mounts run with -posixLock.

* mount: route POSIX advisory locks to the owner filer (-posixLock) (#9665)

mount: route POSIX advisory locks to the owner filer under -dlm

With -dlm, GetLk/SetLk/SetLkw and the flush/release cleanup paths go to
the inode's owner filer via the PosixLock RPC instead of the local table,
so flock/fcntl are honored across mounts. Advisory locking rides the same
switch as whole-file write coordination — and is therefore off under
writeback cache, which implies single-writer. The mount calls its filer
and relies on filer-side forwarding to reach the owner. Keys are the inode
identity (HardLinkId else path); SetLkw is client-side polling with the
FUSE cancel channel (no server wait queue); a per-mount session id
namespaces owners; a local hint avoids a release RPC on every close.

* mount,filer: bound posix-lock release RPCs and stop the reaper on shutdown

The unlock/release RPCs run off the syscall path (close/flush) and used
context.Background() with no deadline, so a slow or unreachable filer could
hang close() indefinitely; bound them to 5s (they still aren't cancelled by
an interrupt). The lease-reaping sweeper now selects on a stop channel that
FilerServer.Shutdown closes, instead of looping for the process lifetime.
2026-05-25 00:00:59 -07:00
Chris Lu
fef49c2d75 filer: routed PosixLock RPC over the in-memory authority (#9664)
* filer: in-memory POSIX lock authority (Manager)

Concurrent multi-inode authority over the per-inode Set: a Set per opaque
inode key (path, or hl:<HardLinkId>) plus a session->keys index so a dead
mount's locks reap in O(locks held). Lock state stays in memory like the
distributed lock manager's, off the replicated meta-log. TryLock/Unlock/
GetLk/ReleasePosixOwner/ReleaseFlockOwner/ReleaseSession; empty sets and
stale index entries are pruned on release.

* filer: routed PosixLock RPC over the in-memory authority

Adds the PosixLock RPC (try/unlock/get_lk + the flush/release owner
drops) that the owner filer answers from its in-memory Manager. The
request key is the inode identity ring key; a non-owner filer forwards
one hop (is_moved-bounded), mirroring ObjectTransaction, so the owner's
table stays the single authority under a stale ring view. Strictly
non-blocking; SetLkw polling lives in the mount.
2026-05-24 22:50:42 -07:00
Chris Lu
2a4923e7e8 ObjectTransaction: filer-side forwarding via route_key (#9659)
A non-owner filer forwards the whole transaction to the ring owner of route_key, so the owner's per-path lock stays the single serialization point even when the caller's ring view is stale. is_moved bounds forwarding to one hop. The gateway stamps route_key on every routed builder via the shared objectRouteKey helper. Completes taking S3 object mutations off the distributed lock.
2026-05-24 14:21:06 -07:00
Chris Lu
1f0c366583 s3: route metadata-only self-copy off the distributed lock (#9638)
A non-versioned metadata-only self-copy (CopyObject with source == destination
and the REPLACE directive) is a read-modify-write of one entry, which is why it
held the distributed lock. It now routes to the owner as a serialized
PATCH_EXTENDED: the owner merges the new managed metadata (set the replacements,
delete the dropped keys) onto a fresh read of the entry under its per-path lock,
so a concurrent change to non-managed keys (legal hold, retention, version id) is
preserved instead of clobbered, and bumps mtime.

PATCH_EXTENDED gains touch_mtime for the mtime bump. Versioned and suspended
self-copies create a new version (already routed via the copy finalize) and the
no-owner bootstrap keep the lock.
2026-05-24 12:32:57 -07:00
Chris Lu
fa7056dc6f s3: route object-lock version-specific deletes off the distributed lock (#9657)
A version-specific DELETE (real version or the null version, including
object-lock WORM-checked ones and governance-bypass) now runs as one routed
transaction on the object's owner instead of holding the distributed lock.

For a real version: recompute the .versions pointer excluding the version
(repoint-before-delete, so a crash leaves a recoverable orphan rather than a
dangling pointer), then delete the version file, under the object's per-path lock.
The null version is the regular object entry, deleted directly (no pointer).

Object-lock buckets gate the delete on the version's WORM guards evaluated on the
owner: legal hold (always) + retention (while not elapsed). Governance bypass
scopes the retention guard to COMPLIANCE mode, so the filer allows a
governance-mode delete while still denying compliance and legal hold — the
gateway never reads the version.

Three primitives make this expressible:
- ObjectTransaction.condition_key: evaluate the condition against a named entry
  (the version) while the lock stays on lock_key (the object).
- Recompute.exclude_name: omit a child from the scan, to repoint before delete.
- WriteCondition.Clause gate_key/gate_value: scope IF_EXTENDED_TIME_ELAPSED to a
  mode, expressing governance bypass without a gateway-side read.
2026-05-24 11:41:08 -07:00
Chris Lu
db954b5503 s3: route versioned PutObject finalize off the DLM (#9631)
s3: route versioned PutObject finalize off the distributed lock

A versioned write's finalize (flip the .versions pointer to the newest version,
demote the prior latest) now runs as a single RECOMPUTE_LATEST ObjectTransaction
on the object's owner filer, under its per-path lock, instead of the unserialized
updateLatestVersionInDirectory. The version file is written first; the owner
re-derives the pointer by scanning the directory.

RECOMPUTE_LATEST gains size_to_key / mtime_to_key to cache the chosen version's
size and mtime on the pointer, and demote_key / demote_value to stamp the
displaced prior latest (NoncurrentSinceNs for lifecycle) when the pointer moves.

Falls back to updateLatestVersionInDirectory when no owner is known yet.
2026-05-24 03:10:30 -07:00
Chris Lu
f037fc4dce s3: dial the object lock's primary filer directly (#9626)
* s3: dial the object lock's primary filer directly

The S3 object write lock builds a fresh short-lived lock per write, each
starting at the seed filer. When the seed isn't the key's hash-ring primary
the filer forwards the request to the primary, and in multi-cluster setups
that forward crosses clusters on every write.

Give the lock client a view of the filer lock ring, fed by the master's
LockRingUpdate broadcasts the gateway already receives, so it dials the
primary directly. The view tracks filer membership by version; a stale view
stays correct because the filer still forwards as a fallback.

Also send the initial ring snapshot to S3 clients, not just filers.

* s3: subscribe to lock-ring updates before starting the master loop

The master delivers the initial LockRingUpdate once, on connect. Registering the
callback after KeepConnectedToMaster started left a window where that first
update could arrive before the handler was set and be dropped, delaying the ring
view until the next membership change. Build the lock client and register the
callback in the masters block before launching the loop; the filers block reuses
that client (or creates a plain one when no masters are configured).

* lock_manager: build the hash ring in a deterministic server order

rebuildRing ranged over the server set (a map), whose iteration order is
randomized per process. On a vnode hash collision the last writer into
vnodeToServer wins, so two nodes holding the same server set could resolve the
collision to different servers and disagree on the primary for keys near that
slot. Now that the S3 gateway also computes PrimaryForKey, such a disagreement
would route the same key to different filers and defeat per-path serialization.

Iterate the servers in sorted order so the ring is identical on every node with
the same set, regardless of discovery order.
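
The effect of sorting can be sketched as follows. This is not the lock_manager code: buildRing, slotOf and the tiny 8-slot hash space are assumptions chosen so that vnode collisions are guaranteed and the last-writer-wins rule is actually exercised:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

const slots = 8 // deliberately tiny so vnode collisions are guaranteed

func slotOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32() % slots
}

// buildRing places vnodes per server. Servers are visited in sorted order,
// so on a collision the same server wins on every node.
func buildRing(servers map[string]struct{}, vnodes int) map[uint32]string {
	names := make([]string, 0, len(servers))
	for s := range servers { // map order is randomized; only used to collect names
		names = append(names, s)
	}
	sort.Strings(names)
	ring := make(map[uint32]string)
	for _, s := range names {
		for v := 0; v < vnodes; v++ {
			ring[slotOf(fmt.Sprintf("%s#%d", s, v))] = s // last writer wins
		}
	}
	return ring
}

// sameRing reports whether two rings map every slot to the same server.
func sameRing(a, b map[uint32]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if b[k] != v {
			return false
		}
	}
	return true
}

func main() {
	servers := map[string]struct{}{"filer-a:8888": {}, "filer-b:8888": {}, "filer-c:8888": {}}
	fmt.Println(sameRing(buildRing(servers, 16), buildRing(servers, 16)))
}
```

Ranging over the map directly in the placement loop would let two processes resolve the same collision differently, which is the disagreement the commit removes.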

* lock_manager: skip redundant ring rebuilds, trim comments

SetRing now ignores a non-zero version at or below the current one once a ring
exists, so repeated LockRingUpdate broadcasts on reconnect no longer rebuild the
ring.

* s3: hold the lock-ring client on the server for route-by-key

Store the object-write lock client on S3ApiServer so handlers can resolve a
key's owner filer via PrimaryForKey.
2026-05-24 00:40:43 -07:00
Chris Lu
b4d2224e97 filer: let PATCH_EXTENDED replace Entry.content (#9654)
* filer: let PATCH_EXTENDED replace Entry.content

PATCH_EXTENDED merges extended attributes under the per-path lock, reading the
entry fresh, so concurrent patches to different keys don't clobber each other.
Some single-key state lives in Entry.content rather than an extended attribute
(e.g. the S3 bucket metadata blob). Add set_content/content to the mutation so a
patch can replace content the same way -- read fresh, set content, preserve the
rest -- letting a content write and an extended-attribute write on the same
entry serialize on the lock instead of racing whole-entry rewrites.

* Update weed/server/filer_grpc_server.go

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* filer: test set_content FileSize sync; note chosen content-patch approach

Cover the FileSize behavior of a set_content patch: a file's size follows the
new content length (including when it shrinks), a directory's stays zero. Also
document, in the bucket-config design, that extending PATCH_EXTENDED with
set_content is the implemented path for content-backed config.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-23 21:43:43 -07:00
Chris Lu
83195fc111 filer: reuse the caller's fetched entry in CreateEntry (#9645)
CreateEntry starts with a FindEntry to load the current entry. A conditional
CreateEntry already fetched that entry to evaluate the precondition under the
per-path lock, so the create repeated the lookup.

Add an existing *Entry parameter: when non-nil it is used as the current entry
and the internal lookup is skipped; nil keeps the lookup. The gRPC CreateEntry
handler passes the entry it fetched for the precondition, removing the redundant
read while the lock is held. All other callers pass nil.
2026-05-23 21:40:41 -07:00
Chris Lu
091aad59dc filer: add ObjectTransactionBatch for multi-key object writes (#9649)
A multi-object delete spans many keys that route to different owner filers. The
gateway groups keys by owner and sends one batch per owner; the filer applies
each transaction under its own per-path lock, independent of the others.

A failed transaction (precondition or mutation error) is reported in its own
response without aborting the rest, matching S3 multi-object semantics where
each key succeeds or fails on its own. There is no cross-key atomicity, which S3
batch delete does not require.
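
A sketch of the gateway-side grouping and the per-key result reporting, with hypothetical helper names (groupByOwner, applyBatch) and an ownerOf callback standing in for the ring lookup:

```go
package main

import (
	"errors"
	"fmt"
)

var errPrecondition = errors.New("precondition failed")

// groupByOwner buckets keys by the filer that owns them, preserving key order.
func groupByOwner(keys []string, ownerOf func(string) string) map[string][]string {
	groups := make(map[string][]string)
	for _, k := range keys {
		owner := ownerOf(k)
		groups[owner] = append(groups[owner], k)
	}
	return groups
}

// applyBatch applies each key on its own and records each outcome separately.
// A failure does not abort the remaining keys: there is no cross-key atomicity.
func applyBatch(keys []string, apply func(key string) error) map[string]error {
	results := make(map[string]error, len(keys))
	for _, k := range keys {
		results[k] = apply(k)
	}
	return results
}

func main() {
	ownerOf := func(k string) string { // toy routing rule, not the real ring
		if len(k)%2 == 0 {
			return "filer-a"
		}
		return "filer-b"
	}
	for owner, keys := range groupByOwner([]string{"a", "bb", "ccc", "dddd"}, ownerOf) {
		results := applyBatch(keys, func(k string) error {
			if k == "ccc" {
				return errPrecondition
			}
			return nil
		})
		fmt.Println(owner, len(keys), len(results))
	}
}
```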
2026-05-23 21:09:02 -07:00
Chris Lu
e2203b2a0b filer: add extended-attribute guard clauses for object-lock (#9648)
Routing object-lock buckets off the distributed lock needs the retention and
legal-hold check to run atomically with the write, under the per-path lock. Move
just the comparison into the filer, not the S3 semantics: two generic clause
kinds on an extended attribute.

IF_EXTENDED_NOT_EQUAL blocks while extended[ext_key] equals ext_value (a legal
hold). IF_EXTENDED_TIME_ELAPSED blocks while extended[ext_key], read as a
unix-second deadline, is in the future against the filer's clock (retention); a
malformed deadline fails safe. The caller composes these from the object-lock
state and, for a governance bypass, simply omits the retention clause once the
bypass is authorized -- the filer makes no authorization decision and keeps no
S3 knowledge.
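
The two clause kinds could be evaluated roughly like this. The Go types, the attribute names in main and the treatment of an absent deadline as "nothing to wait for" are assumptions; only the blocking rules and the fail-safe on a malformed deadline come from the message above:

```go
package main

import (
	"fmt"
	"strconv"
)

type clauseKind int

const (
	ifExtendedNotEqual clauseKind = iota
	ifExtendedTimeElapsed
)

type clause struct {
	kind     clauseKind
	extKey   string
	extValue string // only used by ifExtendedNotEqual
}

// holds reports whether the clause allows the write to proceed.
func holds(c clause, extended map[string][]byte, nowUnix int64) bool {
	raw, present := extended[c.extKey]
	switch c.kind {
	case ifExtendedNotEqual:
		// Blocks while the stored value equals extValue (for example a legal hold).
		return !present || string(raw) != c.extValue
	case ifExtendedTimeElapsed:
		if !present {
			return true // assumption: no stored deadline means no retention
		}
		deadline, err := strconv.ParseInt(string(raw), 10, 64)
		if err != nil {
			return false // malformed deadline fails safe: keep blocking
		}
		return deadline <= nowUnix // blocks while the deadline is in the future
	}
	return false
}

// allHold is the logical AND over every clause.
func allHold(cs []clause, extended map[string][]byte, nowUnix int64) bool {
	for _, c := range cs {
		if !holds(c, extended, nowUnix) {
			return false
		}
	}
	return true
}

func main() {
	ext := map[string][]byte{"legal-hold": []byte("OFF"), "retain-until": []byte("2000")}
	guards := []clause{
		{kind: ifExtendedNotEqual, extKey: "legal-hold", extValue: "ON"},
		{kind: ifExtendedTimeElapsed, extKey: "retain-until"},
	}
	fmt.Println(allHold(guards, ext, 1000), allHold(guards, ext, 3000))
}
```

For a governance bypass the caller would simply leave the second clause out of the list, as the message describes.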
2026-05-23 19:38:08 -07:00
Chris Lu
e71bac55e9 filer: add RECOMPUTE_LATEST mutation to ObjectTransaction (#9647)
Deleting a specific version that happens to be the latest needs the new latest
re-derived from the remaining versions, and that scan must run under the same
lock as the delete. The gateway can't do it atomically across RPCs.

Add a RECOMPUTE_LATEST mutation: it scans a directory under the transaction
lock, picks the child that sorts last (descending) or first by name, copies the
mapped extended keys from it into a pointer entry, and stores its name under
name_to_key. An empty directory clears the pointer keys. The filer stays
mechanical and S3-agnostic: the caller, which knows the versioning scheme,
supplies the sort direction and the key mappings. A missing pointer entry is a
no-op, so a replayed transaction is idempotent.
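
A sketch of the recompute step over in-memory entries, leaving out the lock and the directory scan. The entry type, the function signature and the key names in main are hypothetical; keyMap maps a child's extended key to the pointer key it is copied into:

```go
package main

import "fmt"

type entry struct {
	Name     string
	Extended map[string][]byte
}

// recomputeLatest re-derives the pointer from the children of a directory.
func recomputeLatest(pointer *entry, children []entry, descending bool, nameToKey string, keyMap map[string]string) {
	if pointer == nil {
		return // missing pointer entry: no-op, so a replay is harmless
	}
	if pointer.Extended == nil {
		pointer.Extended = map[string][]byte{}
	}
	if len(children) == 0 {
		// Empty directory: clear every pointer key this mutation manages.
		delete(pointer.Extended, nameToKey)
		for _, dst := range keyMap {
			delete(pointer.Extended, dst)
		}
		return
	}
	chosen := children[0]
	for _, c := range children[1:] {
		if (descending && c.Name > chosen.Name) || (!descending && c.Name < chosen.Name) {
			chosen = c
		}
	}
	for src, dst := range keyMap {
		if v, ok := chosen.Extended[src]; ok {
			pointer.Extended[dst] = v
		} else {
			delete(pointer.Extended, dst)
		}
	}
	pointer.Extended[nameToKey] = []byte(chosen.Name)
}

func main() {
	pointer := &entry{Name: ".versions"}
	children := []entry{
		{Name: "v1", Extended: map[string][]byte{"etag": []byte("aaa")}},
		{Name: "v3", Extended: map[string][]byte{"etag": []byte("ccc")}},
		{Name: "v2", Extended: map[string][]byte{"etag": []byte("bbb")}},
	}
	recomputeLatest(pointer, children, true, "latest-name", map[string]string{"etag": "latest-etag"})
	fmt.Println(string(pointer.Extended["latest-name"]), string(pointer.Extended["latest-etag"]))
}
```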
2026-05-23 18:29:46 -07:00
Chris Lu
bf022ca018 filer: add ObjectTransaction for atomic multi-entry object writes (#9646)
A versioned object write touches several entries that must change together: the
main object, a delete marker or version file, and the latest pointer on the
.versions directory. Holding a distributed lock across separate RPCs to do this
is what the per-path lock was meant to replace, but a single CreateEntry only
covers one entry.

Add ObjectTransaction: a request carries a lock_key (the object path), an
optional WriteCondition, and an ordered list of mutations (PUT / DELETE /
PATCH_EXTENDED). The filer holds the per-path lock on lock_key for the whole
call, checks the condition against the entry at lock_key, then applies the
mutations in order. Callers route the object's writes to its owner filer so the
lock is authoritative across all of the object's entries.

DELETE and PATCH of an absent entry are no-ops, so a replayed transaction is
idempotent. PUT entries are metadata-scoped; data-bearing writes (chunks) are
written before the transaction, as today.
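
The call shape can be sketched with an in-memory store. Everything here is illustrative: a single mutex stands in for the per-path lock on lock_key, entries are reduced to a map of extended attributes, and the type and function names are not the filer's:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errPreconditionFailed = errors.New("precondition failed")

type opKind int

const (
	opPut opKind = iota
	opDelete
	opPatchExtended
)

type mutation struct {
	kind     opKind
	path     string
	extended map[string]string
}

// condition is evaluated against the entry stored at lock_key.
type condition func(current map[string]string, exists bool) bool

type store struct {
	mu      sync.Mutex // stands in for the per-path lock on lock_key
	entries map[string]map[string]string
}

// transact holds the lock for the whole call: check the condition, then apply in order.
func (s *store) transact(lockKey string, cond condition, muts []mutation) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cond != nil {
		cur, ok := s.entries[lockKey]
		if !cond(cur, ok) {
			return errPreconditionFailed
		}
	}
	for _, m := range muts {
		switch m.kind {
		case opPut:
			e := map[string]string{}
			for k, v := range m.extended {
				e[k] = v
			}
			s.entries[m.path] = e
		case opDelete:
			delete(s.entries, m.path) // absent entry: no-op
		case opPatchExtended:
			if cur, ok := s.entries[m.path]; ok { // absent entry: no-op
				for k, v := range m.extended {
					cur[k] = v
				}
			}
		}
	}
	return nil
}

func main() {
	s := &store{entries: map[string]map[string]string{}}
	mustNotExist := func(_ map[string]string, exists bool) bool { return !exists }
	put := []mutation{{kind: opPut, path: "/b/obj", extended: map[string]string{"etag": "1"}}}
	fmt.Println(s.transact("/b/obj", mustNotExist, put)) // first create succeeds
	fmt.Println(s.transact("/b/obj", mustNotExist, put)) // second fails the condition
}
```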
2026-05-23 17:34:30 -07:00
Chris Lu
b18d3dc96c filer: evaluate a write precondition in CreateEntry (#9650)
Add an optional WriteCondition to CreateEntryRequest. When set, the filer
evaluates it against the current entry while holding the per-path lock, so the
check and the write are atomic on this filer, and returns PRECONDITION_FAILED
when it does not hold. The caller must route the key's writes to the owner filer
for the check to be authoritative.

A condition is a list of clauses that all must hold (logical AND). One clause is
the common case; several express what a single comparison cannot: an ETag set
(If-Match / If-None-Match with multiple values), weak-ETag comparison, and
compound conditions. ETag comparison mirrors the S3 gateway's precedence (stored
Seaweed ETag attribute, then the Md5/chunk fallback) and follows RFC 7232
strong/weak rules, so results match without coupling the filer to S3 handling.
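
For reference, the RFC 7232 comparison rules can be sketched as below: strong comparison requires both tags to be non-weak and identical, weak comparison ignores the W/ prefix. The helper names are hypothetical and the stored-ETag precedence mentioned above is not modeled:

```go
package main

import (
	"fmt"
	"strings"
)

// splitETag separates the optional weak prefix from the quoted opaque tag.
func splitETag(tag string) (opaque string, weak bool) {
	if strings.HasPrefix(tag, "W/") {
		return tag[2:], true
	}
	return tag, false
}

// strongMatch: both tags are strong and their opaque tags are identical.
func strongMatch(a, b string) bool {
	oa, wa := splitETag(a)
	ob, wb := splitETag(b)
	return !wa && !wb && oa == ob
}

// weakMatch: opaque tags are identical, whether or not either is weak.
func weakMatch(a, b string) bool {
	oa, _ := splitETag(a)
	ob, _ := splitETag(b)
	return oa == ob
}

// matchesAny checks a stored ETag against a set of candidates, as an
// If-Match or If-None-Match header with several values would require.
func matchesAny(stored string, candidates []string, cmp func(a, b string) bool) bool {
	for _, c := range candidates {
		if cmp(stored, c) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(strongMatch(`"abc"`, `"abc"`), strongMatch(`W/"abc"`, `"abc"`), weakMatch(`W/"abc"`, `"abc"`))
	fmt.Println(matchesAny(`"abc"`, []string{`"x"`, `"abc"`}, strongMatch))
}
```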

Condition parsing and evaluation live in filer_grpc_server_condition.go.
2026-05-23 16:29:14 -07:00
Chris Lu
bce76e6e21 filer: serialize same-path mutations with a per-path lock (#9639)
CreateEntry is a FindEntry-then-write with no lock, so concurrent creates to the
same path race: OExcl can admit two creators, and a conditional check-then-act
has no atomicity. Add a per-path exclusive lock (util.LockTable, which evicts
idle keys so it stays bounded) on the FilerServer and take it in CreateEntry, so
the existence check and the write are atomic on this filer.

This is the local serialization point that lets callers route a key's writes to
its owner filer and drop the distributed lock for that key. AppendToEntry keeps
its distributed lock for now; it can move to the per-path lock once its callers
route to the owner.
2026-05-23 14:22:42 -07:00
Chris Lu
9021225591 master: accept volume-server Ping targets on follower masters (#9614)
cluster.check asks every master to ping every volume server, but the
Ping gate validated volume-server targets only against the local
topology. Only the leader receives volume-server heartbeats, so a
follower's topology is empty and every probe through it failed with
"unknown ping target ... of type volumeServer".

Fall back to the volume-server set the master learns over its own
MasterClient subscription to the leader, the same source the filer gate
already trusts. The anti-SSRF intent is preserved: Ping still only dials
recognized cluster members.
2026-05-21 10:19:59 -07:00
Chris Lu
5af7d12f04 fix(filer.sync): keep sync_offset fresh while the source is read-only (#9589)
* fix(filer.sync): keep sync_offset fresh while the source is read-only

sync_offset holds the timestamp of the last replicated source event, so
monitoring derives lag from now-sync_offset. A read-only source emits no
metadata events, so the gauge froze at the last write and the derived lag
grew without bound, making thresholds unusable.

The source filer now sends an idle heartbeat carrying its current time
while a subscriber is caught up to the buffer head. filer.sync uses it to
advance the gauge, so now-sync_offset reflects real lag. Heartbeats are
opt-in (client_supports_idle_heartbeat), are never written to the metadata
log, and do not move the resume checkpoint, so a restart still resumes
from the last real event.

* fix(filer.sync): gate idle heartbeat on the read cursor, not SinceNs

In metadata-chunks mode persisted entries replay as log file refs and
never reach eachLogEntryFn, so lastSeenTsNs stays put and a caught-up
subscriber with an old SinceNs would never get a heartbeat. Use the
read cursor (lastReadTime), which advances in that mode too, max'd with
lastSeenTsNs so the in-memory backlog-then-idle case still works while
the cursor returned to the caller has not yet updated.
2026-05-20 11:26:37 -07:00
Chris Lu
77ac781bbd fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers (#9568)
* fix(ec): VolumeEcShardsInfo walks every disk on multi-disk servers

When a volume server holds EC shards for the same vid across more than
one disk, each DiskLocation registers its own EcVolume entry and
Store.FindEcVolume returns whichever one it hits first. The shard-info
RPC iterated only that single EcVolume's Shards, so the response missed
every shard mounted on a sibling disk.

The worker's verifyEcShardsBeforeDelete sums the per-server responses
into a union bitmap and refuses to delete the source volume when the
union falls short of dataShards+parityShards. On multi-disk
destinations, the union was systematically under-counted and source
deletion got blocked even though all shards were physically present and
mounted.

Walk every DiskLocation in the handler and emit the deduplicated union
of all shards. The .ecx-backed fields (file counts, volume size) still
come from a single EcVolume since every disk's entry opens the same
.ecx via NewEcVolume's cross-disk fallback.
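
The union and the full-set check can be sketched with a plain bitmask. The shardBits type and helper names below are illustrative rather than the repository's ShardBits API, and 14 is used as the example total:

```go
package main

import (
	"fmt"
	"math/bits"
)

// shardBits has bit i set when shard id i is present.
type shardBits uint32

func (b shardBits) add(id uint) shardBits { return b | shardBits(1)<<id }

func (b shardBits) count() int { return bits.OnesCount32(uint32(b)) }

// unionAcrossDisks merges the shard ids mounted on every disk of one server.
// A shard reported by two disks is counted once.
func unionAcrossDisks(perDisk [][]uint) shardBits {
	var all shardBits
	for _, ids := range perDisk {
		for _, id := range ids {
			all = all.add(id)
		}
	}
	return all
}

// requireFullShardSet reports whether shard ids 0..total-1 are all present.
func requireFullShardSet(b shardBits, total int) bool {
	want := shardBits(1)<<uint(total) - 1
	return b&want == want
}

func main() {
	disks := [][]uint{
		{0, 1, 2, 3, 4, 5, 6},
		{5, 6, 7, 8, 9, 10, 11, 12, 13},
	}
	firstDiskOnly := unionAcrossDisks(disks[:1])
	everyDisk := unionAcrossDisks(disks)
	fmt.Println(firstDiskOnly.count(), requireFullShardSet(firstDiskOnly, 14))
	fmt.Println(everyDisk.count(), requireFullShardSet(everyDisk, 14))
}
```

Reading only the first disk reproduces the under-count that blocked source deletion; walking both disks yields the full set.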

Tests:
- TestVolumeEcShardsInfo_AggregatesAcrossDisks unit test in
  weed/server/.
- test/volume_server/grpc/ec_verify_multi_disk_test.go integration test
  drives the full generate -> mount -> redistribute -> restart ->
  reconcile path and asserts both VolumeEcShardsInfo and
  VerifyShardsAcrossServers + RequireFullShardSet (the production
  source-deletion gate) report all 14 shards.
- ec_multi_disk_lifecycle_test.go tightened: replaces the
  "VolumeEcShardsInfo only sees one disk's EcVolume" workaround with a
  full-shard-set assertion.

* review: use ShardBits bitmask + cap-pre-allocation for shard dedup
2026-05-19 14:58:56 -07:00
Chris Lu
68794fb94c fix(ec_distribute): remove partial files on copy stream error (#9543)
* fix(ec_distribute): remove partial files on copy stream error

writeToFile opens the destination with O_TRUNC and streams into it. On
a mid-stream receive / write / cancellation error it returned the
failure but left the destination behind in whatever state had been
written so far — typically 0 bytes when the source errored before
sending any FileContent. VolumeEcShardsCopy distributes .ecx by
calling doCopyFile, so this same stub-leaving behaviour produced the
0-byte .ecx files seen on EC encoding failures: the source claims a
non-zero ModifiedTsNs (so the existing "source not found" cleanup
doesn't fire), the stream then errors immediately, and the receiver
ends up with a 0-byte .ecx that downstream code mistook for a valid
empty index.

Clean up the partial file on every error path that returns from the
streaming loop (receive, write, and cancellation). Skip cleanup when
isAppend=true so resumable appends keep their existing content. As
defense in depth, VolumeEcShardsCopy also stats the .ecx after copy
and removes / errors on a 0-byte result so the orchestrator can pick
a different source.

The Rust volume server has only the source side of CopyFile (no
client-side stream-to-disk consumer) and no .ecx subsystem yet, so
this fix has no Rust mirror.

* fix(ec_distribute): close file before remove, fail fast on stat error

Address review feedback:

- writeToFile's mid-stream removeIncomplete called os.Remove while the
  destination file handle was still open. On Windows os.Remove fails
  while a handle is open, so the cleanup wouldn't run there. Wrap the
  handle close in a once-only helper, call it from removeIncomplete
  and from the existing "source not found" cleanup, and keep a deferred
  close as the safety net for the normal-return path.
- VolumeEcShardsCopy's post-copy .ecx check silently passed when
  os.Stat returned an error: doCopyFile had reported success but if
  the file was already gone, unreadable, or somehow a directory, the
  orchestrator only learned at mount time with no useful context.
  Treat any non-nil stat error and any directory result as a copy
  failure here and surface it immediately.
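
A sketch of the close-once-then-remove pattern, assuming a simplified writeToFile that pulls chunks from a callback instead of a gRPC stream. The signature and the partialFileLeftBehind demo are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

var errStream = errors.New("stream broke")

// writeToFile streams chunks into path. next returns io.EOF when the source is done.
func writeToFile(path string, isAppend bool, next func() ([]byte, error)) error {
	flags := os.O_WRONLY | os.O_CREATE | os.O_TRUNC
	if isAppend {
		flags = os.O_WRONLY | os.O_CREATE | os.O_APPEND
	}
	f, err := os.OpenFile(path, flags, 0o644)
	if err != nil {
		return err
	}
	var once sync.Once
	var closeErr error
	closeFile := func() error {
		once.Do(func() { closeErr = f.Close() })
		return closeErr
	}
	defer closeFile() // safety net; a no-op if the file was already closed

	removeIncomplete := func() {
		if isAppend {
			return // resumable appends keep what they already have
		}
		closeFile() // close first: removing an open file fails on Windows
		os.Remove(path)
	}
	for {
		data, rerr := next()
		if rerr == io.EOF {
			return closeFile()
		}
		if rerr != nil {
			removeIncomplete()
			return rerr
		}
		if _, werr := f.Write(data); werr != nil {
			removeIncomplete()
			return werr
		}
	}
}

// partialFileLeftBehind fails a copy mid-stream and reports whether a stub remains.
func partialFileLeftBehind() (bool, error) {
	dir, err := os.MkdirTemp("", "copy-sketch")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	dst := filepath.Join(dir, "1.ecx")
	calls := 0
	werr := writeToFile(dst, false, func() ([]byte, error) {
		calls++
		if calls == 1 {
			return []byte("partial"), nil
		}
		return nil, errStream
	})
	if werr != errStream {
		return false, fmt.Errorf("unexpected error: %v", werr)
	}
	_, statErr := os.Stat(dst)
	return !os.IsNotExist(statErr), nil
}

func main() {
	fmt.Println(partialFileLeftBehind())
}
```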
2026-05-18 15:19:51 -07:00
Chris Lu
6b94701213 mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board

Replaces noisy startup/shutdown logs with a single in-place progress
table on a TTY (or one line per state change off-TTY). Each component
renders as `pending -> starting -> ready` during startup and
`stopping -> stopped` during shutdown, with elapsed time on transition.

Also folds in a few cleanups uncovered while making this readable:

- route the admin.go startup prints through glog so quietMiniLogs()
  filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK
  and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key
  conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under
  the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are
  the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from
  environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist
  (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav /
  s3 / iceberg / admin

* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors

loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3
won't encrypt data under a KEK that the next restart can't reproduce
(which would orphan whatever was written this run). The caller already
treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM
just stay disabled for this run.

startAdminServer's serve goroutine used to only log ListenAndServe
failures, so a bind error left the caller blocked on ctx.Done() with
no listener. Forward the error through a buffered channel and select
on it alongside ctx.Done().
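
The channel pattern in isolation, as a sketch: waitServe and the two demo helpers are hypothetical names, and serve stands in for the ListenAndServe call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

var errBind = errors.New("listen: address already in use")

// waitServe runs serve in a goroutine and returns on whichever comes first:
// a serve error or context cancellation.
func waitServe(ctx context.Context, serve func() error) error {
	errCh := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { errCh <- serve() }()
	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoBindFailure shows a bind error reaching the caller instead of only being logged.
func demoBindFailure() error {
	return waitServe(context.Background(), func() error { return errBind })
}

// demoCancelled shows the normal shutdown path while serve is still running.
func demoCancelled() bool {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	block := make(chan struct{})
	err := waitServe(ctx, func() error { <-block; return nil })
	return errors.Is(err, context.Canceled)
}

func main() {
	fmt.Println(demoBindFailure())
	fmt.Println(demoCancelled())
}
```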

* ci(s3-proxy-signature): match weed mini's new progress-board ready line

The readiness probe grepped for "S3 (gateway|service).*(started|ready)",
which matched weed mini's old "S3 service is ready at ..." line. Mini
now emits "  S3           ready (Xs)" from its progress board, so the
old pattern misses and the test timed out at the 30-second wait.

Widen the alternation to also accept "S3\s+ready". The curl HEAD
fallback already covers any remaining cases.
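
The widened pattern can be checked against both line shapes. The CI probe itself is a grep; the Go translation below, the isReady helper and the exact sample lines are assumptions made for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// ready accepts the old readiness line and the new progress-board line.
var ready = regexp.MustCompile(`S3 (gateway|service).*(started|ready)|S3\s+ready`)

func isReady(line string) bool { return ready.MatchString(line) }

func main() {
	fmt.Println(isReady("S3 service is ready at localhost"))
	fmt.Println(isReady("  S3           ready (3s)"))
	fmt.Println(isReady("  S3           starting"))
}
```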
2026-05-17 19:13:09 -07:00
Chris Lu
2a41e76101 fix(ec): blanket-clean every destination over the full shard range (#9512)
* fix(ec): blanket-clean every destination over the full shard range

The previous cleanup pass walked t.sources only, with the shard ids the
topology had reported at detection time. In the wild, a destination can
end up with EC shards mounted that the topology snapshot didn't list —
shards on a sibling disk that hadn't heartbeated, or shards left over
from a concurrent attempt's mount step. FindEcVolume still returns
true, so the next ReceiveFile trips the mounted-volume guard.

Cleanup now unions t.sources (with ShardIds) and t.targets and issues
unmount + delete over [0..totalShards-1] on each. Both RPCs are
idempotent on missing shards, so the wider sweep is free.

Two new tests cover the gap: shards mounted beyond what t.sources
lists, and a target-only destination with no source row.

* log(ec): include disk_id in EC unmount/delete/refusal log lines

The current logs identify the volume and shard but leave disk_id off,
which makes the cross-server cleanup story hard to follow when
multiple disks of one server hold pieces of the same volume:

  UnmountEcShards 4121.1                              -> add disk_id
  ec volume video-recordings_4121 shard delete [1 5]  -> add per-loc disk_id
  volume server X:Y deletes ec shards from 4121 [...] -> add disk_id
  ReceiveFile: ec volume 4121 is mounted; refusing... -> add disk_ids

ReceiveFile's refusal now names the disk_ids actually holding the
mount so operators can see whether the next cleanup pass needs to
target a sibling disk. Added Store.FindEcVolumeDiskIds /
Store::find_ec_volume_disk_ids as the supporting primitive.

Mirrored in seaweed-volume/src/ (unmount log in Store::unmount_ec_shard,
heartbeat delete log in diff_ec_shard_delta_messages, refusal in the
ReceiveFile handler).

* test(ec): stub VolumeEcShardsUnmount/Delete on the fake volume server

The plugin-worker EC tests boot a fake volume server that embeds
UnimplementedVolumeServerServer. After the worker started calling
VolumeEcShardsUnmount + VolumeEcShardsDelete pre-distribute, the
default Unimplemented response surfaced as fourteen "method not
implemented" errors and TestErasureCodingExecutionEncodesShards
failed. Both RPCs are no-ops here — nothing on the fake server has
mounted state or persisted shard files to remove.
2026-05-17 11:31:37 -07:00
Chris Lu
62821964dd filer/iam-grpc: make admin Bearer auth opt-in (fixes #9509) (#9514)
PR #9442 made the filer refuse to register the IAM gRPC service unless
jwt.filer_signing.key was set in security.toml, which broke the admin
UI Users/Groups/Policies pages for every deployment that ships without
a security.toml — weed mini, plain Helm, vanilla weed filer. The Users
tab returns Unimplemented and the page is unusable. Issues #9504,
#9505 and #9509 all trace to this gap.

The rest of the filer's gRPC surface is unauthenticated by default;
treat IAM the same way. The service now always registers, and the
auth gate is a no-op when no signing key is configured. When the key
is set, every RPC still requires an admin-signed Bearer token, matching
the post-#9442 behaviour. Operators who expose the filer gRPC port
beyond a trusted network should set the key on both filer and admin.

The admin client (IamGrpcStore.withIamClient) already skips attaching
the authorization metadata when its key is empty, so no changes there.
2026-05-15 13:15:20 -07:00
Chris Lu
bfb2661fec fix(tests): make 32-bit GOARCH tests build and run (#9507)
fix(tests): make 32-bit GOARCH tests build and run (#9503)

verifyTestFilerClient had bare int64 atomic counters after a map header,
so atomic.AddInt64 panicked with "unaligned 64-bit atomic operation" on
linux/386. Switch to atomic.Int64, which the stdlib guarantees is
8-byte aligned on all platforms.

rpc_version_filter_test.go passed the untyped constant 0xdeadbeef to
t.Errorf, where it default-promoted to int and overflowed 32-bit int.
Bind it to a typed uint32 const used in both the comparison and the
error message.
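
Both fixes in miniature, as a sketch with hypothetical names (fakeClient, marker, describe). atomic.Int64 carries its own alignment guarantee, and a typed uint32 constant never passes through int on its way into a format call:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// fakeClient places a 64-bit counter after a map header. With a bare int64
// field this layout can be misaligned on 32-bit platforms; atomic.Int64 is
// guaranteed to be 8-byte aligned.
type fakeClient struct {
	entries map[string]string
	calls   atomic.Int64
}

// marker is typed, so it is passed as uint32 instead of defaulting to int,
// which overflows where int is 32 bits wide.
const marker uint32 = 0xdeadbeef

func describe(got uint32) string {
	return fmt.Sprintf("got %#x, want %#x", got, marker)
}

func main() {
	c := &fakeClient{entries: map[string]string{}}
	c.calls.Add(1)
	fmt.Println(c.calls.Load(), describe(marker))
}
```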
2026-05-14 20:55:37 -07:00
Chris Lu
3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.
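
As a rough illustration of a gate parametrized on the shard count, the sketch below unions per-server bitmaps and checks that every id is present. The function name, signature, and the use of `uint32` bitmaps are assumptions for this example and may differ from the real helper:

```go
package main

import (
	"fmt"
	"math/bits"
)

// requireFullShardSet is a hypothetical sketch of the pre-delete gate: it
// unions the per-server shard bitmaps and fails unless every shard id in
// [0, totalShards) was observed on at least one server.
func requireFullShardSet(perServer []uint32, totalShards int) error {
	if totalShards <= 0 || totalShards > 32 {
		return fmt.Errorf("unsupported totalShards %d", totalShards)
	}
	var union uint32
	for _, b := range perServer {
		union |= b
	}
	// Ignore any bits at or above totalShards before counting.
	mask := uint32(1)<<uint(totalShards) - 1
	seen := bits.OnesCount32(union & mask)
	if seen != totalShards {
		return fmt.Errorf("only %d of %d shards observed; keeping source volume", seen, totalShards)
	}
	return nil
}

func main() {
	// Two servers holding shards 0-6 and 7-13 of a 10+4 layout.
	fmt.Println(requireFullShardSet([]uint32{0x007f, 0x3f80}, 14))
}
```

A caller would pass 14 for the default 10+4 layout and a different total for any other data/parity ratio.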

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu
d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.
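
The fallback convention mentioned above can be sketched like this. `grpcAddress` is a hypothetical helper written for illustration and is not the project's ServerToGrpcAddress:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// grpcAddress prefers an explicitly advertised gRPC port and only falls back
// to the httpPort+10000 convention when none was provided (grpcPort == 0).
func grpcAddress(host string, httpPort, grpcPort int) string {
	if grpcPort == 0 {
		grpcPort = httpPort + 10000
	}
	return net.JoinHostPort(host, strconv.Itoa(grpcPort))
}

func main() {
	fmt.Println(grpcAddress("vol1", 8080, 0))     // convention applies
	fmt.Println(grpcAddress("vol1", 8080, 19333)) // explicit port wins
}
```

When a lookup response omits the port, the caller can only take the first branch, which is why a custom port layout ends up dialing the wrong address.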

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu
f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
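
One possible shape for the narrower check suggested above, offered only as a sketch: every name here is hypothetical, and the resolver is injected so the example runs without DNS (`net.LookupHost` has a compatible signature).

```go
package main

import (
	"fmt"
	"net"
)

// stubLookup stands in for a real resolver; the table contents are made up.
func stubLookup(host string) ([]string, error) {
	table := map[string][]string{"volume": {"172.18.0.3"}}
	if addrs, ok := table[host]; ok {
		return addrs, nil
	}
	return nil, fmt.Errorf("no such host %q", host)
}

// heartbeatMatchesPeer accepts a claimed address when it is the peer IP
// itself or a hostname that resolves to the peer IP.
func heartbeatMatchesPeer(claimed, peer string, lookup func(string) ([]string, error)) bool {
	peerIP := net.ParseIP(peer)
	if peerIP == nil {
		return false
	}
	if ip := net.ParseIP(claimed); ip != nil {
		return ip.Equal(peerIP)
	}
	addrs, err := lookup(claimed)
	if err != nil {
		return false // fail closed when the name cannot be resolved
	}
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil && ip.Equal(peerIP) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(heartbeatMatchesPeer("volume", "172.18.0.3", stubLookup))
}
```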
2026-05-12 18:22:21 -07:00
Chris Lu
43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
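
The effect of the restored `continue` can be sketched as below. `parseCIDRs` and `isWhiteListed` are simplified, hypothetical stand-ins for the guard's methods:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// parseCIDRs keeps only entries that parse as CIDR ranges.
func parseCIDRs(entries []string) []*net.IPNet {
	var nets []*net.IPNet
	for _, e := range entries {
		if !strings.Contains(e, "/") {
			continue // plain IPs are handled elsewhere in this sketch
		}
		_, n, err := net.ParseCIDR(e)
		if err != nil {
			// Without this continue, the nil *net.IPNet would be stored
			// and dereferenced later by Contains.
			continue
		}
		nets = append(nets, n)
	}
	return nets
}

// isWhiteListed reports whether remote falls inside any stored range.
func isWhiteListed(nets []*net.IPNet, remote string) bool {
	ip := net.ParseIP(remote)
	if ip == nil {
		return false
	}
	for _, n := range nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	nets := parseCIDRs([]string{"10.0.0.0/8", "not/a-cidr"})
	fmt.Println(len(nets), isWhiteListed(nets, "10.1.2.3"))
}
```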
2026-05-12 16:00:44 -07:00
Chris Lu
f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu
10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.
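
The per-address refcount idea from the vid_map item can be sketched as follows. The type and method names are invented for this example and the real structure also needs locking:

```go
package main

import "fmt"

// serverRefs counts how many volume locations reference each server address.
type serverRefs struct {
	counts map[string]int
}

func newServerRefs() *serverRefs {
	return &serverRefs{counts: map[string]int{}}
}

// add skips empty addresses so a zero-value location never creates a
// permanent counts[""] bucket.
func (r *serverRefs) add(addr string) {
	if addr == "" {
		return
	}
	r.counts[addr]++
}

// remove drops the entry entirely once the last reference is gone.
func (r *serverRefs) remove(addr string) {
	if addr == "" {
		return
	}
	if r.counts[addr] <= 1 {
		delete(r.counts, addr)
		return
	}
	r.counts[addr]--
}

// has answers membership in O(1) without scanning every volume location.
func (r *serverRefs) has(addr string) bool {
	return r.counts[addr] > 0
}

func main() {
	r := newServerRefs()
	r.add("10.0.0.5:8080")
	r.add("")
	fmt.Println(r.has("10.0.0.5:8080"), len(r.counts))
}
```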

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu
21054b6c18 volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.

For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.

Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.

The gate also tolerates a nil guard (development / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
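
A simplified sketch of the fail-closed rule for TCP peers, assuming a whitelist of plain IP strings. `isAdminAuthorized` here is a hypothetical free function, not the Guard method, and it does not model the non-TCP bypass or CIDR entries:

```go
package main

import (
	"fmt"
	"net"
)

// isAdminAuthorized denies by default: with an empty whitelist only loopback
// callers pass, and with a populated whitelist the caller must be listed.
func isAdminAuthorized(whiteList []string, remote string) bool {
	ip := net.ParseIP(remote)
	if ip == nil {
		return false
	}
	if len(whiteList) == 0 {
		return ip.IsLoopback() // covers 127.0.0.0/8 and ::1
	}
	for _, w := range whiteList {
		if allowed := net.ParseIP(w); allowed != nil && allowed.Equal(ip) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isAdminAuthorized(nil, "127.0.0.1"))
	fmt.Println(isAdminAuthorized(nil, "172.18.0.2"))
}
```

The second call is the multi-host case: a master at a non-loopback address is rejected unless the operator lists it.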
2026-05-12 12:35:27 -07:00
Chris Lu
69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
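
The classification step of such a deny list might look like the sketch below. `isInternalHost` is a hypothetical name, and only the address check is shown; the same predicate would be applied again to whatever address each dial actually resolves to:

```go
package main

import (
	"fmt"
	"net"
)

// cgnat is the shared address space from RFC 6598.
var cgnat = mustCIDR("100.64.0.0/10")

func mustCIDR(s string) *net.IPNet {
	_, n, err := net.ParseCIDR(s)
	if err != nil {
		panic(err)
	}
	return n
}

// isInternalHost reports whether addr is an IP that should be refused as a
// remote endpoint. Unparseable input is refused as well.
func isInternalHost(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return true
	}
	return ip.IsLoopback() ||
		ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || // includes 169.254.169.254 (IMDS)
		ip.IsPrivate() || // RFC 1918 and IPv6 unique local
		cgnat.Contains(ip)
}

func main() {
	for _, a := range []string{"127.0.0.1", "169.254.169.254", "10.1.2.3", "100.64.1.1", "8.8.8.8"} {
		fmt.Println(a, isInternalHost(a))
	}
}
```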
2026-05-12 10:11:20 -07:00
Chris Lu
5e8f99f40a filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires
a Bearer token in the authorization metadata, signed with the filer
write-signing key. The service refuses to register on a filer that has
no jwt.filer_signing.key set, so the unauthenticated default is gone:
operators who use these RPCs must configure the key and attach a token
on every call.

Bearer scheme matching is case-insensitive (RFC 6750), every handler
nil-checks req before dereferencing it, and tests now cover the
expired-token path.
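
Case-insensitive scheme matching can be sketched as follows; `bearerToken` is a hypothetical helper for illustration, not the function in the filer:

```go
package main

import (
	"fmt"
	"strings"
)

// bearerToken extracts the token from an authorization value, accepting any
// casing of the "Bearer" scheme.
func bearerToken(header string) (string, bool) {
	const scheme = "bearer "
	if len(header) < len(scheme) || !strings.EqualFold(header[:len(scheme)], scheme) {
		return "", false
	}
	tok := strings.TrimSpace(header[len(scheme):])
	return tok, tok != ""
}

func main() {
	fmt.Println(bearerToken("bEaReR abc.def.ghi"))
	fmt.Println(bearerToken("Basic dXNlcjpwYXNz"))
}
```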
2026-05-12 10:11:08 -07:00
Chris Lu
05ed5c9ae8 filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token
scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar
sibling paths. Match on /-separated path components after path.Clean
normalisation instead.
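
A minimal sketch of component-wise prefix matching, assuming a hypothetical `prefixAllows` helper rather than the filer's actual function:

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// prefixAllows reports whether target is the prefix itself or lies beneath it
// on a path-component boundary, after normalising both with path.Clean.
func prefixAllows(prefix, target string) bool {
	prefix = path.Clean("/" + prefix)
	target = path.Clean("/" + target)
	if prefix == "/" {
		return true
	}
	return target == prefix || strings.HasPrefix(target, prefix+"/")
}

func main() {
	fmt.Println(prefixAllows("/tenant1", "/tenant1/docs/a.txt"))
	fmt.Println(prefixAllows("/tenant1", "/tenant1234/a.txt"))
}
```

Appending the separator before the prefix test is what stops /tenant1 from matching /tenant1234, and cleaning first keeps `..` segments from stepping outside the scope.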
2026-05-12 10:10:48 -07:00
Chris Lu
532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, master heartbeats wrongly reported under the target
disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.
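
The normalisation described above amounts to something like this sketch. The body of `normalizeDiskType` is inferred from the description, and `sameDiskType` is a hypothetical wrapper added for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeDiskType lowercases the value and folds the empty string, which
// stands for the default hard-drive type, into "hdd".
func normalizeDiskType(t string) string {
	t = strings.ToLower(t)
	if t == "" {
		return "hdd"
	}
	return t
}

// sameDiskType compares two disk types after normalisation.
func sameDiskType(a, b string) bool {
	return normalizeDiskType(a) == normalizeDiskType(b)
}

func main() {
	fmt.Println(sameDiskType("hdd", "")) // a raw string comparison would say false
	fmt.Println(sameDiskType("SSD", "ssd"))
}
```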

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
2026-05-11 20:21:50 -07:00
Chris Lu
b2d24dd54f volume: require admin auth on BatchDelete (#9438)
Run BatchDelete through checkGrpcAdminAuth like the other destructive
volume-server RPCs (VolumeDelete, DeleteCollection, vacuum, EC, ...),
so a whitelist-configured server denies non-admin callers.
2026-05-11 13:50:48 -07:00
Chris Lu
2b21d19e4c volume: require admin auth on ReadAllNeedles and VolumeNeedleStatus (#9437)
Both RPCs hand out raw needle bytes / cookies. Run them through
checkGrpcAdminAuth like the rest of the volume-server admin handlers.
2026-05-11 13:50:19 -07:00
Minsoo Kim
a1e5eb9dad Fix UI prefix url encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes

* Clarify filer UI path escaping test name

Rename the legacy filer UI path test to describe the actual behavior
being checked.

The printpath helper preserves timestamp characters that are valid in
URL path components, while the PR fix is focused on query-string
escaping for path and cursor parameters.
2026-05-06 19:14:36 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.
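
The index convention can be made concrete with a short sketch; `parityShardIDs` is a hypothetical helper, and 10 and 14 are the default layout values used elsewhere in this log:

```go
package main

import "fmt"

// parityShardIDs returns the ids dataShards..totalShards-1, which are the
// parity shards when data shards occupy 0..dataShards-1.
func parityShardIDs(dataShards, totalShards int) []int {
	if totalShards <= dataShards {
		return nil
	}
	ids := make([]int, 0, totalShards-dataShards)
	for id := dataShards; id < totalShards; id++ {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	fmt.Println(parityShardIDs(10, 14))
}
```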

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
2417ba0354 fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints

Three destructive VolumeServer gRPC endpoints (DeleteCollection,
VolumeDelete, VolumeServerLeave) had no authentication checks, unlike
their HTTP counterparts which are protected by the Guard whitelist.

Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth
helper on VolumeServer that extracts the peer IP from gRPC context and
validates it against the guard whitelist. Gate all three endpoints
behind this check.

* fix(volume): tolerate unparseable gRPC peer address in admin auth check

S3 Filer Group integration tests were failing with
PermissionDenied "bad peer address: address @: missing port in address"
when DeleteCollection ran across the in-process gRPC connection
between filer and volume server — the peer addr surfaces as "@" there
and net.SplitHostPort can't parse it. The check rejected before
IsWhiteListed could exercise its allow-all path for empty-whitelist
deployments.

Hand the raw peer string to IsWhiteListed when SplitHostPort fails.
With no whitelist configured (the test environment's mode) it accepts;
with a whitelist configured the unparseable host won't match anything
and the call still gets denied as it should.
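
The fallback described above can be sketched with a small helper. `peerHost` is a hypothetical name; the point is only that the raw string is passed on when it cannot be split:

```go
package main

import (
	"fmt"
	"net"
)

// peerHost returns the host part of a peer address, or the raw string when
// it has no host:port shape (for example "@" on an in-process connection).
func peerHost(addr string) string {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return addr
	}
	return host
}

func main() {
	fmt.Println(peerHost("10.0.0.5:18080"))
	fmt.Println(peerHost("@"))
}
```

The raw value is then matched against the whitelist like any other host, so it is accepted only when no whitelist is configured.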

Adds three regression tests for IsWhiteListed pinning the empty-config
allow-all, populated-list reject-unknown, and signing-key-only allow-
all branches that the gRPC admin helper relies on.

* refactor(security): dedup checkWhiteList through IsWhiteListed

The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the
same lookup logic in two places; future drift was just a matter of
time. Have checkWhiteList delegate so the membership semantics live
in exactly one function.

Behaviour is unchanged: the new path still returns nil for
isEmptyWhiteList (signing-key-only mode) and still rejects unknown
hosts when a whitelist is configured.

Addresses gemini medium review on PR #8876.

* fix(volume): protect remaining state-altering gRPC admin endpoints

DeleteCollection, VolumeDelete, and VolumeServerLeave were the
truly-destructive endpoints, but AllocateVolume, VolumeMount,
VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and
VolumeMarkWritable also modify server state and should sit behind
the same whitelist gate. Read-only endpoints (VolumeStatus,
VolumeServerStatus, VolumeNeedleStatus, Ping) stay open.

The check is a no-op when no whitelist is configured (the default),
so existing deployments keep working; operators who lock down their
volume servers via guard.white_list now get consistent coverage.

Addresses gemini security-high review on PR #8876.

* fix(volume): typed peer addr + audit log for gRPC admin auth

Prefer a typed *net.TCPAddr when extracting the peer IP — string
parsing was already a fallback for the in-process case but using the
typed form first is cleaner and skips an unnecessary parse on the
common path. Log failed authorization attempts at V(0) so an operator
running with a whitelist sees the host that was rejected (and the
raw remote address in case the IP lookup itself was the failure
mode), matching what the HTTP Guard already does.

Addresses gemini medium review on PR #8876.

* fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints

Five more master/admin-driven destructive operations live outside
volume_grpc_admin.go and were missing the same whitelist gate:

- VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup
- ScrubVolume
- VolumeEcShardsDelete

VacuumVolumeCheck stays open (read-only). BatchDelete also stays
open: it's the data-plane multi-object delete called from the S3 API
and filer, not an admin operation; gating it would break ordinary S3
DeleteObjects calls.

Addresses gemini security-high review on PR #8876.

* fix(volume): simplify no-peer-info branch in gRPC admin auth

The IsWhiteListed("") fallback was defending against a scenario
that doesn't actually arise — real gRPC connections always populate
peer info. Drop the branch and just deny when peer info is missing,
which is the safer default and matches "if we don't know who the
caller is, refuse".

* fix(volume-rust): mirror gRPC admin auth on the rust volume server

The rust volume server has the same set of destructive admin
endpoints as the Go side and the same Guard infrastructure, but
nothing was wired together — every endpoint accepted unauthenticated
calls regardless of guard configuration. Same vulnerability class
the Go fix on this PR closes; this commit closes it on the rust
side too so the two stacks stay aligned.

Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer
SocketAddr off the tonic Request and runs Guard::check_whitelist on
its IP, then applies the helper to the same set the Go side covers:
DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount,
VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable,
VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit,
VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume,
VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete
stays open as a data-plane multi-object delete.
2026-05-04 21:14:55 -07:00
Chris Lu
d265274e13 fix(nfs): accept dirpath anywhere under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy

weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.

Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.

Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.

* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export

Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.

  - dirpath equals the export root: serve the export root.
  - dirpath strictly under the export, directory entry: serve that
    subdirectory; the returned filehandle encodes its inode.
  - dirpath strictly under the export, missing or non-directory: reject
    with NoEnt or NotDir.
  - dirpath outside the export: keep the rclone-style fallback to the
    export root.
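
The path-only part of the rules above can be sketched as follows. This
is an illustration, not the actual resolveMountFilesystem: the names
are hypothetical, the filer lookup that yields NoEnt/NotDir is left
out, and treating non-absolute dirpaths as outside the export is an
assumption.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

type mountTarget int

const (
	targetExportRoot mountTarget = iota // dirpath equals the export root
	targetSubdir                        // dirpath strictly under the export
	targetFallback                      // dirpath outside: serve the export root
)

// classifyDirpath decides which root a MOUNT3 dirpath maps to, by path
// shape only. A targetSubdir result would still need a filer lookup to
// confirm the entry exists and is a directory.
func classifyDirpath(export, dirpath string) (mountTarget, string) {
	export = path.Clean(export)
	dirpath = path.Clean(dirpath)
	if dirpath == export {
		return targetExportRoot, export
	}
	prefix := export
	if prefix != "/" {
		prefix += "/"
	}
	if strings.HasPrefix(dirpath, prefix) {
		return targetSubdir, dirpath
	}
	return targetFallback, export
}

func main() {
	for _, d := range []string{"/buckets", "/buckets/data", "/buckets/", "/other", ""} {
		target, root := classifyDirpath("/buckets", d)
		fmt.Printf("%q -> %d %s\n", d, target, root)
	}
}
```

Appending "/" to the export before the prefix test keeps a sibling such
as /buckets2 from being mistaken for a child of /buckets.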

TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.

The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.

* nfs: propagate request context through MOUNT3 resolution

Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.

lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().

Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.

* nfs: mirror TCP defensive checks on the UDP MOUNT3 path

Two transport-parity bugs caught by CodeRabbit in review:

(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.

(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.

Lower the successful subexport log to V(1) to match the TCP side.

* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles

The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.

* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem

Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.

* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers

Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:

  - Both transports observe the same chunk reader cache and chunk
    invalidator without depending on call order during startup.
  - The UDP defensive Lstat doesn't allocate a fresh wrapper per
    MOUNT request anymore; one struct lives for the life of the
    Server.

The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.

* nfs: drive rootFilesystem before reading sharedReaderCache on UDP

The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.

Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.

* nfs: collapse exact-match and outside-export MOUNT branches

The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
2026-04-30 10:06:44 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT, which users see as the "requested NFS version or
transport protocol is not supported" error reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression in either the #9262 fix or this PR's UDP MOUNT change
is traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results and making it possible to build
alerts and dashboards that monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level and are
not intended for fine-grained tracking of scrubbing operations.
2026-04-28 12:34:02 -07:00
Chris Lu
e2c8791441 fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter

The upstream willscott/go-nfs library dispatches RPC calls by (program,
procedure) only — it does not validate the program version. A client
sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same
handler map as NFSv3 and gets routed to v3 SETATTR, which parses the
COMPOUND args as SETATTR3args and writes a malformed reply. The kernel
then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version
or transport protocol is not supported" without retrying v3.

This commit adds a listener wrapper that peeks the first RPC frame on
each new TCP connection. If the program is NFS or MOUNT and the version
is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported
range 3..3, per RFC 5531) directly to the socket and closes the
connection. v3 frames are replayed unchanged via a bufio reader so go-nfs
sees the original bytes. Unknown programs pass through so go-nfs's own
PROG_UNAVAIL handling stays in charge.

The filter is not yet wired into the server; the next commit activates
it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and
unknown-program pass-through.

* fix(nfs): wire NFSv3 version filter into the listener chain

Place the version filter after the optional client allowlist so that
unauthorized peers are still rejected first by IP/CIDR before we look at
RPC content. With the filter active, a Linux client doing the default
v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which
lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3
mountOptions that already work for rclone serve nfs against this
deployment.

* test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1

TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending
mountProgramID with version 1, so the test never actually covered the
"reject MOUNT v4" path it claims to exercise. The filter does reject any
non-v3 version uniformly, so the test still passed, but a future change
that tightened the version check (for example, only rejecting v4) would
let this test silently lie about coverage. Bump the call to version 4 so
the name matches what is actually exercised.

* refactor(nfs): reuse package RPC constants and io.ReadFull in version filter

The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH,
AUTH_NONE, the NFS/MOUNT program numbers) are already named in
portmap.go alongside the portmap responder. Reuse them here instead of
defining a parallel set in rpc_version_filter.go: keeping one source of
truth per package means a future correction in one spot can't drift away
from the other. The filter-only constants (peek timeout, peek length,
supportedNFSVer) stay local because they have no portmap analog.

In the test, drop the bespoke readFull loop in favor of io.ReadFull.
The custom version was a near-identical reimplementation that did not
return io.ErrUnexpectedEOF on short reads, so the standard library is
both shorter and more diagnostic-friendly.

* fix(nfs): move RPC peek off the Accept path

The previous wrapper called filterFirstRPCFrame inline inside
versionFilterListener.Accept, which meant a single slow or idle TCP
connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line
blocking against every other accept: gonfs.Serve calls Accept serially,
so each in-flight peek stalled the next legitimate client until the
deadline expired. An attacker who simply opens a TCP connection without
sending any RPC payload could trivially throttle accept throughput.

Restructure the wrapper so a background goroutine drives the inner
Accept loop and hands each raw conn to its own short-lived goroutine
that runs the peek. Validated conns are sent on a buffered-once channel,
which the wrapper's Accept reads from; rejected conns finish their
PROG_MISMATCH reply and disappear without ever reaching the channel.
This means N concurrent slow clients only block themselves, not the
N+1th fast client that connects after them.

Add Close coordination — sync.WaitGroup for the accept loop and per-conn
peek goroutines, plus a closed channel so Accept unblocks immediately on
shutdown — so the wrapper now satisfies the full net.Listener contract
instead of relying on the embedded listener.

Add a regression test that opens a slow conn (TCP only, never writes)
and a fast conn (sends a v3 frame) and asserts the fast conn reaches
the inner accept handler well below the peek timeout.

* test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close

The post-rejection check was only failing when conn.Read succeeded; any
error — including a deadline timeout because the server kept the socket
open — let the test pass. That defeats the point of the assertion: a
regression where the filter replies but forgets to close would slip
through silently.

Match against io.EOF explicitly. The TCP semantics are deterministic
here: the server writes PROG_MISMATCH, calls conn.Close(), the client
reads what's left in flight and then sees a clean FIN, which surfaces
as io.EOF on the next zero-byte read.

* fix(nfs): reject short first fragments before parsing RPC header fields

bufio.Reader.Peek(28) is willing to read across record boundaries to
satisfy the requested length, so a final fragment whose body is shorter
than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog
+ vers + proc) leaves the trailing peek bytes pointing at the next
RPC's framing or whatever bytes happen to follow on the wire. Indexing
hdr[16:24] for prog/vers in that state can spuriously reject (or pass
through) traffic based on data that doesn't belong to the request being
classified.

Drop those frames out of the filter early: if the first fragment can't
possibly hold a full CALL header, pass the connection straight to
go-nfs, which has its own framing-error handling for malformed input.

Add a regression test that crafts a 12-byte first fragment whose
trailing peek bytes are deliberately shaped like an NFSv4 CALL — without
the length check the filter sends a PROG_MISMATCH; with it, the conn
passes through silently. Verified by stashing the production-code change
and running the test in isolation: it fails as expected without the fix.

* fix(nfs): retry transient Accept() errors instead of treating any error as terminal

acceptLoop previously exited on the first error returned by the inner
listener's Accept(). That conflates two very different failure modes:
permanent shutdown (the listener was Close()d, OS-level fatal failure)
and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on
accept). The transient case should not take the entire NFS server down
— a single fd-table-full event would leave the deployment offline until
restart.

Classify the error: errors.Is(err, net.ErrClosed) is the permanent
signal we already wanted to surface to Accept(); everything else is
transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff
(50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The
backoff sleep is interruptible via the closed channel so Close() still
shuts the loop down promptly.

Add a regression test that wraps a real listener with one that injects
3 fake transient errors before delegating, and asserts Accept() still
delivers the next real connection. Verified the test fails on the old
"any error is terminal" loop and passes with this change.

* fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic

The filter was rejecting any CALL-shaped record with prog=100003 or
100005 and vers!=3, regardless of the rpcvers field. If the caller is
speaking some other protocol that happens to share the port — or just
sending garbled bytes — pretending to be an NFSv3 server replying
PROG_MISMATCH is misleading at best, and at worst fabricates a coherent
RPC reply for traffic we don't actually understand.

Add an rpcvers==2 check between the msg_type and prog/vers parses. Any
non-v2 record now passes through to go-nfs, whose RFC 5531 §9
RPC_MISMATCH handling is the correct place to reject mis-versioned RPC.

Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers
field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the
client and that the conn is delivered to the inner accept handler.
Verified the test fails on the previous code (filter still rejected on
prog/vers alone) and passes with the guard in place.
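
Taken together, the checks this PR describes for the first peeked
frame can be sketched as a pure function over the 28 peeked bytes.
The function and constant names here are hypothetical, and the real
filter may order or structure its checks differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	peekLen       = 28 // 4-byte record marker + 24-byte fixed CALL header
	callHeaderLen = 24 // xid, msg_type, rpcvers, prog, vers, proc
	msgTypeCall   = 0
	rpcVersion    = 2
	progNFS       = 100003
	progMount     = 100005
	supportedVers = 3
)

// shouldRejectWithProgMismatch reports whether the peeked bytes are an
// ONC RPC v2 CALL for NFS or MOUNT at a version other than 3. Anything
// else returns false, meaning: pass through untouched.
func shouldRejectWithProgMismatch(hdr []byte) bool {
	if len(hdr) < peekLen {
		return false
	}
	be := binary.BigEndian
	// Low 31 bits of the record marker hold the fragment length.
	if be.Uint32(hdr[0:4])&0x7fffffff < callHeaderLen {
		return false // too short to contain a full CALL header
	}
	if be.Uint32(hdr[8:12]) != msgTypeCall {
		return false
	}
	if be.Uint32(hdr[12:16]) != rpcVersion {
		return false // not ONC RPC v2
	}
	prog, vers := be.Uint32(hdr[16:20]), be.Uint32(hdr[20:24])
	return (prog == progNFS || prog == progMount) && vers != supportedVers
}

// frame builds a peek-sized CALL prefix for the examples below.
func frame(fragLen, rpcvers, prog, vers uint32) []byte {
	hdr := make([]byte, peekLen)
	be := binary.BigEndian
	be.PutUint32(hdr[0:], 0x80000000|fragLen)
	be.PutUint32(hdr[4:], 1) // xid
	be.PutUint32(hdr[8:], msgTypeCall)
	be.PutUint32(hdr[12:], rpcvers)
	be.PutUint32(hdr[16:], prog)
	be.PutUint32(hdr[20:], vers)
	be.PutUint32(hdr[24:], 1) // proc
	return hdr
}

func main() {
	fmt.Println(
		shouldRejectWithProgMismatch(frame(40, 2, progNFS, 4)),  // NFSv4 CALL
		shouldRejectWithProgMismatch(frame(40, 2, progNFS, 3)),  // NFSv3 CALL
		shouldRejectWithProgMismatch(frame(12, 2, progNFS, 4)),  // short fragment
		shouldRejectWithProgMismatch(frame(40, 99, progNFS, 4)), // rpcvers != 2
	)
}
```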

* fix(nfs): bound Close() latency by evicting in-flight prefilter conns

Close() does wg.Wait() to drain handleConn goroutines, but each of those
goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up
to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC
header. A client that completes the TCP handshake but never sends a
byte therefore stretched shutdown by 10s per such conn — a real
regression for stop/restart paths and for tests that just want to tear
the listener down.

Track raw (pre-peek) conns in versionFilterListener.inFlight as
handleConn enters, untrack on exit, and have Close() forcibly close
every tracked conn before wg.Wait. Closing the underlying conn breaks
its Peek immediately, so handleConn returns within a single scheduler
hop. trackInFlight also short-circuits if shutdown has already started,
so a conn accepted after signalClose can't slip past the eviction.

Black-box regression test opens 4 idle TCP-handshake-only conns, lets
their handleConn goroutines settle into Peek, and asserts Close()
returns under 2s. Verified: same test fails on the previous code with
Close taking ~9.9s; passes here at ~100ms.
2026-04-28 12:17:54 -07:00
Chris Lu
5fbe39320c fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) (#9245)
* fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212)

ec.rebuild only sets CopyEcxFile=true on the first shard sent to the
rebuilder; subsequent shards rely on VolumeEcShardsCopy / ReceiveFile
auto-select to land on the same disk. The old auto-select used
FindEcVolume (in-memory) to detect the "already has this volume" case.
Mid-rebuild, no EC volume has been mounted yet on the destination, so
FindEcVolume returns nothing and the fallback picks "any HDD with free
space" — which can split shards from their .ecx across disks of the
same node and feed the orphan-shard layout reported in #9212 / fixed
on the loader side in #9244.

Add Store.FindEcShardTargetLocation as the canonical placement
primitive: prefer a mounted EC volume, then a disk that has the .ecx
on disk, then any HDD, then any disk. DiskLocation.HasEcxFileOnDisk is
the new on-disk check, and it looks at IdxDirectory first with a
fallback to Directory to handle .ecx written before -dir.idx was
configured.

Both VolumeEcShardsCopy and ReceiveFile now route through the new
helper, dropping their duplicated 4-level fallback ladder. No protocol
changes; explicit DiskId callers are unaffected.

* fix(volume_server): treat directories named *.ecx as no-match in HasEcxFileOnDisk

os.Stat(".ecx") succeeds for both files and directories. If something
happens to leave a directory named X.ecx in the data or idx folder,
HasEcxFileOnDisk would currently report true and FindEcShardTargetLocation
would route shards to that disk — where NewEcVolume's eventual
OpenFile(O_RDWR) on the same path errors out.

Add a !info.IsDir() check on both stat sites. Cheap and conservative.

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): collapse EC placement helper to a single pass

FindEcShardTargetLocation called FindFreeLocation up to four times. Each
call iterates s.Locations and acquires VolumesLen / EcShardCount RLocks
per disk — for a typical 4-disk node that's 32 RLock cycles per
placement decision.

Walk s.Locations once, score each disk by tier (mounted > .ecx-on-disk
> HDD > any-disk), break ties by free count. The free-slot math is
factored into a small helper that mirrors FindFreeLocation's formula
without re-entering the location's locks. Behaviour is unchanged: each
existing tier still wins over later tiers, and within a tier the disk
with the most free count still wins, matching the original max-tracking
in FindFreeLocation.
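
A sketch of the single-pass scoring, with a hypothetical disk struct
standing in for DiskLocation and its lock-protected counters:

```go
package main

import "fmt"

type disk struct {
	name    string
	mounted bool // EC volume already mounted on this disk
	hasEcx  bool // .ecx file present on this disk
	hdd     bool
	free    int // free slot count
}

// tier ranks a disk: mounted > .ecx-on-disk > HDD > any disk.
func tier(d disk) int {
	switch {
	case d.mounted:
		return 3
	case d.hasEcx:
		return 2
	case d.hdd:
		return 1
	default:
		return 0
	}
}

// pickTarget walks the disks once. A higher tier always wins; within a
// tier the disk with the most free slots wins. Full disks are skipped.
func pickTarget(disks []disk) (disk, bool) {
	var best disk
	bestTier, found := -1, false
	for _, d := range disks {
		if d.free <= 0 {
			continue
		}
		if t := tier(d); t > bestTier || (t == bestTier && d.free > best.free) {
			best, bestTier, found = d, t, true
		}
	}
	return best, found
}

func main() {
	disks := []disk{
		{name: "hdd-a", hdd: true, free: 50},
		{name: "ssd-b", hasEcx: true, free: 2},
		{name: "hdd-c", hdd: true, free: 80},
	}
	target, ok := pickTarget(disks)
	fmt.Println(target.name, ok) // prints "ssd-b true"
}
```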

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): thread dataShardCount as a parameter through EC placement

ecFreeShardCount and FindEcShardTargetLocation referenced
erasure_coding.DataShardsCount directly. Take it as a parameter so
custom-ratio builds (e.g. enterprise) can swap the default without
touching the helper itself, and so unit tests can pin a specific ratio
independent of the package constant. Default callsites in
VolumeEcShardsCopy and ReceiveFile now pass the package default
explicitly; tests pass a literal 10 for clarity.

* fix(volume_server): treat MaxVolumeCount=0 as unlimited in EC placement

ecFreeShardCount computed `MaxVolumeCount - VolumesLen()` and went
negative when MaxVolumeCount was 0 — the "unlimited disk" sentinel
already honoured by Store.hasFreeDiskLocation and friends. With a
negative free count, FindEcShardTargetLocation's `freeCount <= 0`
guard skipped the disk entirely, so unlimited disks could never receive
EC shards via the placement helper.

Special-case MaxVolumeCount<=0: report a synthetic large free count
that decrements with current usage, so unlimited disks are eligible
and tie-breaks still prefer the less-loaded one. Added
TestFindEcShardTargetLocation_HonoursUnlimitedDisk as the regression.

Reported in PR #9245 review by @gemini-code-assist.

* fix(volume_server): account in shard slots, not volume slots, in ecFreeShardCount

FindFreeLocation in store.go ends with `free /= DataShardsCount`,
converting "shard slots free" back to "volume-equivalent slots." The
truncation is harmless there, but my new ecFreeShardCount inherited
the same final divide and re-introduced exactly the orphan-shard
hazard #9245 was meant to prevent: with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 the formula reports 0 even though the
disk has room for 9 more shards, so subsequent shards route off the
.ecx-owning disk into the HDD-fallback tier.

Drop the trailing divide and return the count directly in shard slots.
Same shape, finer granularity; tie-breaks still order by free count.
The unlimited branch's "used" calculation is updated to match (mix
volume-slots and shard-slots in shard units). Added
TestFindEcShardTargetLocation_TightProvisioningKeepsEcxDisk as the
regression.
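
The arithmetic above, worked through. The formulas are inferred from
the numbers in this message (DataShardsCount taken as 10); the
function names are hypothetical, and the unlimited-disk branch is
omitted.

```go
package main

import "fmt"

// freeVolumeSlots mirrors the old shape: shard slots free, divided
// back down to volume-equivalent slots (integer division truncates).
func freeVolumeSlots(maxVolumes, volumes, ecShards, dataShards int) int {
	return ((maxVolumes-volumes)*dataShards - ecShards) / dataShards
}

// freeShardSlots drops the trailing divide and stays in shard units.
func freeShardSlots(maxVolumes, volumes, ecShards, dataShards int) int {
	return (maxVolumes-volumes)*dataShards - ecShards
}

func main() {
	// MaxVolumeCount=1, VolumesLen=0, EcShardCount=1
	fmt.Println(freeVolumeSlots(1, 0, 1, 10)) // prints 0: (10-1)/10 truncates
	fmt.Println(freeShardSlots(1, 0, 1, 10))  // prints 9
}
```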

Reported in PR #9245 review by @coderabbitai.
2026-04-27 15:59:57 -07:00
Lisandro Pin
2c404f66bc Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs,
which can indicate client issues.

Note that the metric does not cover gRPC calls, as the current proto service API
does not support seeking files by ID.
2026-04-27 12:22:42 -07:00
Chris Lu
7f770b1553 fix(filer): return 503 + Retry-After when remote object not cached yet (#9236)
* extend cache-not-ready handling to filer HTTP path

Mirror the s3api change for the native filer HTTP handlers. When the
filer GET hits a remote-only object whose cache fill hasn't completed,
return 503 Service Unavailable with Retry-After: 5 instead of 500
Internal Error, and treat client disconnects as silent cancellations
rather than logging them as errors.

Adds an ErrCacheNotReady sentinel and a small helper used at the
prepareWriteFn-error sites in ProcessRangeRequest, so the same
classification (cancel / not-ready / other) applies to plain GETs,
single-range, and multi-range requests.

* clear Content-Range on prepareWriteFn error

The single-range path sets Content-Range before calling prepareWriteFn.
If prepareWriteFn fails, http.Error is about to write a fresh body for
503 or 500, but the stale Content-Range header would still go out and
no longer match. Drop it alongside Content-Length in the shared helper
so all current and future callers are covered.

* strip success-path headers and forward NotFound on prepareWriteFn error

When ProcessRangeRequest writes an error response, the previously-set
success headers (Content-Disposition, ETag, Last-Modified, in addition
to Content-Length/Content-Range) shouldn't ride along on the new body.
With ?dl=1 a stale Content-Disposition would even cause browsers to
save the error message under the object's filename. Strip them all in
the shared helper.

Also forward filer_pb.ErrNotFound through the cache-failure branch so a
mid-cache entry deletion surfaces as 404, not as a 503 retry-loop.
Permanent upstream cloud errors (403/404 from the cloud SDK) still come
back as opaque wrapped strings via FetchAndWriteNeedle and remain
mapped to 503; distinguishing those would need a wider refactor.
2026-04-27 01:58:33 -07:00
Lisandro Pin
93247d6de4 Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.

Allows tracking overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

* Export REST `file_{read,write}_failures` metrics on volume servers.
2026-04-24 11:45:21 -07:00