1882 Commits

Chris Lu
3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.
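
In sketch form, the Step 6 gate from the first bullet looks roughly like
this (names invented; queryShardIds stands in for a VolumeEcShardsInfo
round-trip, and the query-error policy is assumed to fail closed):

    package ec

    import (
        "fmt"
        "math/bits"
    )

    // requireFullShardSet unions the shard ids reported by every destination
    // and refuses source deletion unless all distinct ids are observed.
    // Sketch only: assumes totalShards <= 32 so a uint32 bitmap suffices.
    func requireFullShardSet(servers []string, vid uint32, totalShards int,
        queryShardIds func(server string, vid uint32) ([]uint32, error)) error {
        var seen uint32 // bitmap over shard ids 0..totalShards-1
        for _, srv := range servers {
            ids, err := queryShardIds(srv, vid)
            if err != nil {
                // assumed policy: a failed query blocks deletion (fail closed)
                return fmt.Errorf("query %s: %w", srv, err)
            }
            for _, id := range ids {
                if int(id) < totalShards { // mirror the shard-id bound before setting a bit
                    seen |= 1 << id
                }
            }
        }
        if got := bits.OnesCount32(seen); got < totalShards {
            return fmt.Errorf("saw %d of %d shards; keeping source volume read-only", got, totalShards)
        }
        return nil
    }

The OSS callers pass 14 for totalShards; see the parametrization commit
below.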

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu
d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.
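
A rough sketch of the collapsed lookup (field and lock names invented;
the real method is findEcShard):

    package storage

    import "sync"

    type ecShard struct{ ShardId uint8 }
    type ecVolume struct{ Shards []*ecShard }
    type diskLocation struct{ ecVolumes map[uint32]*ecVolume }

    type store struct {
        ecVolumesLock sync.RWMutex
        locations     []*diskLocation
    }

    func (s *store) findEcShard(vid uint32, shardId uint8) (*ecShard, bool) {
        s.ecVolumesLock.RLock() // one lock spans the whole scan, so mount/unmount cannot race it
        defer s.ecVolumesLock.RUnlock()
        for _, loc := range s.locations { // probe every DiskLocation, not just the first ecVolume hit
            ev, ok := loc.ecVolumes[vid]
            if !ok {
                continue
            }
            for _, sh := range ev.Shards {
                if sh.ShardId == shardId {
                    return sh, true
                }
            }
        }
        return nil, false
    }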

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu
f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
2026-05-12 18:22:21 -07:00
Chris Lu
43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
2026-05-12 16:00:44 -07:00
Chris Lu
f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.
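
Condensed, the EC claim bookkeeping from the last two bullets looks
roughly like this (types invented; the real code also bounds non-EC
volume claims by the replica copy count, and the revert higher up in
this log rolls the whole mechanism back):

    package topology

    type owner struct {
        ip   string
        port int
    }

    type ecSlot struct {
        vid     uint32
        shardId uint8
    }

    type bindings struct {
        ecOwner map[ecSlot]owner   // first claimant wins per (vid, shard_id)
        byOwner map[owner][]ecSlot // reverse index: disconnect cleanup is O(M) in what the peer held
    }

    // claimEcShard reports whether the claim is accepted; on a foreign
    // re-claim the caller clears the bit from EcIndexBits and compacts
    // the parallel ShardSizes array.
    func (b *bindings) claimEcShard(o owner, s ecSlot) bool {
        if prev, ok := b.ecOwner[s]; ok {
            return prev == o
        }
        b.ecOwner[s] = o
        b.byOwner[o] = append(b.byOwner[o], s)
        return true
    }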

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu
10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.
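
A sketch of the vid_map refcounting from the second bullet, with
invented names:

    package wdclient

    type vidMap struct {
        serverRefCount map[string]int // volume-server address -> number of vid locations
    }

    func (m *vidMap) addLocation(addr string) {
        if addr == "" {
            return // a zero-value Location must not leak a permanent serverRefCount[""] bucket
        }
        m.serverRefCount[addr]++
    }

    func (m *vidMap) removeLocation(addr string) {
        if addr == "" {
            return
        }
        if m.serverRefCount[addr]--; m.serverRefCount[addr] <= 0 {
            delete(m.serverRefCount, addr)
        }
    }

    func (m *vidMap) hasVolumeServer(addr string) bool {
        return m.serverRefCount[addr] > 0 // O(1): no scan over every vid's locations
    }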

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu
21054b6c18 volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.

For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.

Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.

The gate also tolerates a nil guard (developmental / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
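
Put together, the gate described above amounts to something like this
sketch (helper names invented; the revert higher up in this log rolls
the fail-closed semantics back):

    package security

    import (
        "context"
        "net"

        "google.golang.org/grpc/peer"
    )

    func isAdminPeerAllowed(ctx context.Context, whitelistConfigured bool,
        isWhiteListed func(host string) bool) bool {
        p, ok := peer.FromContext(ctx)
        if !ok {
            return false // if we don't know who the caller is, refuse
        }
        tcp, ok := p.Addr.(*net.TCPAddr)
        if !ok {
            return true // non-TCP peer ("@", bufconn, unix socket) shares our OS process
        }
        if tcp.IP.IsLoopback() {
            return true // cohabiting master/filer keeps working without -whiteList
        }
        if !whitelistConfigured {
            return false // fail closed: off-host caller, no whitelist configured
        }
        return isWhiteListed(tcp.IP.String())
    }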
2026-05-12 12:35:27 -07:00
Chris Lu
69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
2026-05-12 10:11:20 -07:00
Chris Lu
5e8f99f40a filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires
a Bearer token in the authorization metadata, signed with the filer
write-signing key. The service refuses to register on a filer that has
no jwt.filer_signing.key set, so the unauthenticated default is gone:
operators who use these RPCs must configure the key and attach a token
on every call.

Bearer scheme matching is case-insensitive (RFC 6750), every handler
nil-checks req before dereferencing it, and tests now cover the
expired-token path.
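
The case-insensitive scheme match is small enough to sketch (a stand-in,
not the PR's exact helper):

    package filer

    import "strings"

    // bearerToken extracts the token from an authorization metadata value,
    // matching the "Bearer" scheme case-insensitively per RFC 6750.
    func bearerToken(authorization string) (string, bool) {
        const scheme = "bearer "
        if len(authorization) <= len(scheme) ||
            !strings.EqualFold(authorization[:len(scheme)], scheme) {
            return "", false
        }
        return strings.TrimSpace(authorization[len(scheme):]), true
    }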
2026-05-12 10:11:08 -07:00
Chris Lu
05ed5c9ae8 filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token
scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar
sibling paths. Match on /-separated path components after path.Clean
normalisation instead.
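
A sketch of the component-scoped match (helper name invented):

    package filer

    import (
        "path"
        "strings"
    )

    // pathWithinPrefix reports whether p lies on or under prefix by whole
    // /-separated components, so /tenant1 matches /tenant1/x but never
    // /tenant1234 or /tenant1-old.
    func pathWithinPrefix(p, prefix string) bool {
        p = path.Clean("/" + p)
        prefix = path.Clean("/" + prefix)
        if prefix == "/" || p == prefix {
            return true
        }
        return strings.HasPrefix(p, prefix+"/")
    }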
2026-05-12 10:10:48 -07:00
Chris Lu
532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, master heartbeats wrongly reported under the target
disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.
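
In sketch form (types invented; note the follow-up commit below swaps
the raw equality for normalized comparison):

    package placement

    type diskCandidate struct {
        DiskType string
    }

    // partitionByDiskType splits candidates into a matching tier, tried
    // first, and a fallback tier consulted only if the matching pool
    // cannot satisfy ShardsNeeded.
    func partitionByDiskType(cands []diskCandidate, preferred string) (matching, fallback []diskCandidate) {
        if preferred == "" {
            return cands, nil // empty preference keeps the single-pool behavior
        }
        for _, c := range cands {
            if c.DiskType == preferred { // the follow-up fix compares normalized values instead
                matching = append(matching, c)
            } else {
                fallback = append(fallback, c)
            }
        }
        return matching, fallback
    }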

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.
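
The normalization itself is tiny; sketched from the description above:

    package placement

    import "strings"

    func normalizeDiskType(t string) string {
        if t = strings.ToLower(t); t == "" {
            return "hdd" // the HardDriveType sentinel weed/storage/types uses
        }
        return t
    }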

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
2026-05-11 20:21:50 -07:00
Chris Lu
b2d24dd54f volume: require admin auth on BatchDelete (#9438)
Run BatchDelete through checkGrpcAdminAuth like the other destructive
volume-server RPCs (VolumeDelete, DeleteCollection, vacuum, EC, ...),
so a whitelist-configured server denies non-admin callers.
2026-05-11 13:50:48 -07:00
Chris Lu
2b21d19e4c volume: require admin auth on ReadAllNeedles and VolumeNeedleStatus (#9437)
Both RPCs hand out raw needle bytes / cookies. Run them through
checkGrpcAdminAuth like the rest of the volume-server admin handlers.
2026-05-11 13:50:19 -07:00
Minsoo Kim
a1e5eb9dad Fix UI prefix URL encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes

* Fix filer UI navigation for URL-sensitive object prefixes

* Clarify filer UI path escaping test name

Rename the legacy filer UI path test to describe the actual behavior
being checked.

The printpath helper preserves timestamp characters that are valid in
URL path components, while the PR fix is focused on query-string
escaping for path and cursor parameters.
2026-05-06 19:14:36 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.
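
A minimal sketch of the opt-out (field names invented, standing in for
the backend plumbing):

    package storage

    type volume struct {
        remoteKey        string                 // empty when the volume has no remote copy
        deleteRemoteFile func(key string) error // backendStorage.DeleteFile stand-in
        removeLocalFiles func() error
    }

    func (v *volume) destroy(keepRemoteData bool) error {
        if v.remoteKey != "" && !keepRemoteData {
            // real-delete callers keep the default and clean up the cloud object
            if err := v.deleteRemoteFile(v.remoteKey); err != nil {
                return err
            }
        }
        // move/balance callers pass keepRemoteData=true: the destination's
        // new .vif still references the same remote object
        return v.removeLocalFiles()
    }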

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
2417ba0354 fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints

Three destructive VolumeServer gRPC endpoints (DeleteCollection,
VolumeDelete, VolumeServerLeave) had no authentication checks, unlike
their HTTP counterparts which are protected by the Guard whitelist.

Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth
helper on VolumeServer that extracts the peer IP from gRPC context and
validates it against the guard whitelist. Gate all three endpoints
behind this check.

* fix(volume): tolerate unparseable gRPC peer address in admin auth check

S3 Filer Group integration tests were failing with
PermissionDenied "bad peer address: address @: missing port in address"
when DeleteCollection ran across the in-process gRPC connection
between filer and volume server — the peer addr surfaces as "@" there
and net.SplitHostPort can't parse it. The check rejected before
IsWhiteListed could exercise its allow-all path for empty-whitelist
deployments.

Hand the raw peer string to IsWhiteListed when SplitHostPort fails.
With no whitelist configured (the test environment's mode) it accepts;
with a whitelist configured the unparseable host won't match anything
and the call still gets denied as it should.

Adds three regression tests for IsWhiteListed pinning the empty-config
allow-all, populated-list reject-unknown, and signing-key-only allow-
all branches that the gRPC admin helper relies on.

* refactor(security): dedup checkWhiteList through IsWhiteListed

The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the
same lookup logic in two places; future drift was just a matter of
time. Have checkWhiteList delegate so the membership semantics live
in exactly one function.

Behaviour is unchanged: the new path still returns nil for
isEmptyWhiteList (signing-key-only mode) and still rejects unknown
hosts when a whitelist is configured.

Addresses gemini medium review on PR #8876.

* fix(volume): protect remaining state-altering gRPC admin endpoints

DeleteCollection, VolumeDelete, and VolumeServerLeave were the
truly-destructive endpoints, but AllocateVolume, VolumeMount,
VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and
VolumeMarkWritable also modify server state and should sit behind
the same whitelist gate. Read-only endpoints (VolumeStatus,
VolumeServerStatus, VolumeNeedleStatus, Ping) stay open.

The check is a no-op when no whitelist is configured (the default),
so existing deployments keep working; operators who lock down their
volume servers via guard.white_list now get consistent coverage.

Addresses gemini security-high review on PR #8876.

* fix(volume): typed peer addr + audit log for gRPC admin auth

Prefer a typed *net.TCPAddr when extracting the peer IP — string
parsing was already a fallback for the in-process case but using the
typed form first is cleaner and skips an unnecessary parse on the
common path. Log failed authorization attempts at V(0) so an operator
running with a whitelist sees the host that was rejected (and the
raw remote address in case the IP lookup itself was the failure
mode), matching what the HTTP Guard already does.

Addresses gemini medium review on PR #8876.

* fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints

Five more master/admin-driven destructive operations live outside
volume_grpc_admin.go and were missing the same whitelist gate:

- VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup
- ScrubVolume
- VolumeEcShardsDelete

VacuumVolumeCheck stays open (read-only). BatchDelete also stays
open: it's the data-plane multi-object delete called from the S3 API
and filer, not an admin operation; gating it would break ordinary S3
DeleteObjects calls.

Addresses gemini security-high review on PR #8876.

* fix(volume): simplify no-peer-info branch in gRPC admin auth

The IsWhiteListed("") fallback was defending against a scenario
that doesn't actually arise — real gRPC connections always populate
peer info. Drop the branch and just deny when peer info is missing,
which is the safer default and matches "if we don't know who the
caller is, refuse".

* fix(volume-rust): mirror gRPC admin auth on the rust volume server

The rust volume server has the same set of destructive admin
endpoints as the Go side and the same Guard infrastructure, but
nothing was wired together — every endpoint accepted unauthenticated
calls regardless of guard configuration. Same vulnerability class
the Go fix on this PR closes; this commit closes it on the rust
side too so the two stacks stay aligned.

Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer
SocketAddr off the tonic Request and runs Guard::check_whitelist on
its IP, then applies the helper to the same set the Go side covers:
DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount,
VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable,
VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit,
VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume,
VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete
stays open as a data-plane multi-object delete.
2026-05-04 21:14:55 -07:00
Chris Lu
d265274e13 fix(nfs): accept dirpath anywhere under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy

weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.

Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.

Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.

* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export

Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.

  - dirpath equals the export root: serve the export root.
  - dirpath strictly under the export, directory entry: serve that
    subdirectory; the returned filehandle encodes its inode.
  - dirpath strictly under the export, missing or non-directory: reject
    with NoEnt or NotDir.
  - dirpath outside the export: keep the rclone-style fallback to the
    export root.
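
Condensed, the four rules above resolve like this (lookup is a
hypothetical stand-in for the filer stat):

    package nfs

    import "strings"

    type entryKind int

    const (
        kindMissing entryKind = iota
        kindDir
        kindOther
    )

    // resolveMountPath applies the dirpath rules above and returns the
    // path to serve plus a MOUNT3 status name.
    func resolveMountPath(dirpath, exportRoot string, lookup func(string) entryKind) (string, string) {
        switch {
        case dirpath == exportRoot:
            return exportRoot, "OK"
        case strings.HasPrefix(dirpath, exportRoot+"/"):
            switch lookup(dirpath) {
            case kindDir:
                return dirpath, "OK" // sub-rooted fs; the returned FH encodes this inode
            case kindMissing:
                return "", "NOENT"
            default:
                return "", "NOTDIR"
            }
        default:
            return exportRoot, "OK" // rclone-style fallback to the export root
        }
    }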

TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.

The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.

* nfs: propagate request context through MOUNT3 resolution

Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.

lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().

Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.

* nfs: mirror TCP defensive checks on the UDP MOUNT3 path

Two transport-parity bugs the rabbit caught:

(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.

(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.

Lower the successful subexport log to V(1) to match the TCP side.

* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles

The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.

* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem

Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.

* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers

Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:

  - Both transports observe the same chunk reader cache and chunk
    invalidator without depending on call order during startup.
  - The UDP defensive Lstat doesn't allocate a fresh wrapper per
    MOUNT request anymore; one struct lives for the life of the
    Server.

The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.

* nfs: drive rootFilesystem before reading sharedReaderCache on UDP

The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.

Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.

* nfs: collapse exact-match and outside-export MOUNT branches

The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
2026-04-30 10:06:44 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results and allowing operators to
construct alerts and dashboards to monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level, and not
intended for fine-grained tracking of scrubbing operations.
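
For reference, a sketch of how such collectors are typically registered
with the Prometheus Go client; the metric names are from this PR, the
label set is an assumption:

    package stats

    import "github.com/prometheus/client_golang/prometheus"

    var (
        scrubLastTimeSeconds = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "scrub_last_time_seconds",
                Help: "Unix time of the last completed scrub.",
            },
            []string{"volume"}, // assumed label
        )
        scrubVolumeFailures = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "scrub_volume_failures",
                Help: "Cumulative count of volume scrub failures.",
            },
            []string{"volume"}, // assumed label
        )
    )

    func init() {
        prometheus.MustRegister(scrubLastTimeSeconds, scrubVolumeFailures)
    }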
2026-04-28 12:34:02 -07:00
Chris Lu
e2c8791441 fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter

The upstream willscott/go-nfs library dispatches RPC calls by (program,
procedure) only — it does not validate the program version. A client
sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same
handler map as NFSv3 and gets routed to v3 SETATTR, which parses the
COMPOUND args as SETATTR3args and writes a malformed reply. The kernel
then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version
or transport protocol is not supported" without retrying v3.

This commit adds a listener wrapper that peeks the first RPC frame on
each new TCP connection. If the program is NFS or MOUNT and the version
is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported
range 3..3, per RFC 5531) directly to the socket and closes the
connection. v3 frames are replayed unchanged via a bufio reader so go-nfs
sees the original bytes. Unknown programs pass through so go-nfs's own
PROG_UNAVAIL handling stays in charge.

The filter is not yet wired into the server; the next commit activates
it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and
unknown-program pass-through.
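
The classification step, sketched with the guards the later commits on
this PR add folded in (short first fragments and non-v2 rpcvers pass
through); offsets follow the layout after the 4-byte record mark:

    package nfs

    import "encoding/binary"

    const (
        progNFS   = 100003
        progMount = 100005
    )

    // wantsProgMismatch decides from the peeked bytes whether to
    // synthesize a PROG_MISMATCH (supported range 3..3) reply and close.
    func wantsProgMismatch(hdr []byte) bool {
        if len(hdr) < 28 {
            return false // first fragment too short for a CALL header: pass through to go-nfs
        }
        msgType := binary.BigEndian.Uint32(hdr[8:12])  // 0 = CALL
        rpcVers := binary.BigEndian.Uint32(hdr[12:16]) // only classify ONC RPC v2
        prog := binary.BigEndian.Uint32(hdr[16:20])
        vers := binary.BigEndian.Uint32(hdr[20:24])
        if msgType != 0 || rpcVers != 2 {
            return false
        }
        return (prog == progNFS || prog == progMount) && vers != 3
    }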

* fix(nfs): wire NFSv3 version filter into the listener chain

Place the version filter after the optional client allowlist so that
unauthorized peers are still rejected first by IP/CIDR before we look at
RPC content. With the filter active, a Linux client doing the default
v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which
lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3
mountOptions that already work for rclone serve nfs against this
deployment.

* test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1

TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending
mountProgramID with version 1, so the test never actually covered the
"reject MOUNT v4" path it claims to exercise. The filter does reject any
non-v3 version uniformly, so the test still passed, but a future change
that tightened the version check (for example, only rejecting v4) would
let this test silently lie about coverage. Bump the call to version 4 so
the name matches what is actually exercised.

* refactor(nfs): reuse package RPC constants and io.ReadFull in version filter

The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH,
AUTH_NONE, the NFS/MOUNT program numbers) are already named in
portmap.go alongside the portmap responder. Reuse them here instead of
defining a parallel set in rpc_version_filter.go: keeping one source of
truth per package means a future correction in one spot can't drift away
from the other. The filter-only constants (peek timeout, peek length,
supportedNFSVer) stay local because they have no portmap analog.

In the test, drop the bespoke readFull loop in favor of io.ReadFull.
The custom version was a near-identical reimplementation that did not
return io.ErrUnexpectedEOF on short reads, so the standard library is
both shorter and more diagnostic-friendly.

* fix(nfs): move RPC peek off the Accept path

The previous wrapper called filterFirstRPCFrame inline inside
versionFilterListener.Accept, which meant a single slow or idle TCP
connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line
blocking against every other accept: gonfs.Serve calls Accept serially,
so each in-flight peek stalled the next legitimate client until the
deadline expired. An attacker who simply opens a TCP connection without
sending any RPC payload could trivially throttle accept throughput.

Restructure the wrapper so a background goroutine drives the inner
Accept loop and hands each raw conn to its own short-lived goroutine
that runs the peek. Validated conns are sent on a buffered channel,
which the wrapper's Accept reads from; rejected conns finish their
PROG_MISMATCH reply and disappear without ever reaching the channel.
This means N concurrent slow clients only block themselves, not the
N+1th fast client that connects after them.

Add Close coordination — sync.WaitGroup for the accept loop and per-conn
peek goroutines, plus a closed channel so Accept unblocks immediately on
shutdown — so the wrapper now satisfies the full net.Listener contract
instead of relying on the embedded listener.

Add a regression test that opens a slow conn (TCP only, never writes)
and a fast conn (sends a v3 frame) and asserts the fast conn reaches
the inner accept handler well below the peek timeout.

* test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close

The post-rejection check was only failing when conn.Read succeeded; any
error — including a deadline timeout because the server kept the socket
open — let the test pass. That defeats the point of the assertion: a
regression where the filter replies but forgets to close would slip
through silently.

Match against io.EOF explicitly. The TCP semantics are deterministic
here: the server writes PROG_MISMATCH, calls conn.Close(), the client
reads what's left in flight and then sees a clean FIN, which surfaces
as io.EOF on the next zero-byte read.

* fix(nfs): reject short first fragments before parsing RPC header fields

bufio.Reader.Peek(28) is willing to read across record boundaries to
satisfy the requested length, so a final fragment whose body is shorter
than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog
+ vers + proc) leaves the trailing peek bytes pointing at the next
RPC's framing or whatever bytes happen to follow on the wire. Indexing
hdr[16:24] for prog/vers in that state can spuriously reject (or pass
through) traffic based on data that doesn't belong to the request being
classified.

Drop those frames out of the filter early: if the first fragment can't
possibly hold a full CALL header, pass the connection straight to
go-nfs, which has its own framing-error handling for malformed input.

Add a regression test that crafts a 12-byte first fragment whose
trailing peek bytes are deliberately shaped like an NFSv4 CALL — without
the length check the filter sends a PROG_MISMATCH; with it, the conn
passes through silently. Verified by stashing the production-code change
and running the test in isolation: it fails as expected without the fix.

* fix(nfs): retry transient Accept() errors instead of treating any error as terminal

acceptLoop previously exited on the first error returned by the inner
listener's Accept(). That conflates two very different failure modes:
permanent shutdown (the listener was Close()d, OS-level fatal failure)
and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on
accept). The transient case should not take the entire NFS server down
— a single fd-table-full event would leave the deployment offline until
restart.

Classify the error: errors.Is(err, net.ErrClosed) is the permanent
signal we already wanted to surface to Accept(); everything else is
transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff
(50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The
backoff sleep is interruptible via the closed channel so Close() still
shuts the loop down promptly.

Add a regression test that wraps a real listener with one that injects
3 fake transient errors before delegating, and asserts Accept() still
delivers the next real connection. Verified the test fails on the old
"any error is terminal" loop and passes with this change.

* fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic

The filter was rejecting any CALL-shaped record with prog=100003 or
100005 and vers!=3, regardless of the rpcvers field. If the caller is
speaking some other protocol that happens to share the port — or just
sending garbled bytes — pretending to be an NFSv3 server replying
PROG_MISMATCH is misleading at best, and at worst fabricates a coherent
RPC reply for traffic we don't actually understand.

Add an rpcvers==2 check between the msg_type and prog/vers parses. Any
non-v2 record now passes through to go-nfs, whose RFC 5531 §9
RPC_MISMATCH handling is the correct place to reject mis-versioned RPC.

Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers
field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the
client and that the conn is delivered to the inner accept handler.
Verified the test fails on the previous code (filter still rejected on
prog/vers alone) and passes with the guard in place.

* fix(nfs): bound Close() latency by evicting in-flight prefilter conns

Close() does wg.Wait() to drain handleConn goroutines, but each of those
goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up
to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC
header. A client that completes the TCP handshake but never sends a
byte therefore stretched shutdown by 10s per such conn — a real
regression for stop/restart paths and for tests that just want to tear
the listener down.

Track raw (pre-peek) conns in versionFilterListener.inFlight as
handleConn enters, untrack on exit, and have Close() forcibly close
every tracked conn before wg.Wait. Closing the underlying conn breaks
its Peek immediately, so handleConn returns within a single scheduler
hop. trackInFlight also short-circuits if shutdown has already started,
so a conn accepted after signalClose can't slip past the eviction.

Black-box regression test opens 4 idle TCP-handshake-only conns, lets
their handleConn goroutines settle into Peek, and asserts Close()
returns under 2s. Verified: same test fails on the previous code with
Close taking ~9.9s; passes here at ~100ms.
2026-04-28 12:17:54 -07:00
Chris Lu
5fbe39320c fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) (#9245)
* fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212)

ec.rebuild only sets CopyEcxFile=true on the first shard sent to the
rebuilder; subsequent shards rely on VolumeEcShardsCopy / ReceiveFile
auto-select to land on the same disk. The old auto-select used
FindEcVolume (in-memory) to detect the "already has this volume" case.
Mid-rebuild, no EC volume has been mounted yet on the destination, so
FindEcVolume returns nothing and the fallback picks "any HDD with free
space" — which can split shards from their .ecx across disks of the
same node and feed the orphan-shard layout reported in #9212 / fixed
on the loader side in #9244.

Add Store.FindEcShardTargetLocation as the canonical placement
primitive: prefer a mounted EC volume, then a disk that has the .ecx
on disk, then any HDD, then any disk. DiskLocation.HasEcxFileOnDisk is
the new on-disk check, and it looks at IdxDirectory first with a
fallback to Directory to handle .ecx written before -dir.idx was
configured.

Both VolumeEcShardsCopy and ReceiveFile now route through the new
helper, dropping their duplicated 4-level fallback ladder. No protocol
changes; explicit DiskId callers are unaffected.

* fix(volume_server): treat directories named *.ecx as no-match in HasEcxFileOnDisk

os.Stat(".ecx") succeeds for both files and directories. If something
happens to leave a directory named X.ecx in the data or idx folder,
HasEcxFileOnDisk would currently report true and FindEcShardTargetLocation
would route shards to that disk — where NewEcVolume's eventual
OpenFile(O_RDWR) on the same path errors out.

Add a !info.IsDir() check on both stat sites. Cheap and conservative.
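
A sketch of the adjusted check (file-naming scheme assumed; the real
helper also consults IdxDirectory before Directory):

  // assumes: import ("fmt"; "os"; "path/filepath")
  func hasEcxFileOnDisk(dir string, vid uint32) bool {
      info, err := os.Stat(filepath.Join(dir, fmt.Sprintf("%d.ecx", vid)))
      return err == nil && !info.IsDir() // a directory named X.ecx is no match
  }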

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): collapse EC placement helper to a single pass

FindEcShardTargetLocation called FindFreeLocation up to four times. Each
call iterates s.Locations and acquires VolumesLen / EcShardCount RLocks
per disk — for a typical 4-disk node that's 32 RLock cycles per
placement decision.

Walk s.Locations once, score each disk by tier (mounted > .ecx-on-disk
> HDD > any-disk), break ties by free count. The free-slot math is
factored into a small helper that mirrors FindFreeLocation's formula
without re-entering the location's locks. Behaviour is unchanged: each
existing tier still wins over later tiers, and within a tier the disk
with the most free count still wins, matching the original max-tracking
in FindFreeLocation.
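
A compressed sketch of the single pass, over assumed types (the real
DiskLocation API differs):

  type ecDisk interface {
      hasMountedEcVolume(vid uint32) bool
      hasEcxFileOnDisk(vid uint32) bool
      isHdd() bool
      freeShardSlots() int
  }

  func pickEcShardLocation(disks []ecDisk, vid uint32) ecDisk {
      var best ecDisk
      bestTier, bestFree := 99, -1
      for _, d := range disks { // one walk, no per-tier re-scan
          free := d.freeShardSlots()
          if free <= 0 {
              continue
          }
          tier := 3 // any disk with free space
          switch {
          case d.hasMountedEcVolume(vid):
              tier = 0
          case d.hasEcxFileOnDisk(vid):
              tier = 1
          case d.isHdd():
              tier = 2
          }
          // lower tier wins; within a tier, most free slots wins
          if tier < bestTier || (tier == bestTier && free > bestFree) {
              best, bestTier, bestFree = d, tier, free
          }
      }
      return best
  }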

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): thread dataShardCount as a parameter through EC placement

ecFreeShardCount and FindEcShardTargetLocation referenced
erasure_coding.DataShardsCount directly. Take it as a parameter so
custom-ratio builds (e.g. enterprise) can swap the default without
touching the helper itself, and so unit tests can pin a specific ratio
independent of the package constant. Default callsites in
VolumeEcShardsCopy and ReceiveFile now pass the package default
explicitly; tests pass a literal 10 for clarity.

* fix(volume_server): treat MaxVolumeCount=0 as unlimited in EC placement

ecFreeShardCount computed `MaxVolumeCount - VolumesLen()` and went
negative when MaxVolumeCount was 0 — the "unlimited disk" sentinel
already honoured by Store.hasFreeDiskLocation and friends. With a
negative free count, FindEcShardTargetLocation's `freeCount <= 0`
guard skipped the disk entirely, so unlimited disks could never receive
EC shards via the placement helper.

Special-case MaxVolumeCount<=0: report a synthetic large free count
that decrements with current usage, so unlimited disks are eligible
and tie-breaks still prefer the less-loaded one. Added
TestFindEcShardTargetLocation_HonoursUnlimitedDisk as the regression.

Reported in PR #9245 review by @gemini-code-assist.

* fix(volume_server): account in shard slots, not volume slots, in ecFreeShardCount

FindFreeLocation in store.go ends with `free /= DataShardsCount`,
converting "shard slots free" back to "volume-equivalent slots." The
truncation is harmless there, but my new ecFreeShardCount inherited
the same final divide and re-introduced exactly the orphan-shard
hazard #9245 was meant to prevent: with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 the formula reports 0 even though the
disk has room for 9 more shards, so subsequent shards route off the
.ecx-owning disk into the HDD-fallback tier.

Drop the trailing divide and return the count directly in shard slots.
Same shape, finer granularity; tie-breaks still order by free count.
The unlimited branch's "used" calculation is updated to match
(combining volume slots and shard slots in shard units). Added
TestFindEcShardTargetLocation_TightProvisioningKeepsEcxDisk as the
regression.
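
Putting the two fixes together, an illustrative version of the
accounting (field meanings assumed): with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 and dataShardCount=10 it reports 9 free
shard slots instead of 0.

  func ecFreeShardCount(maxVolumeCount, volumesLen, ecShardCount, dataShardCount int) int {
      used := volumesLen*dataShardCount + ecShardCount // all in shard units
      if maxVolumeCount <= 0 {
          // unlimited-disk sentinel: a large synthetic capacity keeps the
          // disk eligible while tie-breaks still prefer the less-loaded one
          return (1 << 20) - used
      }
      return maxVolumeCount*dataShardCount - used // no trailing divide
  }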

Reported in PR #9245 review by @coderabbitai.
2026-04-27 15:59:57 -07:00
Lisandro Pin
2c404f66bc Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs,
which can indicate client issues.

Note that the metric does not cover gRPC calls, as the current proto service API
does not support seeking files by ID.
2026-04-27 12:22:42 -07:00
Chris Lu
7f770b1553 fix(filer): return 503 + Retry-After when remote object not cached yet (#9236)
* extend cache-not-ready handling to filer HTTP path

Mirror the s3api change for the native filer HTTP handlers. When the
filer GET hits a remote-only object whose cache fill hasn't completed,
return 503 Service Unavailable with Retry-After: 5 instead of 500
Internal Error, and treat client disconnects as silent cancellations
rather than logging them as errors.

Adds an ErrCacheNotReady sentinel and a small helper used at the
prepareWriteFn-error sites in ProcessRangeRequest, so the same
classification (cancel / not-ready / other) applies to plain GETs,
single-range, and multi-range requests.
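
A sketch of the classification (sentinel and helper shown with
illustrative names and wording):

  // assumes: import ("context"; "errors"; "net/http")
  var ErrCacheNotReady = errors.New("remote object cache fill not complete")

  func writePrepareError(w http.ResponseWriter, err error) {
      switch {
      case errors.Is(err, context.Canceled):
          // client disconnected: cancel silently, log nothing
      case errors.Is(err, ErrCacheNotReady):
          w.Header().Set("Retry-After", "5")
          http.Error(w, "object is being cached, retry later",
              http.StatusServiceUnavailable)
      default:
          http.Error(w, err.Error(), http.StatusInternalServerError)
      }
  }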

* clear Content-Range on prepareWriteFn error

The single-range path sets Content-Range before calling prepareWriteFn.
If prepareWriteFn fails, http.Error is about to write a fresh body for
503 or 500, but the stale Content-Range header would still go out and
no longer match. Drop it alongside Content-Length in the shared helper
so all current and future callers are covered.

* strip success-path headers and forward NotFound on prepareWriteFn error

When ProcessRangeRequest writes an error response, the previously-set
success headers (Content-Disposition, ETag, Last-Modified, in addition
to Content-Length/Content-Range) shouldn't ride along on the new body.
With ?dl=1 a stale Content-Disposition would even cause browsers to
save the error message under the object's filename. Strip them all in
the shared helper.

Also forward filer_pb.ErrNotFound through the cache-failure branch so a
mid-cache entry deletion surfaces as 404, not as a 503 retry-loop.
Permanent upstream cloud errors (403/404 from the cloud SDK) still come
back as opaque wrapped strings via FetchAndWriteNeedle and remain
mapped to 503; distinguishing those would need a wider refactor.
2026-04-27 01:58:33 -07:00
Lisandro Pin
93247d6de4 Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.

Allows tracking overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

* Export REST `file_{read,write}_failures` metrics on volume servers.
2026-04-24 11:45:21 -07:00
Chris Lu
3d39324bc1 fix(nfs): make Linux mount -t nfs work without client workaround (#9199) (#9201)
* fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199)

The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and
does not register with portmap. Linux mount.nfs queries portmap on port 111
first, so the plain `mount -t nfs host:/export /mnt` form failed with
"portmap query failed" / "requested NFS version or transport protocol is
not supported" against a default `weed nfs` deployment.

- Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with
  TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and
  proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses.
  Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port.

- New CLI flag `-portmap.bind` (empty, disabled by default) to opt into
  the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE
  and must not collide with a system rpcbind.

- Extended `weed nfs -h` help with the two supported ways to mount from
  Linux (client-side portmap bypass, or server-side `-portmap.bind`).

- Startup log now prints a copy-pasteable mount command tailored to
  whether portmap is enabled.

Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP
round-trip against the real listener.

Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0`
the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock
host:/export /mnt`) now succeeds and both read and write work.

* fix(nfs): harden portmap responder per review feedback (#9201)

Addresses three review findings on the portmap responder:

- parseRPCCall: validate opaque_auth length against the record limit
  before applying the XDR 4-byte padding, so a near-uint32-max authLen
  can no longer overflow (authLen + 3) and bypass the bounds check.
  (gemini-code-assist)

- serveTCP/Close: track live TCP connections and evict them on Close()
  so shutdown does not block on idle clients waiting for the read
  deadline to trip. serveTCP also no longer tears the listener down on
  a non-fatal Accept error (e.g. EMFILE); it logs and retries after a
  small back-off. Replaces the atomic.Bool closed flag with a
  mutex-guarded one so closed, conns, and the shutdown transition stay
  consistent. (coderabbit, minor)

- handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s
  in-flight) so a peer that opens the privileged port 111 and stalls
  cannot pin a goroutine indefinitely. (coderabbit, major)

Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP
connection idle and asserts Close() returns within 2s (well under the
30s idle deadline) and that the client sees the eviction.

All existing tests still pass, including under -race.

* fix(nfs): keep portmap UDP responder alive on transient read errors (#9201)

- serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and
  continue instead of returning. Matches how serveTCP now treats
  non-fatal Accept errors so a transient network blip doesn't take
  UDP portmap down until restart. (coderabbit)

- Rename portmapAcceptBackoff -> portmapRetryBackoff now that both
  paths use it.

- pmapProcDump: fix the pre-allocation capacity to match the actual
  encoding (20 bytes per entry + 4-byte terminator), replacing the
  old over-estimate of 24 per entry. No behavior change; just
  documents intent. (coderabbit nit)

* docs(nfs): clarify encodeAcceptedReply body semantics (#9201)

The prior comment said body is "nil when the accept_stat is itself an
error", which was misleading: the PROG_MISMATCH branch already passes
an 8-byte mismatch_info body. Rewrite to enumerate which error
accept_stat values omit the body and call out PROG_MISMATCH as the
exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit)

* fix(nfs): make portmap retry backoff interruptible by Close() (#9201)

serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a
non-fatal listener error. If Close() races in during that sleep, the
goroutine can't be interrupted, so Close() has to wait out the
remaining backoff before wg.Wait() returns.

Add a done channel that Close() closes once, and replace both
time.Sleep calls with a select on ps.done + time.After. The window
was tiny in practice but the select makes shutdown strictly bounded
by Close()'s own work. (coderabbit nit)
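
The resulting shape, sketched (only the done field is shown):

  // assumes: import "time"
  type portmapServer struct{ done chan struct{} } // closed once by Close()

  const portmapRetryBackoff = 50 * time.Millisecond

  func (ps *portmapServer) backoffOrShutdown() bool {
      select {
      case <-ps.done:
          return false // Close() ran: exit the serve loop
      case <-time.After(portmapRetryBackoff):
          return true // backoff elapsed: retry Accept / ReadFromUDP
      }
  }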
2026-04-23 13:53:53 -07:00
steve.wei
1a7ab2ea82 fix(upload): keep Content-MD5 on 204 unchanged writes (#9198)
Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved.
2026-04-23 10:59:59 -07:00
Chris Lu
592d6d6021 fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) (#9193)
* fix(filer/remote): keep re-cache work alive past caller cancellation (#9174)

For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot
finish before the S3 gateway's initial cache wait elapses. When it
does, the gRPC ctx cancellation cascades into the filer's chunk
downloads, the error path calls DeleteUncommittedChunks on every chunk
already written, and the next retry starts over. boto3 splitting the
GET into concurrent ranges (or any client tear-down on first failure)
shortens the window between retries, so the loop never converges.

Detach the caller's ctx with context.WithoutCancel before invoking
the singleflight work so the download runs to completion regardless
of client cancellations. Subsequent waiters — via the in-flight
singleflight, or a fresh retry landing after completion — observe the
cached entry and stream normally.

Same detach pattern is used in filer_server_handlers_write.go:53 and
volume_server_handlers_write.go:51.

* simplify rationale comment

* switch to DoChan so handler can return on caller cancel

Do keeps the handler goroutine blocked for the full detached download
even after the client is gone. DoChan lets the handler select on
ctx.Done() and exit immediately; the singleflight goroutine continues
on bgCtx and the next request either joins it or finds the entry
cached.
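
A minimal sketch of the combined pattern (helper name and wiring
assumed; g is a golang.org/x/sync/singleflight.Group):

  // assumes: import ("context"; "golang.org/x/sync/singleflight")
  func cacheDetached(ctx context.Context, g *singleflight.Group, key string,
      fill func(context.Context) error) error {
      bgCtx := context.WithoutCancel(ctx) // work survives caller cancellation
      ch := g.DoChan(key, func() (interface{}, error) {
          return nil, fill(bgCtx)
      })
      select {
      case res := <-ch:
          return res.Err // fill finished: ours, or a joined in-flight one
      case <-ctx.Done():
          return ctx.Err() // handler exits; the download keeps running
      }
  }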
2026-04-22 17:56:15 -07:00
Chris Lu
f438cc3544 fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard

ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.

Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.

* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard

ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.

The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.
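
The gate itself is small; a sketch with an assumed lookup shape:

  // assumes: import "fmt"
  type ecStore interface {
      hasMountedEcShards(vid uint32) bool
  }

  func checkReceiveFileTarget(s ecStore, vid uint32, shardId int) error {
      if s.hasMountedEcShards(vid) {
          // os.Create would O_TRUNC an inode live readers still hold open
          return fmt.Errorf("volume %d has mounted EC shards; unmount before resending shard %d",
              vid, shardId)
      }
      return nil // never-mounted targets write as before
  }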

Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
  attempts an overwrite, asserts the error response and that the
  on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
  contract that a first write to a target still succeeds.

* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard

Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
2026-04-22 16:47:01 -07:00
Lisandro Pin
fff243d463 Export gRPC file_{read,write}_failures metrics on volume servers. (#9177)
Allows tracking overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-22 11:22:21 -07:00
Chris Lu
c4e1885053 fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug

The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.

Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.

* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement

Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).

Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
  VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
  shard; sendShardFileToDestination forwards that disk id. Metadata
  files (ecx/ecj/vif) inherit the disk of the first data shard
  targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
  validation; disk_id=0 (unset) falls back to the same
  auto-selection pattern as VolumeEcShardsCopy (prefer disk that
  already has shards for this volume, then any HDD with free space,
  then any location with free space).

Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
  disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
  disk_id rejection path.

* fix(volume-rust): honor disk_id in ReceiveFile for EC shards

Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).

* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow

On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.
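
A sketch of the width-safe check:

  // assumes: import "fmt"
  func locationIndex(locationCount int, diskId uint32) (int, error) {
      // compare in uint32: int(diskId) can wrap negative on 32-bit
      // builds and slip past an int-based bounds check
      if diskId >= uint32(locationCount) {
          return 0, fmt.Errorf("disk_id %d out of range (%d locations)",
              diskId, locationCount)
      }
      return int(diskId), nil
  }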

* test(ec): fail invalid-disk_id test on transport error

Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.

* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test

Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.

* fix(test): preserve baseDir/volume path for single-dir clusters

StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
2026-04-22 10:30:13 -07:00
Chris Lu
7f67995c24 chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was
gated behind -mount.p2p (default true). Idle cost is negligible — a
tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the
surface area.

Removes the flag from weed filer, weed server (-filer.mount.p2p), and
weed mini, and always constructs the registry in NewFilerServer. Also
drops the now-dead nil guards in MountRegister/MountList/sweeper and
the TestMountRegister_DisabledIsNoOp case.
2026-04-21 23:00:11 -07:00
Chris Lu
c40db5a52d perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)
* perf(filer): parallelize StreamMutateEntry with path-keyed scheduler

The server handler processed one mutation at a time per stream, capping a
mount's aggregate throughput at ~1/filer_store_service_time regardless of
client concurrency (see issue #9138). With 12 rclone processes this showed
as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via
unary CreateEntry.

Replace the serial for-loop with a per-request goroutine admitted by a
path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor
(weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind
taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier
and descendant-barrier rules. Cross-path mutations run in parallel; same-
path mutations serialize on arrival order; recursive delete and directory
rename act as subtree barriers; directory attribute bumps stay non-barrier
so they do not serialize file writes under them.

Correctness and safety:
- Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource
  use from a single noisy mount.
- syncStream wrapper serializes stream.Send across worker goroutines (gRPC
  Send is not concurrent-safe).
- Handler waits on in-flight workers before returning on recv EOF/error so
  no worker writes to a torn-down stream.
- First fatal Send error from any worker propagates as the handler's
  return, causing the stream to tear down.

Benchmark (2 ms simulated filer-store service delay, 12 client workers):
  serial    : 440 QPS
  sem only  : 4902 QPS (unsafe — reorders same-path ops)
  scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct)

The sem-only number shows the upper bound of raw parallelism; the
scheduler matches it on distinct paths (the realistic 12-rclone case) and
correctly falls back to serial when the workload demands ordering. Peak
concurrent mutations at the handler equals client worker count on the
distinct-path workload and pins to 1 on the same-path workload, as the
scheduler intends.

* perf(filer): decouple StreamMutateEntry admission from receive loop

The previous StreamMutateEntry handler called sched.Admit directly in the
Recv loop. A single request conflicting on path /hot would head-of-line
block stream.Recv, so later requests targeting unrelated paths could not
be received or admitted until /hot drained — cross-path parallelism then
depended on request ordering instead of being a property of the scheduler.

Spawn the worker goroutine immediately on Recv and move sched.Admit into
that goroutine. A new streamMutatePendingLimit (1024) caps total per-
stream outstanding goroutines (pending + active) so a client flooding a
conflicted path cannot explode goroutine count without bound.

Addresses #9171 review comment (coderabbitai, Major).

* fix(filer): reply with EINVAL on unknown StreamMutateEntry request type

Returning nil when req.Request is a future oneof variant or a malformed
request left the client's per-RequestId waiter blocked forever, because
no response was ever sent for that id. Reply with IsLast=true and EINVAL
so the waiter completes with a well-formed error.

Addresses #9171 review comments (gemini-code-assist, coderabbitai).

* fix(filer): make classifyMutation crash-free and correct for deletes

Two issues addressed together because they share one function:

1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without
   a nil guard; an empty create_request / update_request / rename_request
   from a misbehaving client crashed the scheduler. Guard each oneof
   variant and fall back to a "/" barrier; the handler then sends EINVAL
   via the unknown-request path.

2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry-
   Request does not carry IsDirectory, so the previous kindMutateFile
   classification for non-recursive deletes did not conflict with an in-
   flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same
   path — a race in scheduler terms. Classify every delete as
   kindMutateBarrierDir regardless of IsRecursive. The incremental cost
   of a descendant-wait for a non-recursive delete of a non-empty dir is
   negligible since that call fails at the store anyway.

Adds classifyMutation tests for malformed create/update, empty oneof,
and updates the delete-non-recursive case to the new expected kind.

Addresses #9171 review comments (coderabbitai Critical, Major).

* fix(filer): route renameStreamProxy.SendMsg through the wrapping Send

The default pass-through SendMsg on renameStreamProxy bypassed the
syncStream mutex and the StreamMutateEntryResponse wrapping: anything
the rename helpers happened to push via SendMsg would have been emitted
on the wire as the wrong protobuf type and could interleave with other
workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry
Recv loop and could steal unrelated mutation requests.

Route SendMsg through the wrapping Send (rejecting other payload types)
and fail RecvMsg explicitly — the rename logic is a strictly server-push
stream and never calls RecvMsg, so loud failure is safer than silent
stealing.

Addresses #9171 review comment (coderabbitai, Major).

* test(filer): run exactly ops in stream-mutate workloads

perGoroutine := ops / concurrency silently truncated the total when the
values were not divisible — e.g. 2400 ops with 64 workers actually ran
2368 and with 256 workers ran 2304, making the logged "ops per run"
inaccurate and introducing measurement noise that varied across the
concurrency sweep.

Introduce opsForWorker(g, concurrency, ops) which distributes the
remainder to the first (ops % concurrency) workers so the three
workloads (unary, stream sync, stream async) each dispatch exactly
`ops` operations. No changes to the timing methodology.

Addresses #9171 review comment (coderabbitai, Minor).

* fix(filer): enforce per-path FIFO admission in mutateScheduler

sync.Cond.Broadcast wakes every waiter; the first to re-acquire the
mutex wins, so two conflicting same-path admissions could be reordered
by the Go runtime even though they arrived serially on the stream. A
single stream is supposed to carry ordered mutations — the PR's original
#8770 claim — so admission must be FIFO per path.

Replace the single cond with a per-path FIFO queue. Each Admit enqueues
a waiter on every path it touches (primary, and on rename the secondary
too) and blocks on a ready channel. tryPromoteLocked admits any waiter
that is at the head of every queue it joined, passes pathConflictsLocked
against the active-state indexes, and is under concurrencyLimit. Done
removes the heads and re-runs tryPromoteLocked so waiters freed by the
completion move in arrival order.
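
A compressed sketch of the queueing (this version serializes all
same-path work; the real scheduler additionally checks conflict kinds,
joins multiple queues for renames, and enforces concurrencyLimit):

  // assumes: import "sync"
  type waiter struct{ ready chan struct{} }

  type sched struct {
      mu        sync.Mutex
      pathQueue map[string][]*waiter
  }

  func newSched() *sched { return &sched{pathQueue: map[string][]*waiter{}} }

  func (s *sched) Admit(path string) {
      w := &waiter{ready: make(chan struct{})}
      s.mu.Lock()
      s.pathQueue[path] = append(s.pathQueue[path], w)
      s.tryPromoteLocked(path)
      s.mu.Unlock()
      <-w.ready // blocks until w is the head of its queue
  }

  func (s *sched) Done(path string) {
      s.mu.Lock()
      s.pathQueue[path] = s.pathQueue[path][1:] // pop the finished head
      s.tryPromoteLocked(path)                  // wake the next in arrival order
      s.mu.Unlock()
  }

  func (s *sched) tryPromoteLocked(path string) {
      if q := s.pathQueue[path]; len(q) > 0 {
          select {
          case <-q[0].ready: // already admitted
          default:
              close(q[0].ready)
          }
      }
  }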

Side effect: two non-barrier directory updates on the same path now
serialize instead of overlapping. filer.sync's MetadataProcessor
intentionally allows them to overlap because its events come from a
committed log where last-writer-wins coalescing is safe; streamed
mutations carry client operations whose order matters, so we drop that
optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier
release) and TestAdmitSamePathNonBarrierSerializes to cover both.

Also refreshed the kindMutateFile doc comment that still referenced the
pre-#1ecf805f5 "non-recursive delete" classification.

Addresses #9171 review comments (coderabbitai Critical, Minor).

* test(filer): make TestAdmitSamePathFIFO deterministic without sleeps

The previous arrival-ordering sync (send to `started` before calling
Admit, plus a 1 ms sleep) relied on the goroutine actually entering
Admit and reaching the per-path queue during that sleep. Under -race on
a loaded CI that is a real flake source, which is ironic for a test
whose job is catching non-deterministic wake-ups.

Observe the scheduler's own pathQueue length between spawns instead —
waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected
number of waiters (1 barrier holder + i+1 file waiters) is enqueued.
That's the exact event the test wants to synchronise on, so there is no
fudge factor. Verified by `go test -race -count=5`.

Addresses #9171 review comment (coderabbitai, Minor).
2026-04-21 11:25:09 -07:00
Chris Lu
141413ad76 fix(tests): make tests pass on 32-bit architectures (#9168) (#9170)
Two separate failures reported on 32-bit builds (void-linux 4.21):

- weed/server: errorStreamImpl.count (and the same pattern in slowStream
  plus local totalEventsSent/totalSends) was a bare int64 sitting after
  smaller fields, so on 386/ARMv7/mips32 it landed at a 4-byte-aligned
  offset and atomic.AddInt64 panicked with "unaligned 64-bit atomic
  operation". Switched the counters to atomic.Int64, which Go guarantees
  is 8-byte aligned on every architecture (see the sketch after this
  list).

- weed/plugin/worker/iceberg: three equality-delete tests fail on 32-bit
  because the upstream github.com/apache/iceberg-go declares
  manifestEntry.EqualityIDs as *[]int while the Iceberg Avro schema
  defines equality_ids as long, and hamba/avro refuses to map Go int
  onto Avro long when int is 32-bit. Not fixable in seaweedfs, so guard
  the affected tests with a t.Skip() when unsafe.Sizeof(int) < 8 until
  the upstream type is changed to []int32/[]int64.
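
A sketch of the first fix (struct shape illustrative):

  // assumes: import "sync/atomic"
  type errorStreamImpl struct {
      name  string       // smaller fields may precede the counter freely:
      count atomic.Int64 // atomic.Int64 is 8-byte aligned on every GOARCH
  }

  func (s *errorStreamImpl) bump() int64 { return s.count.Add(1) }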
2026-04-20 22:48:01 -07:00
Chris Lu
e24a443b17 peer chunk sharing 2/8: filer mount registry (#9131)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other (see the sketch below).

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).
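
A sketch of the resulting split (field names assumed):

  // assumes: import ("sync"; "time")
  type mountPeerRegistry struct {
      mu      sync.RWMutex
      entries map[string]time.Time // peer address -> expiry
  }

  func (r *mountPeerRegistry) List(now time.Time) (live []string) {
      r.mu.RLock() // concurrent MountList callers no longer serialize
      defer r.mu.RUnlock()
      for addr, expiry := range r.entries {
          if expiry.After(now) {
              live = append(live, addr) // expired entries skipped, not deleted
          }
      }
      return
  }

  func (r *mountPeerRegistry) Sweep(now time.Time) {
      r.mu.Lock() // the sole reclamation path is the only map mutator
      defer r.mu.Unlock()
      for addr, expiry := range r.entries {
          if !expiry.After(now) {
              delete(r.entries, addr)
          }
      }
  }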
2026-04-18 20:03:23 -07:00
Lisandro Pin
6bcacedda9 Export master_disconnections metrics on volume servers. (#9104)
This allows tracking connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-17 15:15:26 -07:00
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.
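
A hedged sketch of the normalization (the real pb.ServerAddress.Equals
may differ in details): strip any ".grpcPort" suffix that follows the
port before comparing.

  // assumes: import "strings"
  func normalize(addr string) string {
      colon := strings.LastIndexByte(addr, ':')
      if colon < 0 {
          return addr
      }
      if dot := strings.IndexByte(addr[colon:], '.'); dot > 0 {
          return addr[:colon+dot] // "host:9333.19333" -> "host:9333"
      }
      return addr
  }

  func addrEquals(a, b string) bool { return normalize(a) == normalize(b) }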

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Chris Lu
08d9193fe1 [nfs] Add NFS (#9067)
* add filer inode foundation for nfs

* nfs command skeleton

* add filer inode index foundation for nfs

* make nfs inode index hardlink aware

* add nfs filehandle and inode lookup plumbing

* add read-only nfs frontend foundation

* add nfs namespace mutation support

* add chunk-backed nfs write path

* add nfs protocol integration tests

* add stale handle nfs coverage

* complete nfs hardlink and failover coverage

* add nfs export access controls

* add nfs metadata cache invalidation

* fix nfs chunk read lookup routing

* fix nfs review findings and rename regression

* address pr 9067 review comments

- filer_inode: fail fast if the snowflake sequencer cannot start, and let
  operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
  to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
  a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
  after CommitTransaction so a rolled-back rename does not strand live
  metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
  filer.path, so the experimental server does not expose the entire
  filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
  layout rather than RFC 1813 LINK3args

* mount: pre-allocate inode in Mkdir and Symlink

Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.

createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.

* sequence: mask snowflake node id on int→uint32 conversion

CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.

* add test/nfs integration suite

Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.

Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.

Framework notes:

- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
  -port.grpc explicitly so the default port+10000 convention cannot
  overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
  starting the NFS server — the NFS server's ensureIndexedEntry check
  requires the export root to exist with a real entry, which filer.Root
  does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
  not try to re-dial via portmapper (which concatenates ":111" onto the
  address).

* ci: add NFS integration test workflow

Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.

* nfs: use append for buffer growth in Write and Truncate

The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.
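
The pattern, for reference (helper name illustrative):

  func growTo(buf []byte, size int) []byte {
      if delta := size - len(buf); delta > 0 {
          // append reuses spare capacity; make+copy reallocates the
          // whole buffer on every extension (O(N^2) over a write loop)
          buf = append(buf, make([]byte, delta)...)
      }
      return buf
  }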

* filer: honor caller cancellation in collectInodeIndexEntries

Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.

* nfs: skip filer read on open when O_TRUNC is set

openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.

* nfs: allow Seek on O_APPEND files and document buffered write cap

Two related cleanups on filesystem.go:

- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
  Seek error ("append-only file descriptors may only seek to EOF")
  prevented read-and-write workloads that legitimately reposition the
  read cursor. Write already snaps the offset to EOF before persisting
  (see seaweedFile Write), so Seek can unconditionally accept any
  offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
  per-file ceiling, the memory footprint it implies, and that the real
  fix for larger whole-file rewrites is streaming / multi-chunk support.

Both changes flagged by gemini-code-assist on PR 9067.

* nfs: guard offset before casting to int in Write

CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.

* nfs: compute Write endOffset in int64 to satisfy CodeQL

The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.
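
A sketch of the guard order (constant value per the earlier cap):

  const maxBufferedWriteSize = int64(64 << 20) // 64 MiB

  func boundedEnd(offset int64, n int) (int, bool) {
      if offset < 0 || offset > maxBufferedWriteSize ||
          int64(n) > maxBufferedWriteSize {
          return 0, false // caller falls through to billy.ErrNotSupported
      }
      end := offset + int64(n) // safe: both operands <= 64 MiB
      if end > maxBufferedWriteSize {
          return 0, false
      }
      return int(end), true // cast down only after clamping
  }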

* ci: drop emojis from nfs-tests workflow summary

Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.

* nfs: annotate remaining DEV_PLAN TODOs with status

Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:

- Reuse chunk cache and mutation stream helpers without FUSE deps:
  checked off — the NFS server imports weed/filer.ReaderCache and
  weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
  as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
  annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
  go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
  "Current Blockers" note.

* test/nfs: share port+readiness helpers with test/testutil

Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.

Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.

* nfs: stream writes to volume servers instead of buffering the whole file

Before this change the NFS write path held the full contents of every
writable open in memory:

  - OpenFile(write) called loadWritableContent which read the existing
    file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
  - each Write() extended content in-place
  - Close() uploaded the whole buffer as a single chunk via
    persistContent + AssignVolume

The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:

  - openFile(write) no longer loads the existing content at all; it
    only issues an UpdateEntry when O_TRUNC is set *and* the file is
    non-empty (so a fresh create+trunc is still zero-RPC)
  - Write() streams the caller's bytes straight to a volume server via
    one AssignVolume + one chunk upload, then atomically appends the
    resulting chunk to the filer entry through mutateEntry. Any
    previously inlined entry.Content is migrated to a chunk in the same
    update so the chunk list becomes the authoritative representation.
  - Truncate() becomes a direct mutateEntry (drop chunks past the new
    size, clip inline content, update FileSize) instead of resizing an
    in-memory buffer.
  - Close() is a no-op because everything was flushed inline.

The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.

Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).

* filer: extract shared gateway upload helper for NFS and WebDAV

Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.

- NFS's saveDataAsChunk is now a thin adapter that assembles the
  GatewayChunkUploadRequest from server options and calls the helper.
  The chunkUploader interface keeps working for test injection because
  the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
  local operation.NewUploader call plus the AssignVolume/UploadOption
  scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
  features that do not fit the shared helper (a pre-allocated file-id
  pool used to skip AssignVolume entirely, and a chunkCache
  write-through at offset 0 so future reads hit the mount's local
  cache), both of which are mount-specific.

Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.

* nfs: remove DESIGN.md and DEV_PLAN.md

The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
2026-04-14 20:48:24 -07:00
Lisandro Pin
67a2810d2d Export start_time_seconds metrics on both master & volume servers. (#9046)
These are to be used to track uptimes.

See https://github.com/seaweedfs/seaweedfs/issues/8535 for details.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-13 09:34:08 -07:00
Chris Lu
edf7d2a074 fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039)
* fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035)

Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to
readPersistedLogBufferPosition, causing LoopProcessLogData to call
ReadPersistedLogBuffer on every 250ms health-check tick when a
subscriber encounters ResumeFromDiskError.  Each call creates an
OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a
readahead goroutine with a 1024-element channel, finds no data, and
returns — 4 times per second even on an idle filer.

This is redundant because SubscribeLocalMetadata already manages disk
reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs
tracking in the outer loop.

Set ReadFromDiskFn back to nil for LocalMetaLogBuffer.  When
LoopProcessLogData encounters ResumeFromDiskError with nil
ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the
caller (SubscribeLocalMetadata), which blocks efficiently on
listenersCond.Wait() instead of polling.

* fix(filer): add gap detection for slow consumers after disk-read stall

When a slow consumer falls behind and LoopProcessLogData returns
ResumeFromDiskError with no flush or read-position progress, there may
be a gap between persisted data and in-memory data (e.g. writes stopped
while consumer was still catching up). Without this, the consumer would
block on listenersCond.Wait() forever.

Skip forward to the earliest in-memory time to resume progress, matching
the gap-handling pattern already used in the shouldReadFromDisk path.

* fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall

The gap-detection block added in the previous commit skips lastReadTime
forward to GetEarliestTime() and continues the outer loop.  On the next
iteration, shouldReadFromDisk becomes true (currentReadTsNs >
lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the
existing gap handler at the top of the loop runs its own gap check.
That check uses readInMemoryLogErr == ResumeFromDiskError as the entry
condition — but readInMemoryLogErr is still the stale error from two
iterations ago.  GetEarliestTime() now equals lastReadTime.Time (we
already advanced to it), so earliestTime.After(lastReadTime.Time) is
false and the handler falls into listenersCond.Wait() — stuck.

Clear readInMemoryLogErr at the gap-skip point, matching the existing
pattern at the earlier gap handler that already clears it for the same
reason.

* fix(log_buffer): GetEarliestTime must include sealed prev buffers

GetEarliestTime previously returned only logBuffer.startTime (the active
buffer's first timestamp).  That is narrower than ReadFromBuffer's
tsMemory, which is the min across active + prev buffers.  Callers using
GetEarliestTime for gap detection after ResumeFromDiskError (the
SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in
the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time
that was *newer* than the real earliest in-memory data.

Impact in SubscribeLocalMetadata's slow-consumer path:
  - tsMemory = earliest prev buffer time (T_prev)
  - GetEarliestTime() = active startTime (T_active, later than T_prev)
  - Consumer position = T1, with T_prev < T1 < T_active
  - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory)
  - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true
  - Skip forward to T_active -- silently drops the prev-buffer data
  - And when T_active happens to equal the stuck position, gap detect
    evaluates false, and the subscriber stalls on listenersCond.Wait()

This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing
failure in CI where the consumer stalled at 10220/20000 after writing
stopped -- the buffer still had data in prev[0..3], but gap detection
was comparing against the active buffer's startTime.

Fix: scan all sealed prev buffers under RLock, return the true minimum
startTime.  Matches the min-of-buffers logic in ReadFromBuffer.
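
A minimal sketch of the corrected scan, with stand-in types; the real
LogBuffer fields may be named differently:

```go
import (
	"sync"
	"time"
)

// minimal stand-ins for the real structs; field names are assumptions
type memBuffer struct {
	startTime time.Time
	size      int
}

type LogBuffer struct {
	sync.RWMutex
	startTime   time.Time   // active buffer's first timestamp
	prevBuffers []memBuffer // sealed buffers; the oldest data may live here
}

// GetEarliestTime returns the oldest timestamp still held in memory,
// across the active buffer and all sealed prev buffers (the same
// minimum ReadFromBuffer computes as tsMemory).
func (lb *LogBuffer) GetEarliestTime() time.Time {
	lb.RLock()
	defer lb.RUnlock()
	earliest := lb.startTime
	for _, prev := range lb.prevBuffers {
		if prev.size == 0 {
			continue // sealed slot holds no data
		}
		if earliest.IsZero() || prev.startTime.Before(earliest) {
			earliest = prev.startTime
		}
	}
	return earliest
}
```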

* test(log_buffer): make DiskReadRetry test deterministic

The previous test added the message via AddToBuffer + ForceFlush and
relied on a race: the second disk read had to happen before the data
was delivered through the in-memory path.  Under the race detector or
on a slow CI runner, the reader is woken by AddToBuffer's notification,
finds the data in the active buffer or its prev slot, and returns after
exactly one disk read — failing the >= 2 disk reads assertion even
though the loop behaved correctly.

Reproduced on master with race detector (2/5 failures).

Rewrite the test to deliver the data exclusively through the disk-read
path: no AddToBuffer, no ForceFlush.  The test waits until the reader
has issued at least one no-op disk read, then atomically flips a
"dataReady" flag.  The reader's next iteration through readFromDiskFn
returns the entry.  This deterministically exercises the retry-loop
behavior the test was originally written to protect, and removes the
in-memory delivery race entirely.
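
The shape of that determinism, as a sketch with illustrative names (not
the actual test code):

```go
import (
	"sync/atomic"
	"time"
)

// the two knobs the rewritten test controls
var (
	dataReady atomic.Bool  // flipped only after a no-op disk read is seen
	diskReads atomic.Int64 // counts every trip through the disk-read path
)

// stands in for readFromDiskFn: returns nothing until the flag flips
func readFromDisk() (entry string, found bool) {
	diskReads.Add(1)
	if !dataReady.Load() {
		return "", false
	}
	return "persisted-entry", true
}

// waitThenPublish is the test side: require at least one no-op disk
// read before publishing, which forces the reader's retry loop to make
// a second disk read and keeps the >= 2 disk-reads assertion deterministic.
func waitThenPublish() {
	for diskReads.Load() == 0 {
		time.Sleep(time.Millisecond)
	}
	dataReady.Store(true)
}
```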
2026-04-11 23:12:54 -07:00
os-pradipbabar
9cae95d749 fix(filer): prevent data corruption during graceful shutdown (#9037)
* fix: wait for in-flight uploads to complete before filer shutdown

Prevents data corruption when SIGTERM is received during active uploads.
The filer now waits for all in-flight operations to complete before
calling the underlying shutdown logic.

This affects all deployment types (Kubernetes, Docker, systemd) and
fixes corruption issues during rolling updates, certificate rotation,
and manual restarts.

Changes:
- Add FilerServer.Shutdown() method with upload wait logic
- Update grace.OnInterrupt hook to use new shutdown method

Fixes data corruption reported by production users during pod restarts.

* fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete

* fix: address review comments on graceful shutdown

- Add a 10s timeout to gRPC GracefulStop to prevent indefinite blocking
  from long-lived streams (falls back to Stop on timeout; see the sketch
  after this list)
- Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within
  Kubernetes default 30s termination grace period
- Move fs.Shutdown() (database close) after Serve() returns instead
  of a separate hook to eliminate race where main goroutine exits
  before the shutdown hook runs
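
The timeout-bounded GracefulStop is a stock grpc-go pattern; a minimal
sketch, with only the 10s figure taken from the commit:

```go
import (
	"time"

	"google.golang.org/grpc"
)

// stopGrpcServer drains in-flight RPCs, but force-stops after 10s so a
// long-lived stream can never block shutdown indefinitely.
func stopGrpcServer(srv *grpc.Server) {
	stopped := make(chan struct{})
	go func() {
		srv.GracefulStop() // waits for active RPCs and streams to finish
		close(stopped)
	}()
	select {
	case <-stopped:
	case <-time.After(10 * time.Second):
		srv.Stop() // streams never drained; close connections immediately
	}
}
```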

* fix: shut down all HTTP servers before filer database close

Address remaining review comments:
- Shut down auxiliary HTTP servers (Unix socket, local listener) during
  graceful shutdown so they can't serve write traffic after the main
  server stops
- Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it
  completes before os.Exit(0), fixing the race between the grace
  goroutine and the main goroutine
- Use sync.Once to ensure fs.Shutdown() runs exactly once regardless
  of whether shutdown is signal-driven or context-driven (MiniCluster)

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-11 21:18:22 -07:00
Chris Lu
b37bbf541a feat(master): drain pending size before marking volume readonly (#9036)
* feat(master): drain pending size before marking volume readonly

When vacuum, volume move, or EC encoding marks a volume readonly,
in-flight assigned bytes may still be pending. This adds a drain step:
immediately remove the volume from the writable list (stop new assigns),
then wait for pending bytes to decay below 4MB, with a 30s timeout.

- Add volumeSizeTracking struct consolidating effectiveSize,
  reportedSize, and compactRevision into a single map
- Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable,
  DrainAndSetVolumeReadOnly to VolumeLayout
- UpdateVolumeSize detects compaction via compactRevision change and
  resets effectiveSize instead of decaying
- Wire drain into vacuum (topology_vacuum.go) and volume mark readonly
  (master_grpc_server_volume.go)

* fix: use 2MB pending size drain threshold

* fix: check crowded state on initial UpdateVolumeSize registration

* fix: respect context cancellation in drain, relax test timing

- DrainAndSetVolumeReadOnly now accepts context.Context and returns
  early on cancellation (for gRPC handler timeout/cancel)
- waitForPendingDrain uses select on ctx.Done instead of time.Sleep
  (sketched below)
- Increase concurrent heartbeat test timeout from 10s to 15s for CI
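
A sketch of what a ctx-aware drain wait like this can look like;
GetPendingSize, the 30s window, and the 2MB threshold are from the
commits above, while the polling interval and signature are assumptions:

```go
import (
	"context"
	"time"
)

// a stand-in for VolumeLayout.GetPendingSize from the commit; the real
// method reads the volumeSizeTracking map
type pendingSizeFn func() int64

func waitForPendingDrain(ctx context.Context, pending pendingSizeFn) bool {
	const drainThreshold = 2 * 1024 * 1024 // 2MB, per the follow-up fix
	ticker := time.NewTicker(500 * time.Millisecond) // interval assumed
	defer ticker.Stop()
	deadline := time.NewTimer(30 * time.Second) // drain window from the commit
	defer deadline.Stop()
	for {
		if pending() < drainThreshold {
			return true // pending bytes decayed; safe to mark readonly
		}
		select {
		case <-ctx.Done():
			return false // gRPC handler timed out or was cancelled
		case <-deadline.C:
			return false // drain window expired; caller logs and proceeds
		case <-ticker.C:
		}
	}
}
```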

* fix: use time-based dedup so decay runs even when reported size is unchanged

The value-based dedup (same reportedSize + compactRevision = skip) prevented
decay from running when pending bytes existed but no writes had landed on
disk yet. The reported size stayed the same across heartbeats, so the excess
never decayed.

Fix: dedup replicas within the same heartbeat cycle using a 2-second time
window instead of comparing values. This allows decay to run once per
heartbeat cycle even when the reported size is unchanged.
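
A sketch of the time-window dedup; volumeSizeTracking is named above,
while the field and method are assumptions:

```go
import "time"

// stand-in for the per-volume tracking entry named in the commit;
// lastProcessed is an assumed field
type volumeSizeTracking struct {
	lastProcessed time.Time
}

// shouldProcess dedups replica reports by time, not value: at most one
// decay pass per heartbeat cycle, even when reportedSize is unchanged.
func (t *volumeSizeTracking) shouldProcess(now time.Time) bool {
	if now.Sub(t.lastProcessed) < 2*time.Second {
		return false // another replica already reported this cycle
	}
	t.lastProcessed = now
	return true
}
```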

Also confirmed finding 1 (draining re-add race) is a false positive:
- Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes
- Move/EC: readonlyVolumes flag prevents re-adding during drain

* fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout

The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up
to 30s waiting for pending bytes to decay. In integration tests (and
real clusters during EC encoding), this caused timeouts because multiple
volumes are marked readonly sequentially and heartbeats may not arrive
fast enough to decay pending within the drain window.

Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops
new assigns) and only logs a warning if pending bytes remain. The drain
wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs
inside the master's own goroutine pool.

Remove DrainAndSetVolumeReadOnly as it's no longer used.

* fix: relax test timing, rename test, add post-condition assert

* test: add vacuum integration tests with CI workflow

Full-cluster integration test for vacuum, modeled on the EC integration
tests. Starts a real master + 2 volume servers, uploads data, deletes
entries to create garbage, runs volume.vacuum via shell command, and
verifies garbage cleanup and data integrity.

Test flow:
1. Start cluster (master + 2 volume servers)
2. Upload 10 files to create volume with data
3. Delete 5 files to create ~50% garbage
4. Verify garbage ratio > 10%
5. Run volume.vacuum command
6. Verify garbage cleaned up
7. Verify remaining 5 files are still accessible

CI workflow runs on push/PR to master with 15-minute timeout.
Log collection on failure via artifact upload.

* fix: use 500KB files and delete 75% to exceed vacuum garbage threshold

* fix: add shell lock before vacuum command, fix compilation error

* fix: strengthen vacuum integration test assertions

- waitForServer: use net.DialTimeout instead of grpc.NewClient for a
  real TCP readiness check
- verify_garbage_before_vacuum: t.Fatal instead of warning when no
  garbage detected
- verify_cleanup_after_vacuum: t.Fatal if no server reported the
  volume or cleanup wasn't verified
- verify_remaining_data: read actual file contents via HTTP and
  compare byte-for-byte against original uploaded payloads

* fix: use http.Client with timeout and close body before retry
2026-04-11 18:29:11 -07:00
Chris Lu
10b0bdce02 feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment

Add expected_data_size field to AssignRequest (master proto) and
AssignVolumeRequest (filer proto) so clients can hint how large the
data will be. The master uses this instead of the 1MB default when
tracking pending volume sizes for weighted assignment.

- Add expected_data_size to master.proto AssignRequest
- Add expected_data_size to filer.proto AssignVolumeRequest
- Wire through filer AssignVolume handler
- Wire through HTTP submit handler (uses actual upload size)
- Add ExpectedDataSize to VolumeAssignRequest in operation package
- Topology.PickForWrite accepts optional expectedDataSize parameter

* fix: guard integer conversions in expected_data_size path

- common.go: clamp OriginalDataSize to non-negative before uint64 cast
- topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast
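
Spelled out, the two guards might look like this (helper names are
illustrative, not the actual code):

```go
import "math"

// client side: guard the int64 -> uint64 cast (content length may be -1)
func clampToUint64(n int64) uint64 {
	if n < 0 {
		return 0 // unknown/negative size: fall back to "no hint"
	}
	return uint64(n)
}

// master side: guard the uint64 -> int64 cast before pending-size math
func capToInt64(n uint64) int64 {
	if n > math.MaxInt64 {
		return math.MaxInt64
	}
	return int64(n)
}
```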

* fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize

- HTTP /dir/assign now parses optional "dataSize" query parameter
  and passes it to PickForWrite instead of hardcoded 0
- Add test assertion for PickForWrite with non-zero expectedDataSize
2026-04-11 11:30:47 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
6f036c7015 fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock (#8998)
* fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock

When fastResume is active (single-master + resumeState + non-empty log),
the raft server becomes leader within ~1ms. DoJoinCommand then enters
the leaderLoop's processCommand path, which calls setCommitIndex to
commit all pending entries. The goraft setCommitIndex implementation
returns early when it encounters a JoinCommand entry (to recalculate
quorum), which can prevent the new entry's event channel from being
notified — leaving DoJoinCommand blocked forever.

Each restart appends a new raft:join entry to the log, while the conf
file's commitIndex (only persisted on AddPeer) lags behind. After 3-4
restarts the uncommitted range contains old JoinCommand entries that
trigger the early return before the new entry is reached.

Fix: skip DoJoinCommand when the raft log already has entries (the
server was already joined in a previous run). The fastResume mechanism
handles leader election independently.

* fix(master): handle Hashicorp Raft in HasExistingState

Add Hashicorp Raft support to HasExistingState by checking
AppliedIndex, consistent with how other RaftServer methods
handle both raft implementations.

* fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft

AppliedIndex() reflects in-memory FSM state which starts at 0 before
log replay completes. LastIndex() reads from persisted stable storage,
correctly mirroring the non-Hashicorp IsLogEmpty() check.
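
A sketch of the resulting check, with stand-in struct fields; LastIndex
and IsLogEmpty are the APIs named above, the wrapper shape is assumed:

```go
import (
	hashicorpRaft "github.com/hashicorp/raft"
	goraft "github.com/seaweedfs/raft"
)

// minimal stand-in: the real RaftServer wraps both implementations
type RaftServer struct {
	RaftHashicorp *hashicorpRaft.Raft // nil when running classic goraft
	raftServer    goraft.Server
}

func (s *RaftServer) HasExistingState() bool {
	if s.RaftHashicorp != nil {
		// LastIndex reads persisted stable storage; AppliedIndex would
		// report the in-memory FSM, which is still 0 before log replay.
		return s.RaftHashicorp.LastIndex() > 0
	}
	return !s.raftServer.IsLogEmpty()
}
```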
2026-04-08 21:08:50 -07:00
Lars Lehtonen
8edadf7f4a chore(weed/server): prune unused unexported struct fields (#8980) 2026-04-07 21:24:30 -07:00
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.
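
In spirit, the walk looks like this: a sketch assuming the classic
16-byte .idx entry layout (8-byte needle id, 4-byte offset in 8-byte
units, 4-byte size, big-endian); the tombstone test and the overhead
constant are illustrative, not the exact helpers:

```go
import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

// assumed per-needle overhead (header + checksum); the real code derives
// this via GetActualSize, which also pads to 8-byte alignment
const needleOverhead = 16 + 4

func verifyDatIdxConsistency(datPath, idxPath string) error {
	datInfo, err := os.Stat(datPath)
	if err != nil {
		return err
	}
	idxBytes, err := os.ReadFile(idxPath)
	if err != nil {
		return err
	}
	const entrySize = 16 // 8-byte needle id + 4-byte offset + 4-byte size
	for i := 0; i+entrySize <= len(idxBytes); i += entrySize {
		// offsets are stored big-endian, in 8-byte units
		offset := int64(binary.BigEndian.Uint32(idxBytes[i+8 : i+12])) * 8
		size := binary.BigEndian.Uint32(idxBytes[i+12 : i+16])
		if offset == 0 || size == math.MaxUint32 {
			continue // deleted/tombstone entries reference no live data
		}
		if end := offset + int64(size) + needleOverhead; end > datInfo.Size() {
			return fmt.Errorf("idx entry %d: needle [%d, %d) extends past .dat size %d",
				i/entrySize, offset, end, datInfo.Size())
		}
	}
	return nil
}
```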

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this: the needle would extend past the end of the
.dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered;
  see the sketch after this list)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"
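
A sketch of the per-item aging check referenced above, assuming a FIFO
queue of (path, enqueuedAt) pairs; names are illustrative:

```go
import "time"

type folderItem struct {
	path       string
	enqueuedAt time.Time
}

// drainAged pops only the items that have individually aged past the
// delay; the queue is FIFO by enqueue time, so once the head is too
// young, every later item is too.
func drainAged(queue []folderItem, delay time.Duration, now time.Time) (aged, rest []folderItem) {
	for i, item := range queue {
		if now.Sub(item.enqueuedAt) < delay {
			return queue[:i], queue[i:]
		}
	}
	return queue, nil
}
```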

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries, so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
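
Put together across the three fixes above, the loop can be sketched like
this; SetElectionTimeout, State, and the raft.Leader constant are goraft
APIs, while the intervals, bound, and structure are assumptions:

```go
import (
	"time"

	raft "github.com/seaweedfs/raft"
)

// fastResume drops the election timeout so a single resumed master
// self-elects near-instantly, then restores the original value; the
// poll is a ticker (not time.After) and is bounded by a deadline.
func fastResume(s raft.Server, original time.Duration) {
	s.SetElectionTimeout(time.Millisecond)
	defer s.SetElectionTimeout(original) // always restore, even on timeout

	ticker := time.NewTicker(10 * time.Millisecond) // interval assumed
	defer ticker.Stop()
	deadline := time.NewTimer(30 * time.Second) // bound assumed
	defer deadline.Stop()

	for {
		select {
		case <-ticker.C:
			if s.State() == raft.Leader {
				return
			}
		case <-deadline.C:
			return // never spin forever if election doesn't succeed
		}
	}
}
```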
2026-04-04 14:15:56 -07:00
Chris Lu
896114d330 fix(admin): fix master leader link showing incorrect port in Admin UI (#8924)
fix(admin): use gRPC address for current server in RaftListClusterServers

The old Raft implementation was returning the HTTP address
(ms.option.Master) for the current server, while peers used gRPC
addresses (peer.ConnectionString). The Admin UI's GetClusterMasters()
converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress
(port - 10000), which produced a negative port (-667) for the current
server since its address was already in HTTP format (port 9333).

Use ToGrpcAddress() for consistency with both HashicorpRaft (which
stores gRPC addresses) and old Raft peers.

Fixes #8921
2026-04-04 11:50:43 -07:00