1882 Commits

Chris Lu
3a8389cd68 fix(ec): verify full shard set before deleting source volume (#9490) (#9493)
* fix(ec): verify full shard set before deleting source volume (#9490)

Before this change, both the worker EC task and the shell ec.encode
command would delete the source .dat as soon as MountEcShards returned —
even if distribute/mount failed partway, leaving fewer than 14 shards
in the cluster. The deletion was logged at V(2), so by the time someone
noticed missing data the only trace was a 0-byte .dat synthesized by
disk_location at next restart.

- Worker path adds Step 6: poll VolumeEcShardsInfo on every destination,
  union the bitmaps, and refuse to call deleteOriginalVolume unless all
  TotalShardsCount distinct shard ids are observed. A failed gate leaves
  the source readonly so the next detection scan can retry.
- Shell ec.encode adds the same gate after EcBalance, walking the master
  topology with collectEcNodeShardsInfo.
- VolumeDelete RPC success and .dat/.idx unlinks now log at V(0) so any
  source destruction is traceable in default-verbosity production logs.
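
In sketch form, the Step 6 gate from the first bullet looks roughly like
this (names invented; queryShardIds stands in for a VolumeEcShardsInfo
round-trip, and the query-error policy is assumed to fail closed):

    package ec

    import (
        "fmt"
        "math/bits"
    )

    // requireFullShardSet unions the shard ids reported by every destination
    // and refuses source deletion unless all distinct ids are observed.
    // Sketch only: assumes totalShards <= 32 so a uint32 bitmap suffices.
    func requireFullShardSet(servers []string, vid uint32, totalShards int,
        queryShardIds func(server string, vid uint32) ([]uint32, error)) error {
        var seen uint32 // bitmap over shard ids 0..totalShards-1
        for _, srv := range servers {
            ids, err := queryShardIds(srv, vid)
            if err != nil {
                // assumed policy: a failed query blocks deletion (fail closed)
                return fmt.Errorf("query %s: %w", srv, err)
            }
            for _, id := range ids {
                if int(id) < totalShards { // mirror the shard-id bound before setting a bit
                    seen |= 1 << id
                }
            }
        }
        if got := bits.OnesCount32(seen); got < totalShards {
            return fmt.Errorf("saw %d of %d shards; keeping source volume read-only", got, totalShards)
        }
        return nil
    }

The OSS callers pass 14 for totalShards; see the parametrization commit
below.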

The EC-balance-vs-in-flight-encode race is intentionally left for a
follow-up; balance should refuse to move shards for a volume whose
encode job is not in Completed state.

* fix(ec): trim doc comments on the new shard-verification path

Drop WHAT-describing godoc on freshly added helpers; keep only the WHY
notes (query-error policy in VerifyShardsAcrossServers, the #9490
reference at the call sites).

* fix(ec): drop issue-number anchors from new comments

Issue references age poorly — the why behind each comment already
stands on its own.

* fix(ec): parametrize RequireFullShardSet on totalShards

Take totalShards as an argument instead of reading the package-level
TotalShardsCount constant. The OSS callers continue to pass 14, but the
helper is now usable with any DataShards+ParityShards ratio.

* test(plugin_workers): make fake volume server respond to VolumeEcShardsInfo

The new pre-delete verification gate calls VolumeEcShardsInfo on every
destination after mount, and the fake server's UnimplementedVolumeServer
returns Unimplemented — the verifier read that as zero shards on every
node and aborted source deletion. Build the response from recorded
mount requests so the integration test exercises the gate end-to-end.

* fix(rust/volume): log .dat/.idx unlink with size in remove_volume_files

Mirror the Go-side change in weed/storage/volume_write.go: stat each
file before removing and emit an info-level log for .dat/.idx so a
destructive call is always traceable. The OSS Rust crate previously
unlinked them silently.

* fix(ec/decode): verify regenerated .dat before deleting EC shards

After mountDecodedVolume succeeds, the previous code immediately
unmounts and deletes every EC shard. A silent failure in generate or
mount could leave the cluster with neither shards nor a valid normal
volume. Probe ReadVolumeFileStatus on the target and refuse to proceed
if dat or idx is 0 bytes.

Also make the fake volume server's VolumeEcShardsInfo reflect whichever
shard files exist on disk (seeded for tests as well as mounted via
RPC), so the new gate can be exercised end-to-end.

* fix(ec): address PR review nits in verification + fake server

- Drop unused ServerShardInventory.Sizes field.
- Skip shard ids >= MaxShardCount before bitmap Set so the ShardBits
  bound is explicit (Set already no-ops on overflow, this is for
  clarity).
- Nil-guard the fake server's VolumeEcShardsInfo so a malformed call
  doesn't panic the test process.
2026-05-13 19:29:24 -07:00
Chris Lu
d5c0a7b153 fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response

LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume
builds Location with only Url / PublicUrl / DataCenter, so callers fall back to
ServerToGrpcAddress (httpPort + 10000). On any deployment where that
convention does not hold — multi-disk integration tests, custom port layouts
— EC reads dial the wrong port and quietly degrade to parity recovery.

* fix(volume/ec): probe every DiskLocation when serving local shard reads

reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the
DiskLocation that physically owns it, so a multi-disk volume server can hold
shards for the same vid in two separate ecVolumes — one per disk — with .ecx
on whichever disk owned the original .dat. The read path only consulted the
single EcVolume FindEcVolume picked, so requests for shards on the sibling
disk fell through to errShardNotLocal and then to remote/loopback recovery.

Walk all DiskLocations after the first probe in both readLocalEcShardInterval
and the VolumeEcShardRead gRPC handler; the latter also covers the loopback
that recoverOneRemoteEcShardInterval falls back to when a peer dial fails.

* test(volume/ec): cover the multi-disk EC lifecycle end-to-end

Two integration tests against a real volume server with two data dirs:

TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read ->
drop .dat -> stop -> redistribute shards across disks -> restart -> verify
reconcileEcShardsAcrossDisks attached the orphan shards and reads still
work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild
pulls input from both disks -> reads still work.

TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478
reproducer at the cluster level: seed a healthy .dat on disk 0, plant the
on-disk footprint of an interrupted EC encode on disk 1, restart, and assert
pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0.

Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous
run's volume.log is rotated to volume.log.previous so a startup regression on
the second run does not lose the first run's diagnostics.

* review: trim verbose comments

* review: drop racy fast-path, use locked findEcShard directly

gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval
and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads
the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent
mount / unmount could race with it. findEcShard already walks every
DiskLocation under the right lock, so the fast-path adds nothing but the
race. Collapse both call sites to a single locked call.
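
A rough sketch of the collapsed lookup (field and lock names invented;
the real method is findEcShard):

    package storage

    import "sync"

    type ecShard struct{ ShardId uint8 }
    type ecVolume struct{ Shards []*ecShard }
    type diskLocation struct{ ecVolumes map[uint32]*ecVolume }

    type store struct {
        ecVolumesLock sync.RWMutex
        locations     []*diskLocation
    }

    func (s *store) findEcShard(vid uint32, shardId uint8) (*ecShard, bool) {
        s.ecVolumesLock.RLock() // one lock spans the whole scan, so mount/unmount cannot race it
        defer s.ecVolumesLock.RUnlock()
        for _, loc := range s.locations { // probe every DiskLocation, not just the first ecVolume hit
            ev, ok := loc.ecVolumes[vid]
            if !ok {
                continue
            }
            for _, sh := range ev.Shards {
                if sh.ShardId == shardId {
                    return sh, true
                }
            }
        }
        return nil, false
    }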

Also note in RestartVolumeServer why the log-rotation error is swallowed:
absence on first call is benign; anything else surfaces in the next
os.Create in startVolume.
2026-05-13 13:56:20 -07:00
Chris Lu
f51468cf73 Revert #9443 — heartbeat peer binding breaks hostname-based clusters (#9474)
Revert "master: bind heartbeat claims to the connecting peer (#9443)"

This reverts commit f28c7ce6df.

The strict heartbeat-ip-vs-peer match in authorizeHeartbeatPeer rejects
every hostname-based deployment. In docker-compose / k8s the volume
server is started with -ip=<service-name> and the gRPC peer surfaces
as the container/pod IP, so the two never match and every heartbeat
fails with `heartbeat ip "volume" does not match peer "172.18.0.3"`.
The master therefore never learns about any volume, growth fails, and
fio writes against the mount return EIO.

After the #9440 revert merged (43a8c4fdc), the e2e workflow is still
failing for this reason; see
https://github.com/seaweedfs/seaweedfs/actions/runs/25767265775 .

Reverting to unblock e2e. A narrower re-do should accept the heartbeat
when heartbeat.Ip resolves (DNS) to the peer address, so the spoof
hardening can return without breaking hostname-based clusters.
2026-05-12 18:22:21 -07:00
Chris Lu
43a8c4fdca Revert #9440 — volume admin fail-closed gate breaks multi-host clusters (#9472)
* Revert "volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)"

This reverts commit 21054b6c18.

The fail-closed gate broke any multi-host cluster: in compose / k8s /
remote-host deployments the master's IP isn't loopback, so every
master->volume admin RPC (AllocateVolume, BatchDelete, EC reroute,
vacuum, scrub, ...) is rejected with PermissionDenied unless the
operator manually configures -whiteList. The e2e workflow has been
failing since 10cc06333 with `not authorized: 172.18.0.2` on
AllocateVolume; downstream symptom is fio fsync EIO because zero
volumes can be grown.

The gate's intent was to lock down destructive admin tooling, but the
same RPCs are the master's normal mechanism for growing and managing
volumes. Reverting to restore cluster-internal operation; a narrower
re-do should distinguish operator/admin callers from the master peer
(e.g. trust IPs resolved from -master) before going back in.

* security: skip invalid CIDR in UpdateWhiteList so IsWhiteListed can't panic

The revert in the previous commit also rolled back an unrelated bug fix
that lived inside #9440: UpdateWhiteList logged on net.ParseCIDR error
but did not continue, so the nil *net.IPNet was stored in whiteListCIDR
and IsWhiteListed would panic dereferencing cidrnet.Contains(remote) on
the next gRPC admin check.

Restore the continue. Orthogonal to the fail-closed semantics this PR
is reverting.
2026-05-12 16:00:44 -07:00
Chris Lu
f28c7ce6df master: bind heartbeat claims to the connecting peer (#9443)
SendHeartbeat used to accept whatever Ip/Port/Volumes the caller put on
the wire. Three changes tighten that:

- Reject heartbeats whose Ip does not match the gRPC peer's source
  address. Loopback peers are still trusted; operators behind a proxy
  can opt out with -master.allowUntrustedHeartbeat.
- Track which (ip, port) first claimed a volume id or an ec shard slot
  and drop foreign re-claims. Non-EC volume claims are bounded by the
  replica copy count so legitimate replicas still register. EC
  ownership is keyed by (vid, shard_id) so the same vid can legitimately
  be split across many peers as long as their EcIndexBits are disjoint;
  rejected bits are cleared from the bitmap and the parallel ShardSizes
  array is compacted in lock-step.
- Maintain reverse indexes owner -> volumes and owner -> ec shard slots
  so disconnect cleanup is O(M) in what that peer held rather than O(N)
  over the whole map.
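
Condensed, the EC claim bookkeeping from the last two bullets looks
roughly like this (types invented; the real code also bounds non-EC
volume claims by the replica copy count, and the revert higher up in
this log rolls the whole mechanism back):

    package topology

    type owner struct {
        ip   string
        port int
    }

    type ecSlot struct {
        vid     uint32
        shardId uint8
    }

    type bindings struct {
        ecOwner map[ecSlot]owner   // first claimant wins per (vid, shard_id)
        byOwner map[owner][]ecSlot // reverse index: disconnect cleanup is O(M) in what the peer held
    }

    // claimEcShard reports whether the claim is accepted; on a foreign
    // re-claim the caller clears the bit from EcIndexBits and compacts
    // the parallel ShardSizes array.
    func (b *bindings) claimEcShard(o owner, s ecSlot) bool {
        if prev, ok := b.ecOwner[s]; ok {
            return prev == o
        }
        b.ecOwner[s] = o
        b.byOwner[o] = append(b.byOwner[o], s)
        return true
    }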

Bindings are also released when a heartbeat reports that the peer no
longer holds an id, either via explicit Deleted{Volumes,EcShards}
entries or by omitting it from a full snapshot. Without this, a planned
rebalance that moved a vid or an ec shard from peer A to peer B would
leave B's heartbeats permanently filtered out until A disconnected,
breaking ec encode/decode flows that delete shards on the source as
soon as the move completes.

The (vid -> owners) binding still does not track which replica slot
each peer occupies, so the first N claims under the copy count win;
strict per-slot mapping is a follow-up.
2026-05-12 15:38:52 -07:00
Chris Lu
10cc06333b cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate
each server's Ping handler on cluster membership: masters check the
topology, registered cluster nodes, and configured master peers; volume
servers only accept their seed/current masters; filers accept tracked
peer filers, the master-learned volume server set, and configured
masters.

Use address-indexed peer lookups to keep Ping target validation O(1):
- topology maintains a pb.ServerAddress -> *DataNode index alongside
  the dc/rack/node tree, kept in sync from doLinkChildNode and
  UnlinkChildNode plus the ip/port-rewrite branch in
  GetOrCreateDataNode. GetTopology now returns nil on a detached
  subtree instead of panicking, so the linkage hooks can no-op safely.
- vid_map tracks a refcount per volume-server address so
  hasVolumeServer answers without scanning every vid location. The
  add path skips empty-address entries the same way the delete path
  already does, so a zero-value Location cannot leak a permanent
  serverRefCount[""] bucket.
- masters reuse a cached master-address set from MasterClient instead
  of walking the configured peer slice on every request.
- volume servers compare against a pre-built seed-master set and
  protect currentMaster reads/writes with an RWMutex, fixing the
  data race with the heartbeat goroutine. The seed slice is copied
  on construction so external mutation cannot desync it from the
  frozen lookup set.
- cluster.check drops the direct volume-to-volume sweep; volume
  servers no longer carry a peer-volume list, and the note next to
  the dropped probe is reworded to make clear that direct
  volume-to-volume reachability is intentionally not validated by
  this command.
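
A sketch of the vid_map refcounting from the second bullet, with
invented names:

    package wdclient

    type vidMap struct {
        serverRefCount map[string]int // volume-server address -> number of vid locations
    }

    func (m *vidMap) addLocation(addr string) {
        if addr == "" {
            return // a zero-value Location must not leak a permanent serverRefCount[""] bucket
        }
        m.serverRefCount[addr]++
    }

    func (m *vidMap) removeLocation(addr string) {
        if addr == "" {
            return
        }
        if m.serverRefCount[addr]--; m.serverRefCount[addr] <= 0 {
            delete(m.serverRefCount, addr)
        }
    }

    func (m *vidMap) hasVolumeServer(addr string) bool {
        return m.serverRefCount[addr] > 0 // O(1): no scan over every vid's locations
    }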

Update the volume-server integration tests that drove Ping through the
new admission gate: success-path coverage now targets the master peer
(the only type a volume server tracks), and the unknown/unreachable
path asserts the InvalidArgument the gate now returns instead of the
old downstream dial error.

Mirror the same admission gate in the Rust volume server crate: a
seed-master HashSet built once at startup plus a tokio RwLock over the
heartbeat-tracked current master, both consulted in is_known_ping_target
on every Ping, with InvalidArgument returned for any target that isn't
a recognised master.
2026-05-12 13:00:52 -07:00
Chris Lu
21054b6c18 volume: fail closed in admin gRPC gate when no whitelist is configured (#9440)
Add Guard.IsAdminAuthorized, a fail-closed variant of IsWhiteListed, and use
it to gate destructive volume admin RPCs. IsWhiteListed keeps its
allow-all-when-empty semantics for HTTP compatibility.

For TCP peers with an empty whitelist, off-host callers are rejected but
loopback (127.0.0.0/8, ::1) is still trusted. A volume server commonly
cohabits with the master/filer on a single host and in integration-test
clusters; the loopback exception keeps cluster-internal admin traffic
working without -whiteList while still locking out off-host attackers.

Non-TCP peers (in-process / bufconn / unix-socket) bypass the host check
entirely. When `weed server` runs master+volume+filer in a single process
the master dials the volume server in-process and the peer address surfaces
as "@", which has no parseable IP. Such a caller shares our OS process and
cannot be spoofed by a remote attacker, so we treat it as trusted by
construction.

The gate also tolerates a nil guard (developmental / embedded path) and only
enforces once a guard is wired up. UpdateWhiteList skips entries whose CIDR
fails to parse so the IP-iteration path can no longer hit a nil *net.IPNet.
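
Put together, the gate described above amounts to something like this
sketch (helper names invented; the revert higher up in this log rolls
the fail-closed semantics back):

    package security

    import (
        "context"
        "net"

        "google.golang.org/grpc/peer"
    )

    func isAdminPeerAllowed(ctx context.Context, whitelistConfigured bool,
        isWhiteListed func(host string) bool) bool {
        p, ok := peer.FromContext(ctx)
        if !ok {
            return false // if we don't know who the caller is, refuse
        }
        tcp, ok := p.Addr.(*net.TCPAddr)
        if !ok {
            return true // non-TCP peer ("@", bufconn, unix socket) shares our OS process
        }
        if tcp.IP.IsLoopback() {
            return true // cohabiting master/filer keeps working without -whiteList
        }
        if !whitelistConfigured {
            return false // fail closed: off-host caller, no whitelist configured
        }
        return isWhiteListed(tcp.IP.String())
    }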
2026-05-12 12:35:27 -07:00
Chris Lu
69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
2026-05-12 10:11:20 -07:00
Chris Lu
5e8f99f40a filer: require admin-signed JWT on the IAM gRPC service (#9442)
Every IAM RPC (CreateUser, PutPolicy, CreateAccessKey, ...) now requires
a Bearer token in the authorization metadata, signed with the filer
write-signing key. The service refuses to register on a filer that has
no jwt.filer_signing.key set, so the unauthenticated default is gone:
operators who use these RPCs must configure the key and attach a token
on every call.

Bearer scheme matching is case-insensitive (RFC 6750), every handler
nil-checks req before dereferencing it, and tests now cover the
expired-token path.
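
The case-insensitive scheme match is small enough to sketch (a stand-in,
not the PR's exact helper):

    package filer

    import "strings"

    // bearerToken extracts the token from an authorization metadata value,
    // matching the "Bearer" scheme case-insensitively per RFC 6750.
    func bearerToken(authorization string) (string, bool) {
        const scheme = "bearer "
        if len(authorization) <= len(scheme) ||
            !strings.EqualFold(authorization[:len(scheme)], scheme) {
            return "", false
        }
        return strings.TrimSpace(authorization[len(scheme):]), true
    }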
2026-05-12 10:11:08 -07:00
Chris Lu
05ed5c9ae8 filer: scope JWT allowed_prefixes to path components (#9439)
The allowed_prefixes check used a literal byte-prefix match, so a token
scoped to /tenant1 also matched /tenant1234, /tenant1-old, and similar
sibling paths. Match on /-separated path components after path.Clean
normalisation instead.
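
A sketch of the component-scoped match (helper name invented):

    package filer

    import (
        "path"
        "strings"
    )

    // pathWithinPrefix reports whether p lies on or under prefix by whole
    // /-separated components, so /tenant1 matches /tenant1/x but never
    // /tenant1234 or /tenant1-old.
    func pathWithinPrefix(p, prefix string) bool {
        p = path.Clean("/" + p)
        prefix = path.Clean("/" + prefix)
        if prefix == "/" || p == prefix {
            return true
        }
        return strings.HasPrefix(p, prefix+"/")
    }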
2026-05-12 10:10:48 -07:00
Chris Lu
532b088262 fix(ec): preserve source disk type across EC encoding (#9423) (#9449)
* fix(ec): carry source disk type on VolumeEcShardsMount (#9423)

When EC shards land on a target whose disk type differs from the
source volume's, master heartbeats wrongly reported under the target
disk's type. Add source_disk_type to VolumeEcShardsMountRequest; the
target server applies it to the in-memory EcVolume via SetDiskType so
the mount notification and steady-state heartbeat both carry the
source's disk type. Empty value falls back to the location's disk
type (used by disk-scan reload paths).

The override is not persisted with the volume — disk type stays an
environmental property and .vif remains portable.

* fix(ec): plumb source disk type through plugin worker (#9423)

Add source_disk_type to ErasureCodingTaskParams (field 8; 7 reserved),
populate it from the metric the detector already collects, thread it
through ec_task into the MountEcShards helper, and forward it on the
VolumeEcShardsMount RPC.

* fix(ec): mirror source disk type plumbing in rust volume server (#9423)

The volume_ec_shards_mount handler now forwards source_disk_type into
mount_ec_shard → DiskLocation::mount_ec_shards. When non-empty it
overrides ec_vol.disk_type (and each mounted shard's disk_type) via
the new set_disk_type method; empty value keeps the location's disk
type, so disk-scan reload and reconcile paths are unchanged.

Also picks up two pre-existing proto drifts that 'make gen' synced
from weed/pb (LockRingUpdate in master.proto, listing_cache_ttl_seconds
in remote.proto).

* feat(ec): bias placement toward preferred disk type (#9423)

Add DiskCandidate.DiskType and PlacementRequest.PreferredDiskType.
When PreferredDiskType is non-empty, SelectDestinations partitions
suitable disks into matching/fallback tiers and runs the rack/server/
disk-diversity passes on the matching tier first; the fallback tier
is only consulted if the matching pool can't satisfy ShardsNeeded.
PlacementResult.SpilledToOtherDiskType lets callers warn on spillover.

Empty PreferredDiskType keeps the existing single-pool behavior.
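
In sketch form (types invented; note the follow-up commit below swaps
the raw equality for normalized comparison):

    package placement

    type diskCandidate struct {
        DiskType string
    }

    // partitionByDiskType splits candidates into a matching tier, tried
    // first, and a fallback tier consulted only if the matching pool
    // cannot satisfy ShardsNeeded.
    func partitionByDiskType(cands []diskCandidate, preferred string) (matching, fallback []diskCandidate) {
        if preferred == "" {
            return cands, nil // empty preference keeps the single-pool behavior
        }
        for _, c := range cands {
            if c.DiskType == preferred { // the follow-up fix compares normalized values instead
                matching = append(matching, c)
            } else {
                fallback = append(fallback, c)
            }
        }
        return matching, fallback
    }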

* fix(ec): plumb source disk type into placement planner (#9423)

diskInfosToCandidates now copies DiskInfo.DiskType into the placement
candidate, and ecPlacementPlanner.selectDestinations forwards
metric.DiskType as PreferredDiskType so EC shards land on disks
matching the source volume's disk type when possible. A glog warning
fires when placement had to spill to other disk types.

* test(ec): integration coverage for source-disk-type plumbing (#9423)

store_ec_disk_type_test exercises Store.MountEcShards end-to-end: a
shard physically lives on an HDD location, MountEcShards is called
with sourceDiskType="ssd", and the test asserts that the in-memory
EcVolume, the mounted shard, the NewEcShardsChan notification, and
the steady-state heartbeat all report under the source's disk type.
A companion test pins the empty-source path so disk-scan reload
keeps the location's disk type.

detection_disk_type_test exercises the worker plumbing: with a
cluster of nodes carrying both HDD and SSD disks, planECDestinations
must place every shard on SSD when metric.DiskType="ssd"; with only
one SSD node and 13 HDD nodes it must still satisfy a 10+4 layout
via spillover (and log a warning).

* revert(ec): drop unrelated proto drift in seaweed-volume/proto (#9423)

make gen pulled two pre-existing OSS changes into the rust proto
tree (LockRingUpdate / by_plugin in master.proto,
listing_cache_ttl_seconds in remote.proto). Reviewers flagged it as
scope creep — none of the rust EC fix references those fields.
Restore both files to origin/master so this branch only touches
EC-related symbols.

* fix(ec placement): treat empty disk type as hdd and skip used racks on spill (#9423)

partitionByDiskType used raw string comparison, so a PreferredDiskType
of "hdd" never matched candidates whose DiskType is "" (the
HardDriveType sentinel that weed/storage/types uses). EC encoding of
an HDD source would spill onto any HDD reporting "" even when the
cluster has plenty of matching capacity. Normalize both sides
through normalizeDiskType, which lowercases and folds "" → "hdd",
mirroring types.ToDiskType without taking a dependency on it.
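
The normalization itself is tiny; sketched from the description above:

    package placement

    import "strings"

    func normalizeDiskType(t string) string {
        if t = strings.ToLower(t); t == "" {
            return "hdd" // the HardDriveType sentinel weed/storage/types uses
        }
        return t
    }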

selectFromTier's rack-diversity pass also kept revisiting racks the
preferred tier had already used when running on the fallback tier,
which negated PreferDifferentRacks on spillover. Skip racks already
in usedRacks so fallback placements still spread onto new racks.

* fix(ec): empty-source remount must not clobber existing disk type (#9423)

mount_ec_shards_with_idx_dir runs more than once per vid (RPC mount,
disk-scan reload, orphan-shard reconcile). After an RPC sets the
source-derived disk type, any later call passing source_disk_type=""
was resetting ec_vol.disk_type back to the location's value, which
reintroduces the heartbeat drift this PR is meant to fix. Only
default to the location's disk type when the EC volume is fresh
(no shards mounted yet); otherwise leave the recorded type alone so
empty-source reloads preserve whatever the original mount RPC set.
2026-05-11 20:21:50 -07:00
Chris Lu
b2d24dd54f volume: require admin auth on BatchDelete (#9438)
Run BatchDelete through checkGrpcAdminAuth like the other destructive
volume-server RPCs (VolumeDelete, DeleteCollection, vacuum, EC, ...),
so a whitelist-configured server denies non-admin callers.
2026-05-11 13:50:48 -07:00
Chris Lu
2b21d19e4c volume: require admin auth on ReadAllNeedles and VolumeNeedleStatus (#9437)
Both RPCs hand out raw needle bytes / cookies. Run them through
checkGrpcAdminAuth like the rest of the volume-server admin handlers.
2026-05-11 13:50:19 -07:00
Minsoo Kim
a1e5eb9dad Fix UI prefix URL encoding (#9344)
* Fix filer UI navigation for URL-sensitive object prefixes

* Fix filer UI navigation for URL-sensitive object prefixes

* Clarify filer UI path escaping test name

Rename the legacy filer UI path test to describe the actual behavior
being checked.

The printpath helper preserves timestamp characters that are valid in
URL path components, while the PR fix is focused on query-string
escaping for path and cursor parameters.
2026-05-06 19:14:36 -07:00
Chris Lu
1c0e24f06a fix(balance): don't move remote-tiered volumes; don't fatal on missing .idx (#9335)
* fix(volume): don't fatal on missing .idx for remote-tiered volume

A .vif left behind without its .idx (orphaned by a crashed move, partial
copy, or hand-edit) would trip glog.Fatalf in checkIdxFile and take the
whole volume server down on boot, killing every healthy volume on it
too. For remote-tiered volumes treat it as a per-volume load error so
the server can come up and the operator can clean up the stray .vif.

Refs #9331.

* fix(balance): skip remote-tiered volumes in admin balance detection

The admin/worker balance detector had no equivalent of the shell-side
guard ("does not move volume in remote storage" in
command_volume_balance.go), so it scheduled moves on remote-tiered
volumes. The "move" copies .idx/.vif to the destination and then calls
Volume.Destroy on the source, which calls backendStorage.DeleteFile —
deleting the remote object the destination's new .vif now points at.

Populate HasRemoteCopy on the metrics emitted by both the admin
maintenance scanner and the worker's master poll, then drop those
volumes at the top of Detection.

Fixes #9331.

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* fix(volume): keep remote data on volume-move-driven delete

The on-source delete after a volume move (admin/worker balance and
shell volume.move) ran Volume.Destroy with no way to opt out of the
remote-object cleanup. Volume.Destroy unconditionally calls
backendStorage.DeleteFile for remote-tiered volumes, so a successful
move would copy .idx/.vif to the destination and then nuke the cloud
object the destination's new .vif was already pointing at.

Add VolumeDeleteRequest.keep_remote_data and plumb it through
Store.DeleteVolume / DiskLocation.DeleteVolume / Volume.Destroy. The
balance task and shell volume.move set it to true; the post-tier-upload
cleanup of other replicas and the over-replication trim in
volume.fix.replication also set it to true since the remote object is
still referenced. Other real-delete callers keep the default. The
delete-before-receive path in VolumeCopy also sets it: the inbound copy
carries a .vif that may reference the same cloud object as the
existing volume.

Refs #9331.
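
A minimal sketch of the opt-out (field names invented, standing in for
the backend plumbing):

    package storage

    type volume struct {
        remoteKey        string                 // empty when the volume has no remote copy
        deleteRemoteFile func(key string) error // backendStorage.DeleteFile stand-in
        removeLocalFiles func() error
    }

    func (v *volume) destroy(keepRemoteData bool) error {
        if v.remoteKey != "" && !keepRemoteData {
            // real-delete callers keep the default and clean up the cloud object
            if err := v.deleteRemoteFile(v.remoteKey); err != nil {
                return err
            }
        }
        // move/balance callers pass keepRemoteData=true: the destination's
        // new .vif still references the same remote object
        return v.removeLocalFiles()
    }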

* test(storage): in-process remote-tier integration tests

Cover the four operations the user is most likely to run against a
cloud-tiered volume — balance/move, vacuum, EC encode, EC decode — by
registering a local-disk-backed BackendStorage as the "remote" tier and
exercising the real Volume / DiskLocation / EC encoder code paths.

Locks in:
- Destroy(keepRemoteData=true) preserves the remote object (move case)
- Destroy(keepRemoteData=false) deletes it (real-delete case)
- Vacuum/compact on a remote-tier volume never deletes the remote object
- EC encode requires the local .dat (callers must download first)
- EC encode + rebuild round-trips after a tier-down

Tests run in-process and finish in under a second total — no cluster,
binary, or external storage required.

* fix(rust-volume): keep remote data on volume-move-driven delete

Mirror the Go fix in seaweed-volume: plumb keep_remote_data through
grpc volume_delete → Store.delete_volume → DiskLocation.delete_volume
→ Volume.destroy, and skip the s3-tier delete_file call when the flag
is set. The pre-receive cleanup in volume_copy passes true for the
same reason as the Go side: the inbound copy carries a .vif that may
reference the same cloud object as the existing volume.

The Rust loader already warns rather than fataling on a stray .vif
without an .idx (volume.rs load_index_inmemory / load_index_redb), so
no counterpart to the Go fatal-on-missing-idx fix is needed.

Refs #9331.

* fix(volume): preserve remote tier on IO-error eviction; fix EC test target

Two review nits:

- Store.MaybeAddVolumes' periodic cleanup pass deleted IO-errored
  volumes with keepRemoteData=false, so a transient local fault on a
  remote-tiered volume would also nuke the cloud object. Track the
  delete reason via a parallel slice and pass keepRemoteData=v.HasRemoteFile()
  for IO-error evictions; TTL-expired evictions still pass false.

- TestRemoteTier_ECEncodeDecode_AfterDownload deleted shards 0..3 but
  called them "parity" — by the klauspost/reedsolomon convention shards
  0..DataShardsCount-1 are data and DataShardsCount..TotalShardsCount-1
  are parity. Switch the loop to delete the parity range so the
  intent matches the indices.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-05-06 15:19:43 -07:00
Chris Lu
2417ba0354 fix(volume): add authentication to destructive gRPC admin endpoints (#8876)
* fix(volume): add authentication to destructive gRPC admin endpoints

Three destructive VolumeServer gRPC endpoints (DeleteCollection,
VolumeDelete, VolumeServerLeave) had no authentication checks, unlike
their HTTP counterparts which are protected by the Guard whitelist.

Add IsWhiteListed(host) to security.Guard and a checkGrpcAdminAuth
helper on VolumeServer that extracts the peer IP from gRPC context and
validates it against the guard whitelist. Gate all three endpoints
behind this check.

* fix(volume): tolerate unparseable gRPC peer address in admin auth check

S3 Filer Group integration tests were failing with
PermissionDenied "bad peer address: address @: missing port in address"
when DeleteCollection ran across the in-process gRPC connection
between filer and volume server — the peer addr surfaces as "@" there
and net.SplitHostPort can't parse it. The check rejected before
IsWhiteListed could exercise its allow-all path for empty-whitelist
deployments.

Hand the raw peer string to IsWhiteListed when SplitHostPort fails.
With no whitelist configured (the test environment's mode) it accepts;
with a whitelist configured the unparseable host won't match anything
and the call still gets denied as it should.

Adds three regression tests for IsWhiteListed pinning the empty-config
allow-all, populated-list reject-unknown, and signing-key-only allow-
all branches that the gRPC admin helper relies on.

* refactor(security): dedup checkWhiteList through IsWhiteListed

The HTTP-side checkWhiteList and the gRPC-side IsWhiteListed had the
same lookup logic in two places; future drift was just a matter of
time. Have checkWhiteList delegate so the membership semantics live
in exactly one function.

Behaviour is unchanged: the new path still returns nil for
isEmptyWhiteList (signing-key-only mode) and still rejects unknown
hosts when a whitelist is configured.

Addresses gemini medium review on PR #8876.

* fix(volume): protect remaining state-altering gRPC admin endpoints

DeleteCollection, VolumeDelete, and VolumeServerLeave were the
truly-destructive endpoints, but AllocateVolume, VolumeMount,
VolumeUnmount, VolumeConfigure, VolumeMarkReadonly, and
VolumeMarkWritable also modify server state and should sit behind
the same whitelist gate. Read-only endpoints (VolumeStatus,
VolumeServerStatus, VolumeNeedleStatus, Ping) stay open.

The check is a no-op when no whitelist is configured (the default),
so existing deployments keep working; operators who lock down their
volume servers via guard.white_list now get consistent coverage.

Addresses gemini security-high review on PR #8876.

* fix(volume): typed peer addr + audit log for gRPC admin auth

Prefer a typed *net.TCPAddr when extracting the peer IP — string
parsing was already a fallback for the in-process case but using the
typed form first is cleaner and skips an unnecessary parse on the
common path. Log failed authorization attempts at V(0) so an operator
running with a whitelist sees the host that was rejected (and the
raw remote address in case the IP lookup itself was the failure
mode), matching what the HTTP Guard already does.

Addresses gemini medium review on PR #8876.

* fix(volume): protect vacuum + scrub + EC-shards-delete admin endpoints

Five more master/admin-driven destructive operations live outside
volume_grpc_admin.go and were missing the same whitelist gate:

- VacuumVolumeCompact, VacuumVolumeCommit, VacuumVolumeCleanup
- ScrubVolume
- VolumeEcShardsDelete

VacuumVolumeCheck stays open (read-only). BatchDelete also stays
open: it's the data-plane multi-object delete called from the S3 API
and filer, not an admin operation; gating it would break ordinary S3
DeleteObjects calls.

Addresses gemini security-high review on PR #8876.

* fix(volume): simplify no-peer-info branch in gRPC admin auth

The IsWhiteListed("") fallback was defending against a scenario
that doesn't actually arise — real gRPC connections always populate
peer info. Drop the branch and just deny when peer info is missing,
which is the safer default and matches "if we don't know who the
caller is, refuse".

* fix(volume-rust): mirror gRPC admin auth on the rust volume server

The rust volume server has the same set of destructive admin
endpoints as the Go side and the same Guard infrastructure, but
nothing was wired together — every endpoint accepted unauthenticated
calls regardless of guard configuration. Same vulnerability class
the Go fix on this PR closes; this commit closes it on the rust
side too so the two stacks stay aligned.

Adds VolumeGrpcService::check_grpc_admin_auth that pulls the peer
SocketAddr off the tonic Request and runs Guard::check_whitelist on
its IP, then applies the helper to the same set the Go side covers:
DeleteCollection, AllocateVolume, VolumeMount, VolumeUnmount,
VolumeDelete, VolumeMarkReadonly, VolumeMarkWritable,
VolumeConfigure, VacuumVolumeCompact, VacuumVolumeCommit,
VacuumVolumeCleanup, VolumeServerLeave, ScrubVolume,
VolumeEcShardsDelete. Read-only endpoints stay open; BatchDelete
stays open as a data-plane multi-object delete.
2026-05-04 21:14:55 -07:00
Chris Lu
d265274e13 fix(nfs): accept dirpath anywhere under the export, mirroring rclone (#9291)
* fix(nfs): accept any MOUNT3 dirpath, mirroring rclone's permissive policy

weed nfs has exactly one export per process, so the MOUNT3 dirpath
argument has no second export to disambiguate against. Strict
comparison only translated PV-path typos into the inconsistent
"mount succeeds but empty" / "mount fails completely" split that
operators see.

Match rclone's serve nfs Handler.Mount: ignore the dirpath, log an INFO
line when it differs from the configured export, and always serve the
export root. Apply the same change to the UDP MOUNT3 path so kernel
clients defaulting to mountproto=udp see identical behaviour. Access
control still goes through -allowedClients / -ip.bind, and file-handle
scoping in FromHandle is unchanged so handles still cannot escape the
export.

Replace the prior single-path reject tests with table tests covering
the shapes operators commonly hit: root, parent, sibling, deeper child,
unrelated, empty, relative form, exact match, and trailing slash, at
the Handler.Mount, UDP MOUNT3, and full RPC layers.

* feat(nfs): mount at subdirectory when MOUNT3 dirpath is under the export

Make the dirpath argument meaningful when the client asks for a subtree
of the configured export. With -filer.path=/buckets, a client mounting
<server>:/buckets/data lands directly inside /buckets/data instead of
at the export root.

  - dirpath equals the export root: serve the export root.
  - dirpath strictly under the export, directory entry: serve that
    subdirectory; the returned filehandle encodes its inode.
  - dirpath strictly under the export, missing or non-directory: reject
    with NoEnt or NotDir.
  - dirpath outside the export: keep the rclone-style fallback to the
    export root.
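
Condensed, the four rules above resolve like this (lookup is a
hypothetical stand-in for the filer stat):

    package nfs

    import "strings"

    type entryKind int

    const (
        kindMissing entryKind = iota
        kindDir
        kindOther
    )

    // resolveMountPath applies the dirpath rules above and returns the
    // path to serve plus a MOUNT3 status name.
    func resolveMountPath(dirpath, exportRoot string, lookup func(string) entryKind) (string, string) {
        switch {
        case dirpath == exportRoot:
            return exportRoot, "OK"
        case strings.HasPrefix(dirpath, exportRoot+"/"):
            switch lookup(dirpath) {
            case kindDir:
                return dirpath, "OK" // sub-rooted fs; the returned FH encodes this inode
            case kindMissing:
                return "", "NOENT"
            default:
                return "", "NOTDIR"
            }
        default:
            return exportRoot, "OK" // rclone-style fallback to the export root
        }
    }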

TCP returns a sub-rooted seaweedFileSystem and lets go-nfs's onMount
call ToHandle to encode the FH; UDP encodes the FH itself. FromHandle
is unchanged: handles are content-addressed by inode and resolve via
the inode index, so they remain stable across mounts and across
process restarts.

The trimmed permissive tests keep their outside-export shapes; new
subexport tests cover under-export directories, missing entries, and
non-directory entries on Handler.Mount, the UDP MOUNT3 wire, and
through the full RPC stack.

* nfs: propagate request context through MOUNT3 resolution

Mount now accepts the gonfs context and threads it through
resolveMountFilesystem and lstatExportStatus so a slow filer call
during MOUNT cannot outlive a cancelled or timed-out request.

lstatExportStatus uses fileInfoForVirtualPath(ctx, "/") directly
instead of billy.Filesystem.Lstat, which would otherwise drop the
context on the floor by calling fileInfoForVirtualPathWithOptions
with context.Background().

Lower the successful subexport-mount log from V(0) to V(1). The
fallback log stays at V(0) so operator typos still surface; the
success line is per-mount churn that adds up on NFS-CSI deployments.

* nfs: mirror TCP defensive checks on the UDP MOUNT3 path

Two transport-parity bugs the rabbit caught:

(1) The exact-export-root and outside-export branches were returning
MNT3_OK unconditionally, while the TCP handler runs lstatExportStatus
on those same branches. If the configured -filer.path has been
removed from the filer, TCP returns NoEnt/ServerFault but UDP would
still hand out a synthetic root handle pointing at nothing. Add
rootMountStatus as the UDP analogue and call it on both branches.

(2) resolveSubexportFileHandle did filer I/O on the single UDP serve
loop with context.Background(). One slow filer round-trip would
block every later MOUNT packet. Wrap each MOUNT call's filer work in
context.WithTimeout(mountUDPLookupTimeout) and thread that ctx
through both rootMountStatus and resolveSubexportFileHandle.

Lower the successful subexport log to V(1) to match the TCP side.

* nfs: assert TCP/UDP MOUNT3 produce byte-identical filehandles

The existing UDP subexport assertions only checked the decoded inode
and kind. A regression that drifted the generation, exportID, or
encoding format on one transport but not the other would have slipped
through. Build the TCP Handler from the same Server, drive its Mount
with the same dirpath, and require ToHandle to match the raw UDP FH
bytes for every OK case.

* nfs: take MOUNT3 dirpath as string in resolveMountFilesystem

Convert req.Dirpath to string once at the call site instead of
sprinkling string(...) casts through every log line and conversion
inside the function. Behavior unchanged.

* nfs: share rootFS lifecycle between TCP and UDP MOUNT handlers

Server.rootFilesystem() lazily constructs the seaweedFileSystem rooted
at the configured export the first time anything asks for it, then
hands the same instance to every subsequent caller. newHandler() and
mountUDPServer.rootMountStatus() now both go through it, so:

  - Both transports observe the same chunk reader cache and chunk
    invalidator without depending on call order during startup.
  - The UDP defensive Lstat doesn't allocate a fresh wrapper per
    MOUNT request anymore; one struct lives for the life of the
    Server.

The sub-rooted seaweedFileSystem the subexport branch builds in
resolveSubexportFileHandle is still per-request because actualRoot
varies with the requested dirpath.

* nfs: drive rootFilesystem before reading sharedReaderCache on UDP

The UDP listener is started before serve() calls newHandler(), so an
under-export MOUNT3 request can reach resolveSubexportFileHandle before
Server.sharedReaderCache has been assigned. Reading it directly would
hand newSeaweedFileSystem a nil cache and the sub-fs would build a
throwaway ReaderCache that never gets shared with the TCP path.

Take rootFS off Server.rootFilesystem() (which drives the sync.Once
that initializes the shared cache) and read readerCache off that
instead, so subexport sub-fs instances always share the same reader
cache as rootFS regardless of which transport sees the first MOUNT.

* nfs: collapse exact-match and outside-export MOUNT branches

The two branches return the same filesystem (export root) and the
same status; only the log line differs. Combine the conditions and
guard the fallback log inline. Behavior unchanged.
2026-04-30 10:06:44 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results and allowing operators to
construct alerts and dashboards to monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level, and not
intended for fine-grained tracking of scrubbing operations.
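
For reference, a sketch of how such collectors are typically registered
with the Prometheus Go client; the metric names are from this PR, the
label set is an assumption:

    package stats

    import "github.com/prometheus/client_golang/prometheus"

    var (
        scrubLastTimeSeconds = prometheus.NewGaugeVec(
            prometheus.GaugeOpts{
                Name: "scrub_last_time_seconds",
                Help: "Unix time of the last completed scrub.",
            },
            []string{"volume"}, // assumed label
        )
        scrubVolumeFailures = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Name: "scrub_volume_failures",
                Help: "Cumulative count of volume scrub failures.",
            },
            []string{"volume"}, // assumed label
        )
    )

    func init() {
        prometheus.MustRegister(scrubLastTimeSeconds, scrubVolumeFailures)
    }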
2026-04-28 12:34:02 -07:00
Chris Lu
e2c8791441 fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter

The upstream willscott/go-nfs library dispatches RPC calls by (program,
procedure) only — it does not validate the program version. A client
sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same
handler map as NFSv3 and gets routed to v3 SETATTR, which parses the
COMPOUND args as SETATTR3args and writes a malformed reply. The kernel
then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version
or transport protocol is not supported" without retrying v3.

This commit adds a listener wrapper that peeks the first RPC frame on
each new TCP connection. If the program is NFS or MOUNT and the version
is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported
range 3..3, per RFC 5531) directly to the socket and closes the
connection. v3 frames are replayed unchanged via a bufio reader so go-nfs
sees the original bytes. Unknown programs pass through so go-nfs's own
PROG_UNAVAIL handling stays in charge.

The filter is not yet wired into the server; the next commit activates
it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and
unknown-program pass-through.
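
The classification step, sketched with the guards the later commits on
this PR add folded in (short first fragments and non-v2 rpcvers pass
through); offsets follow the layout after the 4-byte record mark:

    package nfs

    import "encoding/binary"

    const (
        progNFS   = 100003
        progMount = 100005
    )

    // wantsProgMismatch decides from the peeked bytes whether to
    // synthesize a PROG_MISMATCH (supported range 3..3) reply and close.
    func wantsProgMismatch(hdr []byte) bool {
        if len(hdr) < 28 {
            return false // first fragment too short for a CALL header: pass through to go-nfs
        }
        msgType := binary.BigEndian.Uint32(hdr[8:12])  // 0 = CALL
        rpcVers := binary.BigEndian.Uint32(hdr[12:16]) // only classify ONC RPC v2
        prog := binary.BigEndian.Uint32(hdr[16:20])
        vers := binary.BigEndian.Uint32(hdr[20:24])
        if msgType != 0 || rpcVers != 2 {
            return false
        }
        return (prog == progNFS || prog == progMount) && vers != 3
    }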

* fix(nfs): wire NFSv3 version filter into the listener chain

Place the version filter after the optional client allowlist so that
unauthorized peers are still rejected first by IP/CIDR before we look at
RPC content. With the filter active, a Linux client doing the default
v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which
lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3
mountOptions that already work for rclone serve nfs against this
deployment.

* test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1

TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending
mountProgramID with version 1, so the test never actually covered the
"reject MOUNT v4" path it claims to exercise. The filter does reject any
non-v3 version uniformly, so the test still passed, but a future change
that tightened the version check (for example, only rejecting v4) would
let this test silently lie about coverage. Bump the call to version 4 so
the name matches what is actually exercised.

* refactor(nfs): reuse package RPC constants and io.ReadFull in version filter

The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH,
AUTH_NONE, the NFS/MOUNT program numbers) are already named in
portmap.go alongside the portmap responder. Reuse them here instead of
defining a parallel set in rpc_version_filter.go: keeping one source of
truth per package means a future correction in one spot can't drift away
from the other. The filter-only constants (peek timeout, peek length,
supportedNFSVer) stay local because they have no portmap analog.

In the test, drop the bespoke readFull loop in favor of io.ReadFull.
The custom version was a near-identical reimplementation that did not
return io.ErrUnexpectedEOF on short reads, so the standard library is
both shorter and more diagnostic-friendly.

* fix(nfs): move RPC peek off the Accept path

The previous wrapper called filterFirstRPCFrame inline inside
versionFilterListener.Accept, which meant a single slow or idle TCP
connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line
blocking against every other accept: gonfs.Serve calls Accept serially,
so each in-flight peek stalled the next legitimate client until the
deadline expired. An attacker who simply opens a TCP connection without
sending any RPC payload could trivially throttle accept throughput.

Restructure the wrapper so a background goroutine drives the inner
Accept loop and hands each raw conn to its own short-lived goroutine
that runs the peek. Validated conns are sent on a buffered channel,
which the wrapper's Accept reads from; rejected conns finish their
PROG_MISMATCH reply and disappear without ever reaching the channel.
This means N concurrent slow clients only block themselves, not the
N+1th fast client that connects after them.

Add Close coordination — sync.WaitGroup for the accept loop and per-conn
peek goroutines, plus a closed channel so Accept unblocks immediately on
shutdown — so the wrapper now satisfies the full net.Listener contract
instead of relying on the embedded listener.

Add a regression test that opens a slow conn (TCP only, never writes)
and a fast conn (sends a v3 frame) and asserts the fast conn reaches
the inner accept handler well below the peek timeout.

* test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close

The post-rejection check was only failing when conn.Read succeeded; any
error — including a deadline timeout because the server kept the socket
open — let the test pass. That defeats the point of the assertion: a
regression where the filter replies but forgets to close would slip
through silently.

Match against io.EOF explicitly. The TCP semantics are deterministic
here: the server writes PROG_MISMATCH, calls conn.Close(), the client
reads what's left in flight and then sees a clean FIN, which surfaces
as io.EOF on the next zero-byte read.

* fix(nfs): reject short first fragments before parsing RPC header fields

bufio.Reader.Peek(28) is willing to read across record boundaries to
satisfy the requested length, so a final fragment whose body is shorter
than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog
+ vers + proc) leaves the trailing peek bytes pointing at the next
RPC's framing or whatever bytes happen to follow on the wire. Indexing
hdr[16:24] for prog/vers in that state can spuriously reject (or pass
through) traffic based on data that doesn't belong to the request being
classified.

Drop those frames out of the filter early: if the first fragment can't
possibly hold a full CALL header, pass the connection straight to
go-nfs, which has its own framing-error handling for malformed input.

Add a regression test that crafts a 12-byte first fragment whose
trailing peek bytes are deliberately shaped like an NFSv4 CALL — without
the length check the filter sends a PROG_MISMATCH; with it, the conn
passes through silently. Verified by stashing the production-code change
and running the test in isolation: it fails as expected without the fix.

* fix(nfs): retry transient Accept() errors instead of treating any error as terminal

acceptLoop previously exited on the first error returned by the inner
listener's Accept(). That conflates two very different failure modes:
permanent shutdown (the listener was Close()d, OS-level fatal failure)
and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on
accept). The transient case should not take the entire NFS server down
— a single fd-table-full event would leave the deployment offline until
restart.

Classify the error: errors.Is(err, net.ErrClosed) is the permanent
signal we already wanted to surface to Accept(); everything else is
transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff
(50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The
backoff sleep is interruptible via the closed channel so Close() still
shuts the loop down promptly.

Add a regression test that wraps a real listener with one that injects
3 fake transient errors before delegating, and asserts Accept() still
delivers the next real connection. Verified the test fails on the old
"any error is terminal" loop and passes with this change.

* fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic

The filter was rejecting any CALL-shaped record with prog=100003 or
100005 and vers!=3, regardless of the rpcvers field. If the caller is
speaking some other protocol that happens to share the port — or just
sending garbled bytes — pretending to be an NFSv3 server replying
PROG_MISMATCH is misleading at best, and at worst fabricates a coherent
RPC reply for traffic we don't actually understand.

Add an rpcvers==2 check between the msg_type and prog/vers parses. Any
non-v2 record now passes through to go-nfs, whose RFC 5531 §9
RPC_MISMATCH handling is the correct place to reject mis-versioned RPC.

Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers
field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the
client and that the conn is delivered to the inner accept handler.
Verified the test fails on the previous code (filter still rejected on
prog/vers alone) and passes with the guard in place.

* fix(nfs): bound Close() latency by evicting in-flight prefilter conns

Close() does wg.Wait() to drain handleConn goroutines, but each of those
goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up
to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC
header. A client that completes the TCP handshake but never sends a
byte therefore stretched shutdown by 10s per such conn — a real
regression for stop/restart paths and for tests that just want to tear
the listener down.

Track raw (pre-peek) conns in versionFilterListener.inFlight as
handleConn enters, untrack on exit, and have Close() forcibly close
every tracked conn before wg.Wait. Closing the underlying conn breaks
its Peek immediately, so handleConn returns within a single scheduler
hop. trackInFlight also short-circuits if shutdown has already started,
so a conn accepted after signalClose can't slip past the eviction.

Black-box regression test opens 4 idle TCP-handshake-only conns, lets
their handleConn goroutines settle into Peek, and asserts Close()
returns under 2s. Verified: same test fails on the previous code with
Close taking ~9.9s; passes here at ~100ms.
2026-04-28 12:17:54 -07:00
Chris Lu
5fbe39320c fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) (#9245)
* fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212)

ec.rebuild only sets CopyEcxFile=true on the first shard sent to the
rebuilder; subsequent shards rely on VolumeEcShardsCopy / ReceiveFile
auto-select to land on the same disk. The old auto-select used
FindEcVolume (in-memory) to detect the "already has this volume" case.
Mid-rebuild, no EC volume has been mounted yet on the destination, so
FindEcVolume returns nothing and the fallback picks "any HDD with free
space" — which can split shards from their .ecx across disks of the
same node and feed the orphan-shard layout reported in #9212 / fixed
on the loader side in #9244.

Add Store.FindEcShardTargetLocation as the canonical placement
primitive: prefer a mounted EC volume, then a disk that has the .ecx
on disk, then any HDD, then any disk. DiskLocation.HasEcxFileOnDisk is
the new on-disk check, and it looks at IdxDirectory first with a
fallback to Directory to handle .ecx written before -dir.idx was
configured.

Both VolumeEcShardsCopy and ReceiveFile now route through the new
helper, dropping their duplicated 4-level fallback ladder. No protocol
changes; explicit DiskId callers are unaffected.

* fix(volume_server): treat directories named *.ecx as no-match in HasEcxFileOnDisk

os.Stat(".ecx") succeeds for both files and directories. If something
happens to leave a directory named X.ecx in the data or idx folder,
HasEcxFileOnDisk would currently report true and FindEcShardTargetLocation
would route shards to that disk — where NewEcVolume's eventual
OpenFile(O_RDWR) on the same path errors out.

Add a !info.IsDir() check on both stat sites. Cheap and conservative.
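
A sketch of the adjusted check (file-naming scheme assumed; the real
helper also consults IdxDirectory before Directory):

  // assumes: import ("fmt"; "os"; "path/filepath")
  func hasEcxFileOnDisk(dir string, vid uint32) bool {
      info, err := os.Stat(filepath.Join(dir, fmt.Sprintf("%d.ecx", vid)))
      return err == nil && !info.IsDir() // a directory named X.ecx is no match
  }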

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): collapse EC placement helper to a single pass

FindEcShardTargetLocation called FindFreeLocation up to four times. Each
call iterates s.Locations and acquires VolumesLen / EcShardCount RLocks
per disk — for a typical 4-disk node that's 32 RLock cycles per
placement decision.

Walk s.Locations once, score each disk by tier (mounted > .ecx-on-disk
> HDD > any-disk), break ties by free count. The free-slot math is
factored into a small helper that mirrors FindFreeLocation's formula
without re-entering the location's locks. Behaviour is unchanged: each
existing tier still wins over later tiers, and within a tier the disk
with the most free count still wins, matching the original max-tracking
in FindFreeLocation.
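
A compressed sketch of the single pass, over assumed types (the real
DiskLocation API differs):

  type ecDisk interface {
      hasMountedEcVolume(vid uint32) bool
      hasEcxFileOnDisk(vid uint32) bool
      isHdd() bool
      freeShardSlots() int
  }

  func pickEcShardLocation(disks []ecDisk, vid uint32) ecDisk {
      var best ecDisk
      bestTier, bestFree := 99, -1
      for _, d := range disks { // one walk, no per-tier re-scan
          free := d.freeShardSlots()
          if free <= 0 {
              continue
          }
          tier := 3 // any disk with free space
          switch {
          case d.hasMountedEcVolume(vid):
              tier = 0
          case d.hasEcxFileOnDisk(vid):
              tier = 1
          case d.isHdd():
              tier = 2
          }
          // lower tier wins; within a tier, most free slots wins
          if tier < bestTier || (tier == bestTier && free > bestFree) {
              best, bestTier, bestFree = d, tier, free
          }
      }
      return best
  }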

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): thread dataShardCount as a parameter through EC placement

ecFreeShardCount and FindEcShardTargetLocation referenced
erasure_coding.DataShardsCount directly. Take it as a parameter so
custom-ratio builds (e.g. enterprise) can swap the default without
touching the helper itself, and so unit tests can pin a specific ratio
independent of the package constant. Default callsites in
VolumeEcShardsCopy and ReceiveFile now pass the package default
explicitly; tests pass a literal 10 for clarity.

* fix(volume_server): treat MaxVolumeCount=0 as unlimited in EC placement

ecFreeShardCount computed `MaxVolumeCount - VolumesLen()` and went
negative when MaxVolumeCount was 0 — the "unlimited disk" sentinel
already honoured by Store.hasFreeDiskLocation and friends. With a
negative free count, FindEcShardTargetLocation's `freeCount <= 0`
guard skipped the disk entirely, so unlimited disks could never receive
EC shards via the placement helper.

Special-case MaxVolumeCount<=0: report a synthetic large free count
that decrements with current usage, so unlimited disks are eligible
and tie-breaks still prefer the less-loaded one. Added
TestFindEcShardTargetLocation_HonoursUnlimitedDisk as the regression.

Reported in PR #9245 review by @gemini-code-assist.

* fix(volume_server): account in shard slots, not volume slots, in ecFreeShardCount

FindFreeLocation in store.go ends with `free /= DataShardsCount`,
converting "shard slots free" back to "volume-equivalent slots." The
truncation is harmless there, but my new ecFreeShardCount inherited
the same final divide and re-introduced exactly the orphan-shard
hazard #9245 was meant to prevent: with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 the formula reports 0 even though the
disk has room for 9 more shards, so subsequent shards route off the
.ecx-owning disk into the HDD-fallback tier.

Drop the trailing divide and return the count directly in shard slots.
Same shape, finer granularity; tie-breaks still order by free count.
The unlimited branch's "used" calculation is updated to match
(combining volume slots and shard slots in shard units). Added
TestFindEcShardTargetLocation_TightProvisioningKeepsEcxDisk as the
regression.
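
Putting the two fixes together, an illustrative version of the
accounting (field meanings assumed): with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 and dataShardCount=10 it reports 9 free
shard slots instead of 0.

  func ecFreeShardCount(maxVolumeCount, volumesLen, ecShardCount, dataShardCount int) int {
      used := volumesLen*dataShardCount + ecShardCount // all in shard units
      if maxVolumeCount <= 0 {
          // unlimited-disk sentinel: a large synthetic capacity keeps the
          // disk eligible while tie-breaks still prefer the less-loaded one
          return (1 << 20) - used
      }
      return maxVolumeCount*dataShardCount - used // no trailing divide
  }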

Reported in PR #9245 review by @coderabbitai.
2026-04-27 15:59:57 -07:00
Lisandro Pin
2c404f66bc Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs,
which can indicate client issues.

Note that the metric does not cover gRPC calls, as the current proto service API
does not support seeking files by ID.
2026-04-27 12:22:42 -07:00
Chris Lu
7f770b1553 fix(filer): return 503 + Retry-After when remote object not cached yet (#9236)
* extend cache-not-ready handling to filer HTTP path

Mirror the s3api change for the native filer HTTP handlers. When the
filer GET hits a remote-only object whose cache fill hasn't completed,
return 503 Service Unavailable with Retry-After: 5 instead of 500
Internal Error, and treat client disconnects as silent cancellations
rather than logging them as errors.

Adds an ErrCacheNotReady sentinel and a small helper used at the
prepareWriteFn-error sites in ProcessRangeRequest, so the same
classification (cancel / not-ready / other) applies to plain GETs,
single-range, and multi-range requests.
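
A sketch of the classification (sentinel and helper shown with
illustrative names and wording):

  // assumes: import ("context"; "errors"; "net/http")
  var ErrCacheNotReady = errors.New("remote object cache fill not complete")

  func writePrepareError(w http.ResponseWriter, err error) {
      switch {
      case errors.Is(err, context.Canceled):
          // client disconnected: cancel silently, log nothing
      case errors.Is(err, ErrCacheNotReady):
          w.Header().Set("Retry-After", "5")
          http.Error(w, "object is being cached, retry later",
              http.StatusServiceUnavailable)
      default:
          http.Error(w, err.Error(), http.StatusInternalServerError)
      }
  }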

* clear Content-Range on prepareWriteFn error

The single-range path sets Content-Range before calling prepareWriteFn.
If prepareWriteFn fails, http.Error is about to write a fresh body for
503 or 500, but the stale Content-Range header would still go out and
no longer match. Drop it alongside Content-Length in the shared helper
so all current and future callers are covered.

* strip success-path headers and forward NotFound on prepareWriteFn error

When ProcessRangeRequest writes an error response, the previously-set
success headers (Content-Disposition, ETag, Last-Modified, in addition
to Content-Length/Content-Range) shouldn't ride along on the new body.
With ?dl=1 a stale Content-Disposition would even cause browsers to
save the error message under the object's filename. Strip them all in
the shared helper.

Also forward filer_pb.ErrNotFound through the cache-failure branch so a
mid-cache entry deletion surfaces as 404, not as a 503 retry-loop.
Permanent upstream cloud errors (403/404 from the cloud SDK) still come
back as opaque wrapped strings via FetchAndWriteNeedle and remain
mapped to 503; distinguishing those would need a wider refactor.
2026-04-27 01:58:33 -07:00
Lisandro Pin
93247d6de4 Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.

Allows tracking overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

* Export REST `file_{read,write}_failures` metrics on volume servers.
2026-04-24 11:45:21 -07:00
Chris Lu
3d39324bc1 fix(nfs): make Linux mount -t nfs work without client workaround (#9199) (#9201)
* fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199)

The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and
does not register with portmap. Linux mount.nfs queries portmap on port 111
first, so the plain `mount -t nfs host:/export /mnt` form failed with
"portmap query failed" / "requested NFS version or transport protocol is
not supported" against a default `weed nfs` deployment.

- Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with
  TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and
  proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses.
  Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port.

- New CLI flag `-portmap.bind` (empty, disabled by default) to opt into
  the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE
  and must not collide with a system rpcbind.

- Extended `weed nfs -h` help with the two supported ways to mount from
  Linux (client-side portmap bypass, or server-side `-portmap.bind`).

- Startup log now prints a copy-pasteable mount command tailored to
  whether portmap is enabled.

Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP
round-trip against the real listener.

Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0`
the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock
host:/export /mnt`) now succeeds and both read and write work.

* fix(nfs): harden portmap responder per review feedback (#9201)

Addresses three review findings on the portmap responder:

- parseRPCCall: validate opaque_auth length against the record limit
  before applying the XDR 4-byte padding, so a near-uint32-max authLen
  can no longer overflow (authLen + 3) and bypass the bounds check.
  (gemini-code-assist)

- serveTCP/Close: track live TCP connections and evict them on Close()
  so shutdown does not block on idle clients waiting for the read
  deadline to trip. serveTCP also no longer tears the listener down on
  a non-fatal Accept error (e.g. EMFILE); it logs and retries after a
  small back-off. Replaces the atomic.Bool closed flag with a
  mutex-guarded one so closed, conns, and the shutdown transition stay
  consistent. (coderabbit, minor)

- handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s
  in-flight) so a peer that opens the privileged port 111 and stalls
  cannot pin a goroutine indefinitely. (coderabbit, major)

Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP
connection idle and asserts Close() returns within 2s (well under the
30s idle deadline) and that the client sees the eviction.

All existing tests still pass, including under -race.

* fix(nfs): keep portmap UDP responder alive on transient read errors (#9201)

- serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and
  continue instead of returning. Matches how serveTCP now treats
  non-fatal Accept errors so a transient network blip doesn't take
  UDP portmap down until restart. (coderabbit)

- Rename portmapAcceptBackoff -> portmapRetryBackoff now that both
  paths use it.

- pmapProcDump: fix the pre-allocation capacity to match the actual
  encoding (20 bytes per entry + 4-byte terminator), replacing the
  old over-estimate of 24 per entry. No behavior change; just
  documents intent. (coderabbit nit)

* docs(nfs): clarify encodeAcceptedReply body semantics (#9201)

The prior comment said body is "nil when the accept_stat is itself an
error", which was misleading: the PROG_MISMATCH branch already passes
an 8-byte mismatch_info body. Rewrite to enumerate which error
accept_stat values omit the body and call out PROG_MISMATCH as the
exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit)

* fix(nfs): make portmap retry backoff interruptible by Close() (#9201)

serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a
non-fatal listener error. If Close() races in during that sleep, the
goroutine can't be interrupted, so Close() has to wait out the
remaining backoff before wg.Wait() returns.

Add a done channel that Close() closes once, and replace both
time.Sleep calls with a select on ps.done + time.After. The window
was tiny in practice but the select makes shutdown strictly bounded
by Close()'s own work. (coderabbit nit)
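
The resulting shape, sketched (only the done field is shown):

  // assumes: import "time"
  type portmapServer struct{ done chan struct{} } // closed once by Close()

  const portmapRetryBackoff = 50 * time.Millisecond

  func (ps *portmapServer) backoffOrShutdown() bool {
      select {
      case <-ps.done:
          return false // Close() ran: exit the serve loop
      case <-time.After(portmapRetryBackoff):
          return true // backoff elapsed: retry Accept / ReadFromUDP
      }
  }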
2026-04-23 13:53:53 -07:00
steve.wei
1a7ab2ea82 fix(upload): keep Content-MD5 on 204 unchanged writes (#9198)
Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved.
2026-04-23 10:59:59 -07:00
Chris Lu
592d6d6021 fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) (#9193)
* fix(filer/remote): keep re-cache work alive past caller cancellation (#9174)

For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot
finish before the S3 gateway's initial cache wait elapses. When it
does, the gRPC ctx cancellation cascades into the filer's chunk
downloads, the error path calls DeleteUncommittedChunks on every chunk
already written, and the next retry starts over. boto3 splitting the
GET into concurrent ranges (or any client tear-down on first failure)
shortens the window between retries, so the loop never converges.

Detach the caller's ctx with context.WithoutCancel before invoking
the singleflight work so the download runs to completion regardless
of client cancellations. Subsequent waiters — via the in-flight
singleflight, or a fresh retry landing after completion — observe the
cached entry and stream normally.

Same detach pattern is used in filer_server_handlers_write.go:53 and
volume_server_handlers_write.go:51.

* simplify rationale comment

* switch to DoChan so handler can return on caller cancel

Do keeps the handler goroutine blocked for the full detached download
even after the client is gone. DoChan lets the handler select on
ctx.Done() and exit immediately; the singleflight goroutine continues
on bgCtx and the next request either joins it or finds the entry
cached.
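
A minimal sketch of the combined pattern (helper name and wiring
assumed; g is a golang.org/x/sync/singleflight.Group):

  // assumes: import ("context"; "golang.org/x/sync/singleflight")
  func cacheDetached(ctx context.Context, g *singleflight.Group, key string,
      fill func(context.Context) error) error {
      bgCtx := context.WithoutCancel(ctx) // work survives caller cancellation
      ch := g.DoChan(key, func() (interface{}, error) {
          return nil, fill(bgCtx)
      })
      select {
      case res := <-ch:
          return res.Err // fill finished: ours, or a joined in-flight one
      case <-ctx.Done():
          return ctx.Err() // handler exits; the download keeps running
      }
  }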
2026-04-22 17:56:15 -07:00
Chris Lu
f438cc3544 fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard

ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.

Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.

* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard

ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.

The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.
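
The gate itself is small; a sketch with an assumed lookup shape:

  // assumes: import "fmt"
  type ecStore interface {
      hasMountedEcShards(vid uint32) bool
  }

  func checkReceiveFileTarget(s ecStore, vid uint32, shardId int) error {
      if s.hasMountedEcShards(vid) {
          // os.Create would O_TRUNC an inode live readers still hold open
          return fmt.Errorf("volume %d has mounted EC shards; unmount before resending shard %d",
              vid, shardId)
      }
      return nil // never-mounted targets write as before
  }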

Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
  attempts an overwrite, asserts the error response and that the
  on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
  contract that a first write to a target still succeeds.

* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard

Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
2026-04-22 16:47:01 -07:00
Lisandro Pin
fff243d463 Export gRPC file_{read,write}_failures metrics on volume servers. (#9177)
Allows tracking overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-22 11:22:21 -07:00
Chris Lu
c4e1885053 fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug

The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.

Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.

* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement

Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).

Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
  VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
  shard; sendShardFileToDestination forwards that disk id. Metadata
  files (ecx/ecj/vif) inherit the disk of the first data shard
  targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
  validation; disk_id=0 (unset) falls back to the same
  auto-selection pattern as VolumeEcShardsCopy (prefer disk that
  already has shards for this volume, then any HDD with free space,
  then any location with free space).

Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
  disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
  disk_id rejection path.

* fix(volume-rust): honor disk_id in ReceiveFile for EC shards

Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).

* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow

On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.
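
A sketch of the width-safe check:

  // assumes: import "fmt"
  func locationIndex(locationCount int, diskId uint32) (int, error) {
      // compare in uint32: int(diskId) can wrap negative on 32-bit
      // builds and slip past an int-based bounds check
      if diskId >= uint32(locationCount) {
          return 0, fmt.Errorf("disk_id %d out of range (%d locations)",
              diskId, locationCount)
      }
      return int(diskId), nil
  }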

* test(ec): fail invalid-disk_id test on transport error

Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.

* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test

Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.

* fix(test): preserve baseDir/volume path for single-dir clusters

StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
2026-04-22 10:30:13 -07:00
Chris Lu
7f67995c24 chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was
gated behind -mount.p2p (default true). Idle cost is negligible — a
tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the
surface area.

Removes the flag from weed filer, weed server (-filer.mount.p2p), and
weed mini, and always constructs the registry in NewFilerServer. Also
drops the now-dead nil guards in MountRegister/MountList/sweeper and
the TestMountRegister_DisabledIsNoOp case.
2026-04-21 23:00:11 -07:00
Chris Lu
c40db5a52d perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)
* perf(filer): parallelize StreamMutateEntry with path-keyed scheduler

The server handler processed one mutation at a time per stream, capping a
mount's aggregate throughput at ~1/filer_store_service_time regardless of
client concurrency (see issue #9138). With 12 rclone processes this showed
as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via
unary CreateEntry.

Replace the serial for-loop with a per-request goroutine admitted by a
path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor
(weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind
taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier
and descendant-barrier rules. Cross-path mutations run in parallel; same-
path mutations serialize on arrival order; recursive delete and directory
rename act as subtree barriers; directory attribute bumps stay non-barrier
so they do not serialize file writes under them.

Correctness and safety:
- Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource
  use from a single noisy mount.
- syncStream wrapper serializes stream.Send across worker goroutines (gRPC
  Send is not concurrent-safe).
- Handler waits on in-flight workers before returning on recv EOF/error so
  no worker writes to a torn-down stream.
- First fatal Send error from any worker propagates as the handler's
  return, causing the stream to tear down.

Benchmark (2 ms simulated filer-store service delay, 12 client workers):
  serial    : 440 QPS
  sem only  : 4902 QPS (unsafe — reorders same-path ops)
  scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct)

The sem-only number shows the upper bound of raw parallelism; the
scheduler matches it on distinct paths (the realistic 12-rclone case) and
correctly falls back to serial when the workload demands ordering. Peak
concurrent mutations at the handler equals client worker count on the
distinct-path workload and pins to 1 on the same-path workload, as the
scheduler intends.

* perf(filer): decouple StreamMutateEntry admission from receive loop

The previous StreamMutateEntry handler called sched.Admit directly in the
Recv loop. A single request conflicting on path /hot would head-of-line
block stream.Recv, so later requests targeting unrelated paths could not
be received or admitted until /hot drained — cross-path parallelism then
depended on request ordering instead of being a property of the scheduler.

Spawn the worker goroutine immediately on Recv and move sched.Admit into
that goroutine. A new streamMutatePendingLimit (1024) caps total per-
stream outstanding goroutines (pending + active) so a client flooding a
conflicted path cannot explode goroutine count without bound.

Addresses #9171 review comment (coderabbitai, Major).

* fix(filer): reply with EINVAL on unknown StreamMutateEntry request type

Returning nil when req.Request is a future oneof variant or a malformed
request left the client's per-RequestId waiter blocked forever, because
no response was ever sent for that id. Reply with IsLast=true and EINVAL
so the waiter completes with a well-formed error.

Addresses #9171 review comments (gemini-code-assist, coderabbitai).

* fix(filer): make classifyMutation crash-free and correct for deletes

Two issues addressed together because they share one function:

1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without
   a nil guard; an empty create_request / update_request / rename_request
   from a misbehaving client crashed the scheduler. Guard each oneof
   variant and fall back to a "/" barrier; the handler then sends EINVAL
   via the unknown-request path.

2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry-
   Request does not carry IsDirectory, so the previous kindMutateFile
   classification for non-recursive deletes did not conflict with an in-
   flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same
   path — a race in scheduler terms. Classify every delete as
   kindMutateBarrierDir regardless of IsRecursive. The incremental cost
   of a descendant-wait for a non-recursive delete of a non-empty dir is
   negligible since that call fails at the store anyway.

Adds classifyMutation tests for malformed create/update, empty oneof,
and updates the delete-non-recursive case to the new expected kind.

Addresses #9171 review comments (coderabbitai Critical, Major).

* fix(filer): route renameStreamProxy.SendMsg through the wrapping Send

The default pass-through SendMsg on renameStreamProxy bypassed the
syncStream mutex and the StreamMutateEntryResponse wrapping: anything
the rename helpers happened to push via SendMsg would have been emitted
on the wire as the wrong protobuf type and could interleave with other
workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry
Recv loop and could steal unrelated mutation requests.

Route SendMsg through the wrapping Send (rejecting other payload types)
and fail RecvMsg explicitly — the rename logic is a strictly server-push
stream and never calls RecvMsg, so loud failure is safer than silent
stealing.

Addresses #9171 review comment (coderabbitai, Major).

* test(filer): run exactly ops in stream-mutate workloads

perGoroutine := ops / concurrency silently truncated the total when the
values were not divisible — e.g. 2400 ops with 64 workers actually ran
2368 and with 256 workers ran 2304, making the logged "ops per run"
inaccurate and introducing measurement noise that varied across the
concurrency sweep.

Introduce opsForWorker(g, concurrency, ops) which distributes the
remainder to the first (ops % concurrency) workers so the three
workloads (unary, stream sync, stream async) each dispatch exactly
`ops` operations. No changes to the timing methodology.

Addresses #9171 review comment (coderabbitai, Minor).

* fix(filer): enforce per-path FIFO admission in mutateScheduler

sync.Cond.Broadcast wakes every waiter; the first to re-acquire the
mutex wins, so two conflicting same-path admissions could be reordered
by the Go runtime even though they arrived serially on the stream. A
single stream is supposed to carry ordered mutations — the PR's original
#8770 claim — so admission must be FIFO per path.

Replace the single cond with a per-path FIFO queue. Each Admit enqueues
a waiter on every path it touches (primary, and on rename the secondary
too) and blocks on a ready channel. tryPromoteLocked admits any waiter
that is at the head of every queue it joined, passes pathConflictsLocked
against the active-state indexes, and is under concurrencyLimit. Done
removes the heads and re-runs tryPromoteLocked so waiters freed by the
completion move in arrival order.
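
A compressed sketch of the queueing (this version serializes all
same-path work; the real scheduler additionally checks conflict kinds,
joins multiple queues for renames, and enforces concurrencyLimit):

  // assumes: import "sync"
  type waiter struct{ ready chan struct{} }

  type sched struct {
      mu        sync.Mutex
      pathQueue map[string][]*waiter
  }

  func newSched() *sched { return &sched{pathQueue: map[string][]*waiter{}} }

  func (s *sched) Admit(path string) {
      w := &waiter{ready: make(chan struct{})}
      s.mu.Lock()
      s.pathQueue[path] = append(s.pathQueue[path], w)
      s.tryPromoteLocked(path)
      s.mu.Unlock()
      <-w.ready // blocks until w is the head of its queue
  }

  func (s *sched) Done(path string) {
      s.mu.Lock()
      s.pathQueue[path] = s.pathQueue[path][1:] // pop the finished head
      s.tryPromoteLocked(path)                  // wake the next in arrival order
      s.mu.Unlock()
  }

  func (s *sched) tryPromoteLocked(path string) {
      if q := s.pathQueue[path]; len(q) > 0 {
          select {
          case <-q[0].ready: // already admitted
          default:
              close(q[0].ready)
          }
      }
  }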

Side effect: two non-barrier directory updates on the same path now
serialize instead of overlapping. filer.sync's MetadataProcessor
intentionally allows them to overlap because its events come from a
committed log where last-writer-wins coalescing is safe; streamed
mutations carry client operations whose order matters, so we drop that
optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier
release) and TestAdmitSamePathNonBarrierSerializes to cover both.

Also refreshed the kindMutateFile doc comment that still referenced the
pre-#1ecf805f5 "non-recursive delete" classification.

Addresses #9171 review comments (coderabbitai Critical, Minor).

* test(filer): make TestAdmitSamePathFIFO deterministic without sleeps

The previous arrival-ordering sync (send to `started` before calling
Admit, plus a 1 ms sleep) relied on the goroutine actually entering
Admit and reaching the per-path queue during that sleep. Under -race on
a loaded CI that is a real flake source, which is ironic for a test
whose job is catching non-deterministic wake-ups.

Observe the scheduler's own pathQueue length between spawns instead —
waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected
number of waiters (1 barrier holder + i+1 file waiters) is enqueued.
That's the exact event the test wants to synchronise on, so there is no
fudge factor. Verified by `go test -race -count=5`.

Addresses #9171 review comment (coderabbitai, Minor).
2026-04-21 11:25:09 -07:00
Chris Lu
141413ad76 fix(tests): make tests pass on 32-bit architectures (#9168) (#9170)
Two separate failures reported on 32-bit builds (void-linux 4.21):

- weed/server: errorStreamImpl.count (and the same pattern in slowStream
  plus local totalEventsSent/totalSends) was a bare int64 sitting after
  smaller fields, so on 386/ARMv7/mips32 it landed at a 4-byte-aligned
  offset and atomic.AddInt64 panicked with "unaligned 64-bit atomic
  operation". Switched the counters to atomic.Int64, which Go guarantees
  is 8-byte aligned on every architecture (see the sketch after this
  list).

- weed/plugin/worker/iceberg: three equality-delete tests fail on 32-bit
  because the upstream github.com/apache/iceberg-go declares
  manifestEntry.EqualityIDs as *[]int while the Iceberg Avro schema
  defines equality_ids as long, and hamba/avro refuses to map Go int
  onto Avro long when int is 32-bit. Not fixable in seaweedfs, so guard
  the affected tests with a t.Skip() when unsafe.Sizeof(int) < 8 until
  the upstream type is changed to []int32/[]int64.
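
A sketch of the first fix (struct shape illustrative):

  // assumes: import "sync/atomic"
  type errorStreamImpl struct {
      name  string       // smaller fields may precede the counter freely:
      count atomic.Int64 // atomic.Int64 is 8-byte aligned on every GOARCH
  }

  func (s *errorStreamImpl) bump() int64 { return s.count.Add(1) }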
2026-04-20 22:48:01 -07:00
Chris Lu
e24a443b17 peer chunk sharing 2/8: filer mount registry (#9131)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other (see the sketch below).

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).
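
A sketch of the resulting split (field names assumed):

  // assumes: import ("sync"; "time")
  type mountPeerRegistry struct {
      mu      sync.RWMutex
      entries map[string]time.Time // peer address -> expiry
  }

  func (r *mountPeerRegistry) List(now time.Time) (live []string) {
      r.mu.RLock() // concurrent MountList callers no longer serialize
      defer r.mu.RUnlock()
      for addr, expiry := range r.entries {
          if expiry.After(now) {
              live = append(live, addr) // expired entries skipped, not deleted
          }
      }
      return
  }

  func (r *mountPeerRegistry) Sweep(now time.Time) {
      r.mu.Lock() // the sole reclamation path is the only map mutator
      defer r.mu.Unlock()
      for addr, expiry := range r.entries {
          if !expiry.After(now) {
              delete(r.entries, addr)
          }
      }
  }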
2026-04-18 20:03:23 -07:00
Lisandro Pin
6bcacedda9 Export master_disconnections metrics on volume servers. (#9104)
This allows tracking connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-17 15:15:26 -07:00
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.
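
A hedged sketch of the normalization (the real pb.ServerAddress.Equals
may differ in details): strip any ".grpcPort" suffix that follows the
port before comparing.

  // assumes: import "strings"
  func normalize(addr string) string {
      colon := strings.LastIndexByte(addr, ':')
      if colon < 0 {
          return addr
      }
      if dot := strings.IndexByte(addr[colon:], '.'); dot > 0 {
          return addr[:colon+dot] // "host:9333.19333" -> "host:9333"
      }
      return addr
  }

  func addrEquals(a, b string) bool { return normalize(a) == normalize(b) }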

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Chris Lu
08d9193fe1 [nfs] Add NFS (#9067)
* add filer inode foundation for nfs

* nfs command skeleton

* add filer inode index foundation for nfs

* make nfs inode index hardlink aware

* add nfs filehandle and inode lookup plumbing

* add read-only nfs frontend foundation

* add nfs namespace mutation support

* add chunk-backed nfs write path

* add nfs protocol integration tests

* add stale handle nfs coverage

* complete nfs hardlink and failover coverage

* add nfs export access controls

* add nfs metadata cache invalidation

* fix nfs chunk read lookup routing

* fix nfs review findings and rename regression

* address pr 9067 review comments

- filer_inode: fail fast if the snowflake sequencer cannot start, and let
  operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
  to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
  a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
  after CommitTransaction so a rolled-back rename does not strand live
  metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
  filer.path, so the experimental server does not expose the entire
  filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
  layout rather than RFC 1813 LINK3args

* mount: pre-allocate inode in Mkdir and Symlink

Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.

createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.

* sequence: mask snowflake node id on int→uint32 conversion

CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.

* add test/nfs integration suite

Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.

Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.

Framework notes:

- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
  -port.grpc explicitly so the default port+10000 convention cannot
  overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
  starting the NFS server — the NFS server's ensureIndexedEntry check
  requires the export root to exist with a real entry, which filer.Root
  does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
  not try to re-dial via portmapper (which concatenates ":111" onto the
  address).

* ci: add NFS integration test workflow

Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.

* nfs: use append for buffer growth in Write and Truncate

The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.
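
The pattern, for reference (helper name illustrative):

  func growTo(buf []byte, size int) []byte {
      if delta := size - len(buf); delta > 0 {
          // append reuses spare capacity; make+copy reallocates the
          // whole buffer on every extension (O(N^2) over a write loop)
          buf = append(buf, make([]byte, delta)...)
      }
      return buf
  }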

* filer: honor caller cancellation in collectInodeIndexEntries

Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.

* nfs: skip filer read on open when O_TRUNC is set

openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.

* nfs: allow Seek on O_APPEND files and document buffered write cap

Two related cleanups on filesystem.go:

- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
  Seek error ("append-only file descriptors may only seek to EOF")
  prevented read-and-write workloads that legitimately reposition the
  read cursor. Write already snaps the offset to EOF before persisting
  (see seaweedFile Write), so Seek can unconditionally accept any
  offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
  per-file ceiling, the memory footprint it implies, and that the real
  fix for larger whole-file rewrites is streaming / multi-chunk support.

Both changes flagged by gemini-code-assist on PR 9067.

* nfs: guard offset before casting to int in Write

CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.

* nfs: compute Write endOffset in int64 to satisfy CodeQL

The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.
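
A sketch of the guard order (constant value per the earlier cap):

  const maxBufferedWriteSize = int64(64 << 20) // 64 MiB

  func boundedEnd(offset int64, n int) (int, bool) {
      if offset < 0 || offset > maxBufferedWriteSize ||
          int64(n) > maxBufferedWriteSize {
          return 0, false // caller falls through to billy.ErrNotSupported
      }
      end := offset + int64(n) // safe: both operands <= 64 MiB
      if end > maxBufferedWriteSize {
          return 0, false
      }
      return int(end), true // cast down only after clamping
  }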

* ci: drop emojis from nfs-tests workflow summary

Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.

* nfs: annotate remaining DEV_PLAN TODOs with status

Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:

- Reuse chunk cache and mutation stream helpers without FUSE deps:
  checked off — the NFS server imports weed/filer.ReaderCache and
  weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
  as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
  annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
  go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
  "Current Blockers" note.

* test/nfs: share port+readiness helpers with test/testutil

Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.

Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.

* nfs: stream writes to volume servers instead of buffering the whole file

Before this change the NFS write path held the full contents of every
writable open in memory:

  - OpenFile(write) called loadWritableContent which read the existing
    file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
  - each Write() extended content in-place
  - Close() uploaded the whole buffer as a single chunk via
    persistContent + AssignVolume

The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:

  - openFile(write) no longer loads the existing content at all; it
    only issues an UpdateEntry when O_TRUNC is set *and* the file is
    non-empty (so a fresh create+trunc is still zero-RPC)
  - Write() streams the caller's bytes straight to a volume server via
    one AssignVolume + one chunk upload, then atomically appends the
    resulting chunk to the filer entry through mutateEntry. Any
    previously inlined entry.Content is migrated to a chunk in the same
    update so the chunk list becomes the authoritative representation.
  - Truncate() becomes a direct mutateEntry (drop chunks past the new
    size, clip inline content, update FileSize) instead of resizing an
    in-memory buffer.
  - Close() is a no-op because everything was flushed inline.

The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.

Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).

* filer: extract shared gateway upload helper for NFS and WebDAV

Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.

- NFS's saveDataAsChunk is now a thin adapter that assembles the
  GatewayChunkUploadRequest from server options and calls the helper.
  The chunkUploader interface keeps working for test injection because
  the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
  local operation.NewUploader call plus the AssignVolume/UploadOption
  scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
  features that do not fit the shared helper (a pre-allocated file-id
  pool used to skip AssignVolume entirely, and a chunkCache
  write-through at offset 0 so future reads hit the mount's local
  cache), both of which are mount-specific.

Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.

* nfs: remove DESIGN.md and DEV_PLAN.md

The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
2026-04-14 20:48:24 -07:00
Lisandro Pin
67a2810d2d Export start_time_seconds metrics on both master & volume servers. (#9046)
These are to be used to track uptimes.

See https://github.com/seaweedfs/seaweedfs/issues/8535 for details.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-13 09:34:08 -07:00
Chris Lu
edf7d2a074 fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039)
* fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035)

Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to
readPersistedLogBufferPosition, causing LoopProcessLogData to call
ReadPersistedLogBuffer on every 250ms health-check tick when a
subscriber encounters ResumeFromDiskError.  Each call creates an
OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a
readahead goroutine with a 1024-element channel, finds no data, and
returns — 4 times per second even on an idle filer.

This is redundant because SubscribeLocalMetadata already manages disk
reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs
tracking in the outer loop.

Set ReadFromDiskFn back to nil for LocalMetaLogBuffer.  When
LoopProcessLogData encounters ResumeFromDiskError with nil
ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the
caller (SubscribeLocalMetadata), which blocks efficiently on
listenersCond.Wait() instead of polling.

* fix(filer): add gap detection for slow consumers after disk-read stall

When a slow consumer falls behind and LoopProcessLogData returns
ResumeFromDiskError with no flush or read-position progress, there may
be a gap between persisted data and in-memory data (e.g. writes stopped
while consumer was still catching up). Without this, the consumer would
block on listenersCond.Wait() forever.

Skip forward to the earliest in-memory time to resume progress, matching
the gap-handling pattern already used in the shouldReadFromDisk path.

* fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall

The gap-detection block added in the previous commit skips lastReadTime
forward to GetEarliestTime() and continues the outer loop.  On the next
iteration, shouldReadFromDisk becomes true (currentReadTsNs >
lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the
existing gap handler at the top of the loop runs its own gap check.
That check uses readInMemoryLogErr == ResumeFromDiskError as the entry
condition — but readInMemoryLogErr is still the stale error from two
iterations ago.  GetEarliestTime() now equals lastReadTime.Time (we
already advanced to it), so earliestTime.After(lastReadTime.Time) is
false and the handler falls into listenersCond.Wait() — stuck.

Clear readInMemoryLogErr at the gap-skip point, matching the existing
pattern at the earlier gap handler that already clears it for the same
reason.

* fix(log_buffer): GetEarliestTime must include sealed prev buffers

GetEarliestTime previously returned only logBuffer.startTime (the active
buffer's first timestamp).  That is narrower than ReadFromBuffer's
tsMemory, which is the min across active + prev buffers.  Callers using
GetEarliestTime for gap detection after ResumeFromDiskError (the
SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in
the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time
that was *newer* than the real earliest in-memory data.

Impact in SubscribeLocalMetadata's slow-consumer path:
  - tsMemory = earliest prev buffer time (T_prev)
  - GetEarliestTime() = active startTime (T_active, later than T_prev)
  - Consumer position = T1, with T_prev < T1 < T_active
  - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory)
  - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true
  - Skip forward to T_active -- silently drops the prev-buffer data
  - And when T_active happens to equal the stuck position, gap detect
    evaluates false, and the subscriber stalls on listenersCond.Wait()

This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing
failure in CI where the consumer stalled at 10220/20000 after writing
stopped -- the buffer still had data in prev[0..3], but gap detection
was comparing against the active buffer's startTime.

Fix: scan all sealed prev buffers under RLock, return the true minimum
startTime.  Matches the min-of-buffers logic in ReadFromBuffer.
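
A minimal sketch of the corrected scan, with stand-in types; the real
LogBuffer fields may be named differently:

```go
import (
	"sync"
	"time"
)

// minimal stand-ins for the real structs; field names are assumptions
type memBuffer struct {
	startTime time.Time
	size      int
}

type LogBuffer struct {
	sync.RWMutex
	startTime   time.Time   // active buffer's first timestamp
	prevBuffers []memBuffer // sealed buffers; the oldest data may live here
}

// GetEarliestTime returns the oldest timestamp still held in memory,
// across the active buffer and all sealed prev buffers (the same
// minimum ReadFromBuffer computes as tsMemory).
func (lb *LogBuffer) GetEarliestTime() time.Time {
	lb.RLock()
	defer lb.RUnlock()
	earliest := lb.startTime
	for _, prev := range lb.prevBuffers {
		if prev.size == 0 {
			continue // sealed slot holds no data
		}
		if earliest.IsZero() || prev.startTime.Before(earliest) {
			earliest = prev.startTime
		}
	}
	return earliest
}
```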

* test(log_buffer): make DiskReadRetry test deterministic

The previous test added the message via AddToBuffer + ForceFlush and
relied on a race: the second disk read had to happen before the data
was delivered through the in-memory path.  Under the race detector or
on a slow CI runner, the reader is woken by AddToBuffer's notification,
finds the data in the active buffer or its prev slot, and returns after
exactly one disk read — failing the >= 2 disk reads assertion even
though the loop behaved correctly.

Reproduced on master with race detector (2/5 failures).

Rewrite the test to deliver the data exclusively through the disk-read
path: no AddToBuffer, no ForceFlush.  The test waits until the reader
has issued at least one no-op disk read, then atomically flips a
"dataReady" flag.  The reader's next iteration through readFromDiskFn
returns the entry.  This deterministically exercises the retry-loop
behavior the test was originally written to protect, and removes the
in-memory delivery race entirely.
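
The shape of that determinism, as a sketch with illustrative names (not
the actual test code):

```go
import (
	"sync/atomic"
	"time"
)

// the two knobs the rewritten test controls
var (
	dataReady atomic.Bool  // flipped only after a no-op disk read is seen
	diskReads atomic.Int64 // counts every trip through the disk-read path
)

// stands in for readFromDiskFn: returns nothing until the flag flips
func readFromDisk() (entry string, found bool) {
	diskReads.Add(1)
	if !dataReady.Load() {
		return "", false
	}
	return "persisted-entry", true
}

// waitThenPublish is the test side: require at least one no-op disk
// read before publishing, which forces the reader's retry loop to make
// a second disk read and keeps the >= 2 disk-reads assertion deterministic.
func waitThenPublish() {
	for diskReads.Load() == 0 {
		time.Sleep(time.Millisecond)
	}
	dataReady.Store(true)
}
```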
2026-04-11 23:12:54 -07:00
os-pradipbabar
9cae95d749 fix(filer): prevent data corruption during graceful shutdown (#9037)
* fix: wait for in-flight uploads to complete before filer shutdown

Prevents data corruption when SIGTERM is received during active uploads.
The filer now waits for all in-flight operations to complete before
calling the underlying shutdown logic.

This affects all deployment types (Kubernetes, Docker, systemd) and
fixes corruption issues during rolling updates, certificate rotation,
and manual restarts.

Changes:
- Add FilerServer.Shutdown() method with upload wait logic
- Update grace.OnInterrupt hook to use new shutdown method

Fixes data corruption reported by production users during pod restarts.

* fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete

* fix: address review comments on graceful shutdown

- Add a 10s timeout to gRPC GracefulStop to prevent indefinite blocking
  from long-lived streams (falls back to Stop on timeout; see the sketch
  after this list)
- Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within
  Kubernetes default 30s termination grace period
- Move fs.Shutdown() (database close) after Serve() returns instead
  of a separate hook to eliminate race where main goroutine exits
  before the shutdown hook runs
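
The timeout-bounded GracefulStop is a stock grpc-go pattern; a minimal
sketch, with only the 10s figure taken from the commit:

```go
import (
	"time"

	"google.golang.org/grpc"
)

// stopGrpcServer drains in-flight RPCs, but force-stops after 10s so a
// long-lived stream can never block shutdown indefinitely.
func stopGrpcServer(srv *grpc.Server) {
	stopped := make(chan struct{})
	go func() {
		srv.GracefulStop() // waits for active RPCs and streams to finish
		close(stopped)
	}()
	select {
	case <-stopped:
	case <-time.After(10 * time.Second):
		srv.Stop() // streams never drained; close connections immediately
	}
}
```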

* fix: shut down all HTTP servers before filer database close

Address remaining review comments:
- Shut down auxiliary HTTP servers (Unix socket, local listener) during
  graceful shutdown so they can't serve write traffic after the main
  server stops
- Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it
  completes before os.Exit(0), fixing the race between the grace
  goroutine and the main goroutine
- Use sync.Once to ensure fs.Shutdown() runs exactly once regardless
  of whether shutdown is signal-driven or context-driven (MiniCluster)

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-11 21:18:22 -07:00
Chris Lu
b37bbf541a feat(master): drain pending size before marking volume readonly (#9036)
* feat(master): drain pending size before marking volume readonly

When vacuum, volume move, or EC encoding marks a volume readonly,
in-flight assigned bytes may still be pending. This adds a drain step:
immediately remove the volume from the writable list (stop new assigns),
then wait for pending bytes to decay below 4MB, with a 30s timeout.

- Add volumeSizeTracking struct consolidating effectiveSize,
  reportedSize, and compactRevision into a single map
- Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable,
  DrainAndSetVolumeReadOnly to VolumeLayout
- UpdateVolumeSize detects compaction via compactRevision change and
  resets effectiveSize instead of decaying
- Wire drain into vacuum (topology_vacuum.go) and volume mark readonly
  (master_grpc_server_volume.go)

* fix: use 2MB pending size drain threshold

* fix: check crowded state on initial UpdateVolumeSize registration

* fix: respect context cancellation in drain, relax test timing

- DrainAndSetVolumeReadOnly now accepts context.Context and returns
  early on cancellation (for gRPC handler timeout/cancel)
- waitForPendingDrain uses select on ctx.Done instead of time.Sleep
  (sketched below)
- Increase concurrent heartbeat test timeout from 10s to 15s for CI
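
A sketch of what a ctx-aware drain wait like this can look like;
GetPendingSize, the 30s window, and the 2MB threshold are from the
commits above, while the polling interval and signature are assumptions:

```go
import (
	"context"
	"time"
)

// a stand-in for VolumeLayout.GetPendingSize from the commit; the real
// method reads the volumeSizeTracking map
type pendingSizeFn func() int64

func waitForPendingDrain(ctx context.Context, pending pendingSizeFn) bool {
	const drainThreshold = 2 * 1024 * 1024 // 2MB, per the follow-up fix
	ticker := time.NewTicker(500 * time.Millisecond) // interval assumed
	defer ticker.Stop()
	deadline := time.NewTimer(30 * time.Second) // drain window from the commit
	defer deadline.Stop()
	for {
		if pending() < drainThreshold {
			return true // pending bytes decayed; safe to mark readonly
		}
		select {
		case <-ctx.Done():
			return false // gRPC handler timed out or was cancelled
		case <-deadline.C:
			return false // drain window expired; caller logs and proceeds
		case <-ticker.C:
		}
	}
}
```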

* fix: use time-based dedup so decay runs even when reported size is unchanged

The value-based dedup (same reportedSize + compactRevision = skip) prevented
decay from running when pending bytes existed but no writes had landed on
disk yet. The reported size stayed the same across heartbeats, so the excess
never decayed.

Fix: dedup replicas within the same heartbeat cycle using a 2-second time
window instead of comparing values. This allows decay to run once per
heartbeat cycle even when the reported size is unchanged.
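
A sketch of the time-window dedup; volumeSizeTracking is named above,
while the field and method are assumptions:

```go
import "time"

// stand-in for the per-volume tracking entry named in the commit;
// lastProcessed is an assumed field
type volumeSizeTracking struct {
	lastProcessed time.Time
}

// shouldProcess dedups replica reports by time, not value: at most one
// decay pass per heartbeat cycle, even when reportedSize is unchanged.
func (t *volumeSizeTracking) shouldProcess(now time.Time) bool {
	if now.Sub(t.lastProcessed) < 2*time.Second {
		return false // another replica already reported this cycle
	}
	t.lastProcessed = now
	return true
}
```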

Also confirmed finding 1 (draining re-add race) is a false positive:
- Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes
- Move/EC: readonlyVolumes flag prevents re-adding during drain

* fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout

The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up
to 30s waiting for pending bytes to decay. In integration tests (and
real clusters during EC encoding), this caused timeouts because multiple
volumes are marked readonly sequentially and heartbeats may not arrive
fast enough to decay pending within the drain window.

Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops
new assigns) and only logs a warning if pending bytes remain. The drain
wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs
inside the master's own goroutine pool.

Remove DrainAndSetVolumeReadOnly as it's no longer used.

* fix: relax test timing, rename test, add post-condition assert

* test: add vacuum integration tests with CI workflow

Full-cluster integration test for vacuum, modeled on the EC integration
tests. Starts a real master + 2 volume servers, uploads data, deletes
entries to create garbage, runs volume.vacuum via shell command, and
verifies garbage cleanup and data integrity.

Test flow:
1. Start cluster (master + 2 volume servers)
2. Upload 10 files to create volume with data
3. Delete 5 files to create ~50% garbage
4. Verify garbage ratio > 10%
5. Run volume.vacuum command
6. Verify garbage cleaned up
7. Verify remaining 5 files are still accessible

CI workflow runs on push/PR to master with 15-minute timeout.
Log collection on failure via artifact upload.

* fix: use 500KB files and delete 75% to exceed vacuum garbage threshold

* fix: add shell lock before vacuum command, fix compilation error

* fix: strengthen vacuum integration test assertions

- waitForServer: use net.DialTimeout instead of grpc.NewClient for a
  real TCP readiness check
- verify_garbage_before_vacuum: t.Fatal instead of warning when no
  garbage detected
- verify_cleanup_after_vacuum: t.Fatal if no server reported the
  volume or cleanup wasn't verified
- verify_remaining_data: read actual file contents via HTTP and
  compare byte-for-byte against original uploaded payloads

* fix: use http.Client with timeout and close body before retry
2026-04-11 18:29:11 -07:00
Chris Lu
10b0bdce02 feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment

Add expected_data_size field to AssignRequest (master proto) and
AssignVolumeRequest (filer proto) so clients can hint how large the
data will be. The master uses this instead of the 1MB default when
tracking pending volume sizes for weighted assignment.

- Add expected_data_size to master.proto AssignRequest
- Add expected_data_size to filer.proto AssignVolumeRequest
- Wire through filer AssignVolume handler
- Wire through HTTP submit handler (uses actual upload size)
- Add ExpectedDataSize to VolumeAssignRequest in operation package
- Topology.PickForWrite accepts optional expectedDataSize parameter

* fix: guard integer conversions in expected_data_size path

- common.go: clamp OriginalDataSize to non-negative before uint64 cast
- topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast
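
Spelled out, the two guards might look like this (helper names are
illustrative, not the actual code):

```go
import "math"

// client side: guard the int64 -> uint64 cast (content length may be -1)
func clampToUint64(n int64) uint64 {
	if n < 0 {
		return 0 // unknown/negative size: fall back to "no hint"
	}
	return uint64(n)
}

// master side: guard the uint64 -> int64 cast before pending-size math
func capToInt64(n uint64) int64 {
	if n > math.MaxInt64 {
		return math.MaxInt64
	}
	return int64(n)
}
```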

* fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize

- HTTP /dir/assign now parses optional "dataSize" query parameter
  and passes it to PickForWrite instead of hardcoded 0
- Add test assertion for PickForWrite with non-zero expectedDataSize
2026-04-11 11:30:47 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
6f036c7015 fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock (#8998)
* fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock

When fastResume is active (single-master + resumeState + non-empty log),
the raft server becomes leader within ~1ms. DoJoinCommand then enters
the leaderLoop's processCommand path, which calls setCommitIndex to
commit all pending entries. The goraft setCommitIndex implementation
returns early when it encounters a JoinCommand entry (to recalculate
quorum), which can prevent the new entry's event channel from being
notified — leaving DoJoinCommand blocked forever.

Each restart appends a new raft:join entry to the log, while the conf
file's commitIndex (only persisted on AddPeer) lags behind. After 3-4
restarts the uncommitted range contains old JoinCommand entries that
trigger the early return before the new entry is reached.

Fix: skip DoJoinCommand when the raft log already has entries (the
server was already joined in a previous run). The fastResume mechanism
handles leader election independently.

* fix(master): handle Hashicorp Raft in HasExistingState

Add Hashicorp Raft support to HasExistingState by checking
AppliedIndex, consistent with how other RaftServer methods
handle both raft implementations.

* fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft

AppliedIndex() reflects in-memory FSM state which starts at 0 before
log replay completes. LastIndex() reads from persisted stable storage,
correctly mirroring the non-Hashicorp IsLogEmpty() check.
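
A sketch of the resulting check, with stand-in struct fields; LastIndex
and IsLogEmpty are the APIs named above, the wrapper shape is assumed:

```go
import (
	hashicorpRaft "github.com/hashicorp/raft"
	goraft "github.com/seaweedfs/raft"
)

// minimal stand-in: the real RaftServer wraps both implementations
type RaftServer struct {
	RaftHashicorp *hashicorpRaft.Raft // nil when running classic goraft
	raftServer    goraft.Server
}

func (s *RaftServer) HasExistingState() bool {
	if s.RaftHashicorp != nil {
		// LastIndex reads persisted stable storage; AppliedIndex would
		// report the in-memory FSM, which is still 0 before log replay.
		return s.RaftHashicorp.LastIndex() > 0
	}
	return !s.raftServer.IsLogEmpty()
}
```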
2026-04-08 21:08:50 -07:00
Lars Lehtonen
8edadf7f4a chore(weed/server): prune unused unexported struct fields (#8980) 2026-04-07 21:24:30 -07:00
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.
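
In spirit, the walk looks like this: a sketch assuming the classic
16-byte .idx entry layout (8-byte needle id, 4-byte offset in 8-byte
units, 4-byte size, big-endian); the tombstone test and the overhead
constant are illustrative, not the exact helpers:

```go
import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
)

// assumed per-needle overhead (header + checksum); the real code derives
// this via GetActualSize, which also pads to 8-byte alignment
const needleOverhead = 16 + 4

func verifyDatIdxConsistency(datPath, idxPath string) error {
	datInfo, err := os.Stat(datPath)
	if err != nil {
		return err
	}
	idxBytes, err := os.ReadFile(idxPath)
	if err != nil {
		return err
	}
	const entrySize = 16 // 8-byte needle id + 4-byte offset + 4-byte size
	for i := 0; i+entrySize <= len(idxBytes); i += entrySize {
		// offsets are stored big-endian, in 8-byte units
		offset := int64(binary.BigEndian.Uint32(idxBytes[i+8 : i+12])) * 8
		size := binary.BigEndian.Uint32(idxBytes[i+12 : i+16])
		if offset == 0 || size == math.MaxUint32 {
			continue // deleted/tombstone entries reference no live data
		}
		if end := offset + int64(size) + needleOverhead; end > datInfo.Size() {
			return fmt.Errorf("idx entry %d: needle [%d, %d) extends past .dat size %d",
				i/entrySize, offset, end, datInfo.Size())
		}
	}
	return nil
}
```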

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this: the needle would extend past the end of the
.dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered;
  see the sketch after this list)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"
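
A sketch of the per-item aging check referenced above, assuming a FIFO
queue of (path, enqueuedAt) pairs; names are illustrative:

```go
import "time"

type folderItem struct {
	path       string
	enqueuedAt time.Time
}

// drainAged pops only the items that have individually aged past the
// delay; the queue is FIFO by enqueue time, so once the head is too
// young, every later item is too.
func drainAged(queue []folderItem, delay time.Duration, now time.Time) (aged, rest []folderItem) {
	for i, item := range queue {
		if now.Sub(item.enqueuedAt) < delay {
			return queue[:i], queue[i:]
		}
	}
	return queue, nil
}
```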

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries, so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
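
Put together across the three fixes above, the loop can be sketched like
this; SetElectionTimeout, State, and the raft.Leader constant are goraft
APIs, while the intervals, bound, and structure are assumptions:

```go
import (
	"time"

	raft "github.com/seaweedfs/raft"
)

// fastResume drops the election timeout so a single resumed master
// self-elects near-instantly, then restores the original value; the
// poll is a ticker (not time.After) and is bounded by a deadline.
func fastResume(s raft.Server, original time.Duration) {
	s.SetElectionTimeout(time.Millisecond)
	defer s.SetElectionTimeout(original) // always restore, even on timeout

	ticker := time.NewTicker(10 * time.Millisecond) // interval assumed
	defer ticker.Stop()
	deadline := time.NewTimer(30 * time.Second) // bound assumed
	defer deadline.Stop()

	for {
		select {
		case <-ticker.C:
			if s.State() == raft.Leader {
				return
			}
		case <-deadline.C:
			return // never spin forever if election doesn't succeed
		}
	}
}
```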
2026-04-04 14:15:56 -07:00
Chris Lu
896114d330 fix(admin): fix master leader link showing incorrect port in Admin UI (#8924)
fix(admin): use gRPC address for current server in RaftListClusterServers

The old Raft implementation was returning the HTTP address
(ms.option.Master) for the current server, while peers used gRPC
addresses (peer.ConnectionString). The Admin UI's GetClusterMasters()
converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress
(port - 10000), which produced a negative port (-667) for the current
server since its address was already in HTTP format (port 9333).

Use ToGrpcAddress() for consistency with both HashicorpRaft (which
stores gRPC addresses) and old Raft peers.

Fixes #8921
2026-04-04 11:50:43 -07:00