seaweedfs

mirror of https://github.com/seaweedfs/seaweedfs.git synced 2026-05-23 18:21:28 +00:00

Author	SHA1	Message	Date
Lisandro Pin	93247d6de4	Export REST file_{read,write}_failures metrics on volume servers (#9215 ) * Export gRPC `file_{read,write}_failures` metrics on volume servers. Allows to track overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. * Export REST `file_{read,write}_failures` metrics on volume servers.	2026-04-24 11:45:21 -07:00
Chris Lu	3d39324bc1	fix(nfs): make Linux `mount -t nfs` work without client workaround (#9199 ) (#9201 ) * fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199) The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and does not register with portmap. Linux mount.nfs queries portmap on port 111 first, so the plain `mount -t nfs host:/export /mnt` form failed with "portmap query failed" / "requested NFS version or transport protocol is not supported" against a default `weed nfs` deployment. - Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses. Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port. - New CLI flag `-portmap.bind` (empty, disabled by default) to opt into the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE and must not collide with a system rpcbind. - Extended `weed nfs -h` help with the two supported ways to mount from Linux (client-side portmap bypass, or server-side `-portmap.bind`). - Startup log now prints a copy-pasteable mount command tailored to whether portmap is enabled. Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP round-trip against the real listener. Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0` the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock host:/export /mnt`) now succeeds and both read and write work. * fix(nfs): harden portmap responder per review feedback (#9201) Addresses three review findings on the portmap responder: - parseRPCCall: validate opaque_auth length against the record limit before applying the XDR 4-byte padding, so a near-uint32-max authLen can no longer overflow (authLen + 3) and bypass the bounds check. (gemini-code-assist) - serveTCP/Close: track live TCP connections and evict them on Close() so shutdown does not block on idle clients waiting for the read deadline to trip. serveTCP also no longer tears the listener down on a non-fatal Accept error (e.g. EMFILE); it logs and retries after a small back-off. Replaces the atomic.Bool closed flag with a mutex-guarded one so closed, conns, and the shutdown transition stay consistent. (coderabbit, minor) - handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s in-flight) so a peer that opens the privileged port 111 and stalls cannot pin a goroutine indefinitely. (coderabbit, major) Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP connection idle and asserts Close() returns within 2s (well under the 30s idle deadline) and that the client sees the eviction. All existing tests still pass, including under -race. * fix(nfs): keep portmap UDP responder alive on transient read errors (#9201) - serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and continue instead of returning. Matches how serveTCP now treats non-fatal Accept errors so a transient network blip doesn't take UDP portmap down until restart. (coderabbit) - Rename portmapAcceptBackoff -> portmapRetryBackoff now that both paths use it. - pmapProcDump: fix the pre-allocation capacity to match the actual encoding (20 bytes per entry + 4-byte terminator), replacing the old over-estimate of 24 per entry. No behavior change; just documents intent. (coderabbit nit) * docs(nfs): clarify encodeAcceptedReply body semantics (#9201) The prior comment said body is "nil when the accept_stat is itself an error", which was misleading: the PROG_MISMATCH branch already passes an 8-byte mismatch_info body. Rewrite to enumerate which error accept_stat values omit the body and call out PROG_MISMATCH as the exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit) * fix(nfs): make portmap retry backoff interruptible by Close() (#9201) serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a non-fatal listener error. If Close() races in during that sleep, the goroutine can't be interrupted, so Close() has to wait out the remaining backoff before wg.Wait() returns. Add a done channel that Close() closes once, and replace both time.Sleep calls with a select on ps.done + time.After. The window was tiny in practice but the select makes shutdown strictly bounded by Close()'s own work. (coderabbit nit)	2026-04-23 13:53:53 -07:00
steve.wei	1a7ab2ea82	fix(upload): keep Content-MD5 on 204 unchanged writes (#9198 ) Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved.	2026-04-23 10:59:59 -07:00
Chris Lu	592d6d6021	fix(filer/remote): keep re-cache work alive past caller cancellation (#9174 ) (#9193 ) * fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot finish before the S3 gateway's initial cache wait elapses. When it does, the gRPC ctx cancellation cascades into the filer's chunk downloads, the error path calls DeleteUncommittedChunks on every chunk already written, and the next retry starts over. boto3 splitting the GET into concurrent ranges (or any client tear-down on first failure) shortens the window between retries, so the loop never converges. Detach the caller's ctx with context.WithoutCancel before invoking the singleflight work so the download runs to completion regardless of client cancellations. Subsequent waiters — via the in-flight singleflight, or a fresh retry landing after completion — observe the cached entry and stream normally. Same detach pattern is used in filer_server_handlers_write.go:53 and volume_server_handlers_write.go:51. * simplify rationale comment * switch to DoChan so handler can return on caller cancel Do keeps the handler goroutine blocked for the full detached download even after the client is gone. DoChan lets the handler select on ctx.Done() and exit immediately; the singleflight goroutine continues on bgCtx and the next request either joins it or finds the entry cached.	2026-04-22 17:56:15 -07:00
Chris Lu	f438cc3544	fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184 ) (#9186 ) * test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard ReceiveFile for an EC shard calls os.Create(filePath) which opens the path with O_TRUNC. When the shard is already mounted, the in-memory EcVolume holds a file descriptor against the same inode, so a second ReceiveFile call for the same (volume, shard) truncates the live shard file beneath the reader. Reproducer: generate and mount shard 0 for a populated volume, capture the on-disk size, then send a smaller payload for the same shard via ReceiveFile. The current handler accepts the overwrite and leaves the shard truncated in place; this test pins that behavior. When the fix lands the server should reject (or rename-then-swap) and this test must be inverted. * fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard ReceiveFile used os.Create on EC shard paths, which opens with O_TRUNC and truncates in place. When an EC shard is already mounted, the in-memory EcVolume holds file descriptors against the same inodes, so the truncation corrupts the live shard beneath any ongoing read. On retries of an EC task this produced the "missing parts" class of errors in #9184. The fix rejects any ReceiveFile for an EC volume that currently has mounted shards. The caller must unmount before retrying — silent truncation is never an acceptable outcome. Non-EC writes and ReceiveFile for volumes that have never been mounted on this server continue to work as before. Tests: - TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard, attempts an overwrite, asserts the error response and that the on-disk file and live reads are undisturbed. - TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case contract that a first write to a target still succeeds. * fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard Mirror the Go-side change: reject receive_file for any EC volume that currently has mounted shards on this server. std::fs::File::create truncates in place and the in-memory EcVolume holds fds on the same inodes, so an overwrite would corrupt live readers.	2026-04-22 16:47:01 -07:00
Lisandro Pin	fff243d463	Export gRPC `file_{read,write}_failures` metrics on volume servers. (#9177 ) Allows to track overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-22 11:22:21 -07:00
Chris Lu	c4e1885053	fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184 ) (#9185 ) * test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug The plugin-worker EC task sends shards via ReceiveFile, which picks Locations[0] as the target directory regardless of the admin planner's TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there is no wire channel to honor the plan. Adds StartSingleVolumeClusterWithDataDirs to the integration framework so tests can launch a volume server with N data directories. The new repro asserts the current (buggy) behavior: sending three distinct EC shards via ReceiveFile leaves all three files in dir[0] and the other dirs empty. When the fix adds disk_id to ReceiveFileInfo, this assertion must flip to verify the planned placement is respected. * fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement Before this change, VolumeServer.ReceiveFile for EC shards always selected the first HDD location (Locations[0]). The plugin-worker EC task had no way to pass the admin planner's per-shard disk assignment — ReceiveFileInfo carried no disk_id field — so every received EC shard piled onto a single disk per destination server. On multi-disk servers this caused uneven load (one disk absorbing all EC shard I/O), frequent ENOSPC retries, and a growing EC backlog under sustained ingest (see issue #9184). Changes: - proto: add disk_id to ReceiveFileInfo, mirroring VolumeEcShardsCopyRequest.disk_id. - worker: DistributeEcShards tracks the planner-assigned disk per shard; sendShardFileToDestination forwards that disk id. Metadata files (ecx/ecj/vif) inherit the disk of the first data shard targeting the same node so they land next to the shards. - server: ReceiveFile honors disk_id when > 0 with bounds validation; disk_id=0 (unset) falls back to the same auto-selection pattern as VolumeEcShardsCopy (prefer disk that already has shards for this volume, then any HDD with free space, then any location with free space). Tests updated: - TestReceiveFileEcShardHonorsDiskID asserts three shards sent with disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively. - TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range disk_id rejection path. * fix(volume-rust): honor disk_id in ReceiveFile for EC shards Mirror the Go-side change: when disk_id > 0 place the EC shard on the requested disk; when unset, auto-select with the same preference order as volume_ec_shards_copy (disk already holding shards, then any HDD, then any disk). * fix(volume): compare disk_id as uint32 to avoid 32-bit overflow On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a high-bit uint32 to a negative int, bypassing the bounds check before the index operation. Compare in the uint32 domain instead. * test(ec): fail invalid-disk_id test on transport error Previously a transport-level error from CloseAndRecv silently passed the test by returning early, masking any real gRPC failure. Fail loudly so only the structured ReceiveFileResponse rejection path counts as a pass. * docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test Documents the load-bearing assumption that shards are never mounted in this test, so loc.FindEcVolume always returns false and auto-select falls through to the first HDD. Saves future readers from re-deriving the expected directory for the DiskId=0 case. * fix(test): preserve baseDir/volume path for single-dir clusters StartSingleVolumeClusterWithDataDirs started naming the data directory volume0 even in the dataDirCount=1 case, which broke Scrub tests that reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile / CorruptEcxFile. Keep the legacy name for single-dir clusters; only use the indexed "volumeN" layout when multiple disks are requested.	2026-04-22 10:30:13 -07:00
Chris Lu	7f67995c24	chore(filer): remove -mount.p2p flag; registry is always on (#9183 ) The filer-side mount peer registry (tier 1 of peer chunk sharing) was gated behind -mount.p2p (default true). Idle cost is negligible — a tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the surface area. Removes the flag from weed filer, weed server (-filer.mount.p2p), and weed mini, and always constructs the registry in NewFilerServer. Also drops the now-dead nil guards in MountRegister/MountList/sweeper and the TestMountRegister_DisabledIsNoOp case.	2026-04-21 23:00:11 -07:00
Chris Lu	c40db5a52d	perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171 ) * perf(filer): parallelize StreamMutateEntry with path-keyed scheduler The server handler processed one mutation at a time per stream, capping a mount's aggregate throughput at ~1/filer_store_service_time regardless of client concurrency (see issue #9138). With 12 rclone processes this showed as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via unary CreateEntry. Replace the serial for-loop with a per-request goroutine admitted by a path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor (weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier and descendant-barrier rules. Cross-path mutations run in parallel; same- path mutations serialize on arrival order; recursive delete and directory rename act as subtree barriers; directory attribute bumps stay non-barrier so they do not serialize file writes under them. Correctness and safety: - Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource use from a single noisy mount. - syncStream wrapper serializes stream.Send across worker goroutines (gRPC Send is not concurrent-safe). - Handler waits on in-flight workers before returning on recv EOF/error so no worker writes to a torn-down stream. - First fatal Send error from any worker propagates as the handler's return, causing the stream to tear down. Benchmark (2 ms simulated filer-store service delay, 12 client workers): serial : 440 QPS sem only : 4902 QPS (unsafe — reorders same-path ops) scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct) The sem-only number shows the upper bound of raw parallelism; the scheduler matches it on distinct paths (the realistic 12-rclone case) and correctly falls back to serial when the workload demands ordering. Peak concurrent mutations at the handler equals client worker count on the distinct-path workload and pins to 1 on the same-path workload, as the scheduler intends. * perf(filer): decouple StreamMutateEntry admission from receive loop The previous StreamMutateEntry handler called sched.Admit directly in the Recv loop. A single request conflicting on path /hot would head-of-line block stream.Recv, so later requests targeting unrelated paths could not be received or admitted until /hot drained — cross-path parallelism then depended on request ordering instead of being a property of the scheduler. Spawn the worker goroutine immediately on Recv and move sched.Admit into that goroutine. A new streamMutatePendingLimit (1024) caps total per- stream outstanding goroutines (pending + active) so a client flooding a conflicted path cannot explode goroutine count without bound. Addresses #9171 review comment (coderabbitai, Major). * fix(filer): reply with EINVAL on unknown StreamMutateEntry request type Returning nil when req.Request is a future oneof variant or a malformed request left the client's per-RequestId waiter blocked forever, because no response was ever sent for that id. Reply with IsLast=true and EINVAL so the waiter completes with a well-formed error. Addresses #9171 review comments (gemini-code-assist, coderabbitai). * fix(filer): make classifyMutation crash-free and correct for deletes Two issues addressed together because they share one function: 1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without a nil guard; an empty create_request / update_request / rename_request from a misbehaving client crashed the scheduler. Guard each oneof variant and fall back to a "/" barrier; the handler then sends EINVAL via the unknown-request path. 2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry- Request does not carry IsDirectory, so the previous kindMutateFile classification for non-recursive deletes did not conflict with an in- flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same path — a race in scheduler terms. Classify every delete as kindMutateBarrierDir regardless of IsRecursive. The incremental cost of a descendant-wait for a non-recursive delete of a non-empty dir is negligible since that call fails at the store anyway. Adds classifyMutation tests for malformed create/update, empty oneof, and updates the delete-non-recursive case to the new expected kind. Addresses #9171 review comments (coderabbitai Critical, Major). * fix(filer): route renameStreamProxy.SendMsg through the wrapping Send The default pass-through SendMsg on renameStreamProxy bypassed the syncStream mutex and the StreamMutateEntryResponse wrapping: anything the rename helpers happened to push via SendMsg would have been emitted on the wire as the wrong protobuf type and could interleave with other workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry Recv loop and could steal unrelated mutation requests. Route SendMsg through the wrapping Send (rejecting other payload types) and fail RecvMsg explicitly — the rename logic is a strictly server-push stream and never calls RecvMsg, so loud failure is safer than silent stealing. Addresses #9171 review comment (coderabbitai, Major). * test(filer): run exactly ops in stream-mutate workloads perGoroutine := ops / concurrency silently truncated the total when the values were not divisible — e.g. 2400 ops with 64 workers actually ran 2368 and with 256 workers ran 2304, making the logged "ops per run" inaccurate and introducing measurement noise that varied across the concurrency sweep. Introduce opsForWorker(g, concurrency, ops) which distributes the remainder to the first (ops % concurrency) workers so the three workloads (unary, stream sync, stream async) each dispatch exactly `ops` operations. No changes to the timing methodology. Addresses #9171 review comment (coderabbitai, Minor). * fix(filer): enforce per-path FIFO admission in mutateScheduler sync.Cond.Broadcast wakes every waiter; the first to re-acquire the mutex wins, so two conflicting same-path admissions could be reordered by the Go runtime even though they arrived serially on the stream. A single stream is supposed to carry ordered mutations — the PR's original #8770 claim — so admission must be FIFO per path. Replace the single cond with a per-path FIFO queue. Each Admit enqueues a waiter on every path it touches (primary, and on rename the secondary too) and blocks on a ready channel. tryPromoteLocked admits any waiter that is at the head of every queue it joined, passes pathConflictsLocked against the active-state indexes, and is under concurrencyLimit. Done removes the heads and re-runs tryPromoteLocked so waiters freed by the completion move in arrival order. Side effect: two non-barrier directory updates on the same path now serialize instead of overlapping. filer.sync's MetadataProcessor intentionally allows them to overlap because its events come from a committed log where last-writer-wins coalescing is safe; streamed mutations carry client operations whose order matters, so we drop that optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier release) and TestAdmitSamePathNonBarrierSerializes to cover both. Also refreshed the kindMutateFile doc comment that still referenced the pre-#1ecf805f5 "non-recursive delete" classification. Addresses #9171 review comments (coderabbitai Critical, Minor). * test(filer): make TestAdmitSamePathFIFO deterministic without sleeps The previous arrival-ordering sync (send to `started` before calling Admit, plus a 1 ms sleep) relied on the goroutine actually entering Admit and reaching the per-path queue during that sleep. Under -race on a loaded CI that is a real flake source, which is ironic for a test whose job is catching non-deterministic wake-ups. Observe the scheduler's own pathQueue length between spawns instead — waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected number of waiters (1 barrier holder + i+1 file waiters) is enqueued. That's the exact event the test wants to synchronise on, so there is no fudge factor. Verified by `go test -race -count=5`. Addresses #9171 review comment (coderabbitai, Minor).	2026-04-21 11:25:09 -07:00
Chris Lu	141413ad76	fix(tests): make tests pass on 32-bit architectures (#9168 ) (#9170 ) Two separate failures reported on 32-bit builds (void-linux 4.21): - weed/server: errorStreamImpl.count (and the same pattern in slowStream plus local totalEventsSent/totalSends) was a bare int64 sitting after smaller fields, so on 386/ARMv7/mips32 it landed at a 4-byte-aligned offset and atomic.AddInt64 panicked with "unaligned 64-bit atomic operation". Switched the counters to atomic.Int64, which Go guarantees is 8-byte aligned on every architecture. - weed/plugin/worker/iceberg: three equality-delete tests fail on 32-bit because the upstream github.com/apache/iceberg-go declares manifestEntry.EqualityIDs as *[]int while the Iceberg Avro schema defines equality_ids as long, and hamba/avro refuses to map Go int onto Avro long when int is 32-bit. Not fixable in seaweedfs, so guard the affected tests with a t.Skip() when unsafe.Sizeof(int) < 8 until the upstream type is changed to []int32/[]int64.	2026-04-20 22:48:01 -07:00
Chris Lu	e24a443b17	peer chunk sharing 2/8: filer mount registry (#9131 ) * proto: define MountRegister/MountList and MountPeer service Adds the wire types for peer chunk sharing between weed mount clients: * filer.proto: MountRegister / MountList RPCs so each mount can heartbeat its peer-serve address into a filer-hosted registry, and refresh the list of peers. Tiny payload; the filer stores only O(fleet_size) state. * mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the mount-to-mount chunk directory. Each fid's directory entry lives on an HRW-assigned mount; announces and lookups route to that mount. No behavior yet — later PRs wire the RPCs into the filer and mount. See design-weed-mount-peer-chunk-sharing.md for the full design. * filer: add mount-server registry behind -peer.registry.enable Implements tier 1 of the peer chunk sharing design: an in-memory registry of live weed mount servers, keyed by peer address, refreshed by MountRegister heartbeats and served by MountList. * weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy sweep on List plus a background sweeper goroutine for bounded memory. * weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC handlers. When -peer.registry.enable is false (the default), both RPCs are silent no-ops so probing older filers is harmless. * -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled wires it through. Phase 1 is single-filer (no cross-filer replication of the registry); mounts that fail over to another filer will re-register on the next heartbeat, so the registry self-heals within one TTL cycle. Part of the peer-chunk-sharing design; no behavior change at runtime until a later PR enables the flag on both filer and mount. * filer: nil-safe peerRegistryEnable + registry hardening Addresses review feedback on PR #9131. * Fix: nil pointer deref in the mini cluster. FilerOptions instances constructed outside weed/command/filer.go (e.g. miniFilerOptions in mini.go) do not populate peerRegistryEnable, so dereferencing the pointer panics at Filer startup. Use the same `nil && deref` idiom already used for distributedLock / writebackCache. * Hardening (gemini review): registry now enforces three invariants: - empty peer_addr is silently rejected (no client-controlled sentinel mass-inserts) - TTL is capped at 1 hour so a runaway client cannot pin entries - new-entry count is capped at 10000 to bound memory; renewals of existing entries are always honored, so a full registry still heartbeats its existing members correctly Covered by new unit tests. * filer: rename -peer.registry.enable flag to -mount.p2p Per review feedback: the old name "peer.registry.enable" leaked the implementation ("registry") into the CLI surface. "mount.p2p" is shorter and describes what it actually controls — whether this filer participates in mount-to-mount peer chunk sharing. Flag renames (all three keep default=true, idle cost is near-zero): -peer.registry.enable -> -mount.p2p (weed filer) -filer.peer.registry.enable -> -filer.mount.p2p (weed mini, weed server) Internal variable names (mountPeerRegistryEnable, MountPeerRegistry) keep their longer form — they describe the component, not the knob. * filer: MountList returns DataCenter + List uses RLock Two review follow-ups on the mount peer registry: * weed/server/filer_grpc_server_mount_peer.go: MountList was dropping the DataCenter on the wire. The whole point of carrying DC separately from Rack is letting the mount-side fetcher re-rank peers by the two-level locality hierarchy (same-rack > same-DC > cross-DC); without DC in the response every remote peer collapsed to "unknown locality." * weed/filer/mount_peer_registry.go: List() was taking a write lock so it could lazy-delete expired entries inline. But MountList is a read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is already wired as the sole reclamation path (same pattern as the mount-side PeerDirectory). Switch List to RLock + filter, let Sweep do the map mutation, so concurrent MountList callers don't serialize on each other. Test updated to reflect the new contract (List no longer mutates the map; Sweep is what drops expired entries).	2026-04-18 20:03:23 -07:00
Lisandro Pin	6bcacedda9	Export `master_disconnections` metrics on volume servers. (#9104 ) This allows to track connection issues and master failovers in real time via Prometheus metrics. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-17 15:15:26 -07:00
Chris Lu	00a2e22478	fix(mount): remove fid pool to stop master over-allocating volumes (#9111 ) * fix(mount): remove fid pool to stop master over-allocating volumes The writeback-cache fid pool pre-allocated file IDs with ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's PickForWrite charges count * expectedDataSize against the volume's effectiveSize, so a full pool refill could charge hundreds of MB against a single volume before any bytes were actually written. That tripped RecordAssign's hard-limit path and eagerly removed volumes from writable, causing the master to grow new volumes even when the real data being written was tiny. Drop the pool entirely. Every chunk upload goes through UploadWithRetry -> AssignVolume with no ExpectedDataSize hint, letting the master fall back to the 1 MB default estimate. The mount->filer grpc connection is already cached in pb.WithGrpcClient (non-streaming mode), so per-chunk AssignVolume is a unary RPC over an existing HTTP/2 stream, not a full dial. Path-based filer.conf storage rules now apply to mount chunk assigns again, which the pool had to skip. Also remove the now-unused operation.UploadWithAssignFunc and its AssignFunc type. * fix(upload): populate ExpectedDataSize from actual chunk bytes UploadWithRetry already buffers the full chunk into `data` before calling AssignVolume, so the real size is known. Previously the assign request went out with ExpectedDataSize=0, making the master fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same over-reservation symptom the pool had, just smaller per call. Stamp ExpectedDataSize = len(data) before the assign RPC when the caller hasn't already set it. This covers mount chunk uploads, filer_copy, filersink, mq/logstore, broker_write, gateway_upload, and nfs — all the UploadWithRetry paths. * fix(assign): pass real ExpectedDataSize at every assign call site After removing the mount fid pool, per-chunk AssignVolume calls went out with ExpectedDataSize=0, making the master fall back to its 1 MB DefaultNeedleSizeEstimate. That's still an over-estimate for small writes. Thread the real payload size through every remaining assign site so RecordAssign charges effectiveSize accurately and stops prematurely marking volumes full. - filer: assignNewFileInfo now takes expectedDataSize and stamps it on both primary and alternate VolumeAssignRequests. Callers pass: - SSE data-to-chunk: len(data) - copy manifest save: len(data) - streamCopyChunk: srcChunk.Size - TUS sub-chunk: bytes read - saveAsChunk (autochunk/manifestize): 0 (small, size unknown until the reader is drained; master uses 1 MB default) - filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize after the adaptive chunkSize is computed. - ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter; upload_chunked.go passes the buffered dataSize at the call site. S3 PUT assignFunc stamps it on the AssignVolumeRequest. - S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize; all seven call sites pass the source chunk's Size. - operation.SubmitFiles / FilePart.Upload: derive per-fid size from FileSize (average for batched requests, real per-chunk size for sequential chunk assigns). - benchmark: pass fileSize. - filer append-to-file: pass len(data). * fix(assign): thread size through SaveDataAsChunkFunctionType The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran AssignVolume before the reader was drained, so it had to pass ExpectedDataSize=0 and fall back to the master's 1 MB default. Add an expectedDataSize parameter to SaveDataAsChunkFunctionType. - mergeIntoManifest already has the serialized manifest bytes, so it passes uint64(len(data)) directly. - Mount's saveDataAsChunk ignores the parameter because it uses UploadWithRetry, which already stamps len(data) on the assign after reading the payload. - webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry path and also ignore the hint. - Filer's saveAsChunk (used for manifestize) plumbs the value to assignNewFileInfo so manifest-chunk assigns get a real size. Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked) pass the chunk size they're about to upload.	2026-04-16 15:51:13 -07:00
Chris Lu	979c54f693	fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089 ) * fix(wdclient,volume): compare master leader with ServerAddress.Equals Raft leader is advertised as host:httpPort.grpcPort, but clients dial host:httpPort. Raw string comparison against VolumeLocation.Leader / HeartbeatResponse.Leader therefore never matches, causing the masterclient and the volume server heartbeat loop to continuously "redirect" to the already-connected master, tearing down the stream and reconnecting. Use ServerAddress.Equals, which normalizes the grpc-port suffix. * fix(filer,mq): compare ServerAddress via Equals in two more sites filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local partition assignment check both compared a wire-supplied address string against the local self ServerAddress with raw string equality. Both are vulnerable to the same plain-vs-host:port.grpcPort mismatch as the masterclient/volume heartbeat sites: filer would bootstrap from itself, and the broker would fail to claim a partition it was actually assigned. Route both through ServerAddress.Equals. * fix(master,shell): more ServerAddress comparisons via Equals - raft_server_handlers.go HealthzHandler: s.serverAddr == leader would skip the child-lock check on the real leader when the two carry different plain/grpc-suffix forms, returning 200 OK instead of 423. - master_server.go SetRaftServer leader-change callback: the Leader() == Name() guard for ensureTopologyId could disagree with topology.IsLeader() (which already uses Equals), so leader-only initialization could be skipped after an election. - command_volume_merge.go isReplicaServer: the -target guard compared user-supplied host:port against NewServerAddressFromDataNode(...) with ==, letting an existing replica slip through when topology carries the embedded gRPC port. All routed through pb.ServerAddress.Equals. * fix(mq,cluster): more ServerAddress comparisons via Equals - broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the partition ownership check gated listing on raw LeaderBroker == BrokerAddress().String(), so listings silently omitted partitions hosted locally when the assignment carried the other host:port / host:port.grpcPort form. - lock_client.go: LockHostMovedTo comparison and the seedFiler fallback guard both used raw string equality against configured filer addresses (which may be plain host:port while LockHostMovedTo comes back suffixed), causing spurious host-change churn and blocking the seed-filer fallback. * fix(mq): more ServerAddress comparisons via Equals - pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct activeBrokers.Get() lookup missed brokers when a persisted assignment carried a different address encoding than the registered broker key, triggering a bogus reassignment on every read/write cycle. Added a findActiveBroker helper that falls back to an Equals-based scan and canonicalizes the assignment in place so later writes are stable. - broker_grpc_lookup.go isLockOwner: used raw string equality between LockOwner() and BrokerAddress().String(), so a lock owner could fail to recognize itself and proxy local lookup/config/admin RPCs away. - pub_client/scheduler.go onEachAssignments: reused publisher jobs only on exact LeaderBroker match, so an encoding flip in lookup results tore down and recreated a stream to the same broker.	2026-04-15 12:29:31 -07:00
Chris Lu	08d9193fe1	[nfs] Add NFS (#9067 ) * add filer inode foundation for nfs * nfs command skeleton * add filer inode index foundation for nfs * make nfs inode index hardlink aware * add nfs filehandle and inode lookup plumbing * add read-only nfs frontend foundation * add nfs namespace mutation support * add chunk-backed nfs write path * add nfs protocol integration tests * add stale handle nfs coverage * complete nfs hardlink and failover coverage * add nfs export access controls * add nfs metadata cache invalidation * fix nfs chunk read lookup routing * fix nfs review findings and rename regression * address pr 9067 review comments - filer_inode: fail fast if the snowflake sequencer cannot start, and let operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID to avoid multi-filer collisions - filer_inode: drop the redundant retry loop in nextInode - filerstore_wrapper: treat inode-index writes/removals as best-effort so a primary store success no longer surfaces as an operation failure - filer_grpc_server_rename: defer overwritten-target chunk deletion until after CommitTransaction so a rolled-back rename does not strand live metadata pointing at freshly deleted chunks - command/nfs: default ip.bind to loopback and require an explicit filer.path, so the experimental server does not expose the entire filer namespace on first run - nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire layout rather than RFC 1813 LINK3args * mount: pre-allocate inode in Mkdir and Symlink Mkdir and Symlink used to send filer_pb.CreateEntryRequest with Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns its own inode in that case, so the filer-side entry ends up with a different inode than the one the mount allocates via inodeToPath.Lookup and returns to the kernel. Once applyLocalMetadataEvent stores the filer's entry in the meta cache, subsequent GetAttr calls read the cached entry and hit the setAttrByPbEntry override at line 197 of weedfs_attr.go, returning the filer-assigned inode instead of the mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91) caught this — it lstat'd a freshly-created directory/symlink, renamed it, lstat'd again, and saw a different inode the second time. createRegularFile already pre-allocates via inodeToPath.AllocateInode and stamps it into the create request. Do the same thing in Mkdir and Symlink so both sides agree on the object identity from the very first request, and so GetAttr's cache path returns the same value as Mkdir / Symlink's initial response. * sequence: mask snowflake node id on int→uint32 conversion CodeQL flagged the unchecked uint32(snowflakeId) cast in NewSnowflakeSequencer as a potential truncation bug when snowflakeId is sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask to the 10 bits the snowflake library actually uses so any caller- supplied int is safely clamped into range. * add test/nfs integration suite Boots a real SeaweedFS cluster (master + volume + filer) plus the experimental `weed nfs` frontend as subprocesses and drives it through the NFSv3 wire protocol via go-nfs-client, mirroring the layout of test/sftp. The tests run without a kernel NFS mount, privileged ports, or any platform-specific tooling. Coverage includes read/write round-trip, mkdir/rmdir, nested directories, rename content preservation, overwrite + explicit truncate, 3 MiB binary file, all-byte binary and empty files, symlink round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity, sequential appends, and readdir-after-remove. Framework notes: - Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes -port.grpc explicitly so the default port+10000 convention cannot overflow uint16 on macOS. - Pre-creates the /nfs_export directory via the filer HTTP API before starting the NFS server — the NFS server's ensureIndexedEntry check requires the export root to exist with a real entry, which filer.Root does not satisfy when the export path is "/". - Reuses the same rpc.Client for mount and target so go-nfs-client does not try to re-dial via portmapper (which concatenates ":111" onto the address). * ci: add NFS integration test workflow Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch the NFS server, the inode filer plumbing it depends on, or the test harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu 22.04 via `make test`. * nfs: use append for buffer growth in Write and Truncate The previous make+copy pattern reallocated the full buffer on every extending write or truncate, giving O(N^2) behaviour for sequential write loops. Switching to `append(f.content, make([]byte, delta)...)` lets Go's amortized growth strategy absorb the repeated extensions. Called out by gemini-code-assist on PR 9067. * filer: honor caller cancellation in collectInodeIndexEntries Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of the inode-index scan if the client disconnects mid-walk. The cleanup is already treated as best-effort by the caller (it logs on error and continues), so a cancelled walk just means the partial index rebuild is skipped — the same failure mode as any other index write error. Flagged as a DoS concern by gemini-code-assist on PR 9067. * nfs: skip filer read on open when O_TRUNC is set openFile used to unconditionally loadWritableContent for every writable open and then discard the buffer if O_TRUNC was set. For large files that is a pointless 64 MiB round-trip. Reorder the branches so we only fetch existing content when the caller intends to keep it, and mark the file dirty right away so the subsequent Close still issues the truncating write. Called out by gemini-code-assist on PR 9067. * nfs: allow Seek on O_APPEND files and document buffered write cap Two related cleanups on filesystem.go: - POSIX only restricts Write on an O_APPEND fd, not lseek. The existing Seek error ("append-only file descriptors may only seek to EOF") prevented read-and-write workloads that legitimately reposition the read cursor. Write already snaps the offset to EOF before persisting (see seaweedFile Write), so Seek can unconditionally accept any offset. Update the unit test that was asserting the old behaviour. - Add a doc comment on maxBufferedWriteSize explaining that it is a per-file ceiling, the memory footprint it implies, and that the real fix for larger whole-file rewrites is streaming / multi-chunk support. Both changes flagged by gemini-code-assist on PR 9067. * nfs: guard offset before casting to int in Write CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as a potential overflow on architectures where `int` is 32-bit. The existing check only bounded the post-cast value, which is too late. Clamp f.offset against maxBufferedWriteSize before the cast and also reject negative/overflowed endOffset results. Both branches fall through to billy.ErrNotSupported, the same behaviour the caller gets today for any out-of-range buffered write. * nfs: compute Write endOffset in int64 to satisfy CodeQL The previous guard bounded f.offset but left len(p) unchecked, so CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width overflow path. Bound len(p) against maxBufferedWriteSize first, do the addition in int64, and only cast down after the total has been clamped against the buffer ceiling. Behaviour is unchanged: any out-of-range write still returns billy.ErrNotSupported. * ci: drop emojis from nfs-tests workflow summary Plain-text step summary per user preference — no decorative glyphs in the NFS CI output or checklist. * nfs: annotate remaining DEV_PLAN TODOs with status Three of the unchecked items are genuine follow-up PRs rather than missing work in this one, and one was actually already done: - Reuse chunk cache and mutation stream helpers without FUSE deps: checked off — the NFS server imports weed/filer.ReaderCache and weed/util/chunk_cache directly with no weed/mount or go-fuse imports. - Extract shared read/write helpers from mount/WebDAV/SFTP: annotated as deferred to a separate refactor PR (touches four packages). - Expand direct data-path writes beyond the 64 MiB buffered fallback: annotated as deferred — requires a streaming WRITE path. - Shared lock state + lock tests: annotated as blocked upstream on go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing "Current Blockers" note. * test/nfs: share port+readiness helpers with test/testutil Drop the per-suite mustPickFreePort and waitForService re-implementations in favor of testutil.MustAllocatePorts (atomic batch allocation; no close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout. Pull testutil in via a local replace directive so this standalone seaweedfs-nfs-tests module can import the in-repo package without a separate release. Subprocess startup is still master + volume + filer + nfs — no switch to weed mini yet, since mini does not know about the nfs frontend. * nfs: stream writes to volume servers instead of buffering the whole file Before this change the NFS write path held the full contents of every writable open in memory: - OpenFile(write) called loadWritableContent which read the existing file into seaweedFile.content up to maxBufferedWriteSize (64 MiB) - each Write() extended content in-place - Close() uploaded the whole buffer as a single chunk via persistContent + AssignVolume The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and even below the cap every Write paid a whole-file-in-memory cost. This PR rewrites the write path to match how `weed filer` and the S3 gateway persist data: - openFile(write) no longer loads the existing content at all; it only issues an UpdateEntry when O_TRUNC is set and the file is non-empty (so a fresh create+trunc is still zero-RPC) - Write() streams the caller's bytes straight to a volume server via one AssignVolume + one chunk upload, then atomically appends the resulting chunk to the filer entry through mutateEntry. Any previously inlined entry.Content is migrated to a chunk in the same update so the chunk list becomes the authoritative representation. - Truncate() becomes a direct mutateEntry (drop chunks past the new size, clip inline content, update FileSize) instead of resizing an in-memory buffer. - Close() is a no-op because everything was flushed inline. The small-file fast path that the filer HTTP handler uses is preserved: if the post-write size still fits in maxInlineWriteSize (4 MiB) and the file has no existing chunks, we rewrite entry.Content directly and skip the volume-server round-trip. This keeps single-shot tiny writes (echo, small edits) cheap while completely removing the 64 MiB cap on larger files. Read() now always reads through the chunk reader instead of a local byte slice, so reads inside the same session see the freshly appended data. Drops the unused seaweedFile.content / dirty fields, the maxBufferedWriteSize constant, and the loadWritableContent helper. Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations to match the new "no extra O_TRUNC UpdateEntry on an empty file" behavior (still 3 updates: Write + Chmod + Truncate). * filer: extract shared gateway upload helper for NFS and WebDAV Three filer-backed gateways (NFS, WebDAV, and mount) each had a local saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry with near-identical bodies: build AssignVolumeRequest, build UploadOption, build genFileUrlFn with optional filerProxy rewriting, call UploadWithRetry, validate the result, and call ToPbFileChunk. Pull that body into filer.SaveGatewayDataAsChunk with a GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate to one implementation. - NFS's saveDataAsChunk is now a thin adapter that assembles the GatewayChunkUploadRequest from server options and calls the helper. The chunkUploader interface keeps working for test injection because the new GatewayChunkUploader interface is structurally identical. - WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the local operation.NewUploader call plus the AssignVolume/UploadOption scaffolding. - mount is intentionally left alone. mount's saveDataAsChunk has two features that do not fit the shared helper (a pre-allocated file-id pool used to skip AssignVolume entirely, and a chunkCache write-through at offset 0 so future reads hit the mount's local cache), both of which are mount-specific. Marks the Phase 2 "extract shared read/write helpers from mount, WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals + NewChunkReaderAtFromClient) was already shared. * nfs: remove DESIGN.md and DEV_PLAN.md The planning documents have served their purpose — all phase 1 and phase 2 items are landed, phase 3 streaming writes are landed, phase 2 shared helpers are extracted, and the two remaining phase 4 items (shared lock state + lock tests) are blocked upstream on github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state RPCs. The running decision log no longer reflects current code and would just drift. The NFS wiki page (https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries the overview, configuration surface, architecture notes, and known limitations; the source is the source of truth for the rest.	2026-04-14 20:48:24 -07:00
Lisandro Pin	67a2810d2d	Export `start_time_seconds` metrics on both master & volume servers. (#9046 ) These are to be used to track uptimes. See https://github.com/seaweedfs/seaweedfs/issues/8535 for details. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>	2026-04-13 09:34:08 -07:00
Chris Lu	edf7d2a074	fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039 ) * fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035) Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to readPersistedLogBufferPosition, causing LoopProcessLogData to call ReadPersistedLogBuffer on every 250ms health-check tick when a subscriber encounters ResumeFromDiskError. Each call creates an OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a readahead goroutine with a 1024-element channel, finds no data, and returns — 4 times per second even on an idle filer. This is redundant because SubscribeLocalMetadata already manages disk reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs tracking in the outer loop. Set ReadFromDiskFn back to nil for LocalMetaLogBuffer. When LoopProcessLogData encounters ResumeFromDiskError with nil ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the caller (SubscribeLocalMetadata), which blocks efficiently on listenersCond.Wait() instead of polling. * fix(filer): add gap detection for slow consumers after disk-read stall When a slow consumer falls behind and LoopProcessLogData returns ResumeFromDiskError with no flush or read-position progress, there may be a gap between persisted data and in-memory data (e.g. writes stopped while consumer was still catching up). Without this, the consumer would block on listenersCond.Wait() forever. Skip forward to the earliest in-memory time to resume progress, matching the gap-handling pattern already used in the shouldReadFromDisk path. * fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall The gap-detection block added in the previous commit skips lastReadTime forward to GetEarliestTime() and continues the outer loop. On the next iteration, shouldReadFromDisk becomes true (currentReadTsNs > lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the existing gap handler at the top of the loop runs its own gap check. That check uses readInMemoryLogErr == ResumeFromDiskError as the entry condition — but readInMemoryLogErr is still the stale error from two iterations ago. GetEarliestTime() now equals lastReadTime.Time (we already advanced to it), so earliestTime.After(lastReadTime.Time) is false and the handler falls into listenersCond.Wait() — stuck. Clear readInMemoryLogErr at the gap-skip point, matching the existing pattern at the earlier gap handler that already clears it for the same reason. * fix(log_buffer): GetEarliestTime must include sealed prev buffers GetEarliestTime previously returned only logBuffer.startTime (the active buffer's first timestamp). That is narrower than ReadFromBuffer's tsMemory, which is the min across active + prev buffers. Callers using GetEarliestTime for gap detection after ResumeFromDiskError (the SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time that was newer than the real earliest in-memory data. Impact in SubscribeLocalMetadata's slow-consumer path: - tsMemory = earliest prev buffer time (T_prev) - GetEarliestTime() = active startTime (T_active, later than T_prev) - Consumer position = T1, with T_prev < T1 < T_active - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory) - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true - Skip forward to T_active -- silently drops the prev-buffer data - And when T_active happens to equal the stuck position, gap detect evaluates false, and the subscriber stalls on listenersCond.Wait() This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing failure in CI where the consumer stalled at 10220/20000 after writing stopped -- the buffer still had data in prev[0..3], but gap detection was comparing against the active buffer's startTime. Fix: scan all sealed prev buffers under RLock, return the true minimum startTime. Matches the min-of-buffers logic in ReadFromBuffer. * test(log_buffer): make DiskReadRetry test deterministic The previous test added the message via AddToBuffer + ForceFlush and relied on a race: the second disk read had to happen before the data was delivered through the in-memory path. Under the race detector or on a slow CI runner, the reader is woken by AddToBuffer's notification, finds the data in the active buffer or its prev slot, and returns after exactly one disk read — failing the >= 2 disk reads assertion even though the loop behaved correctly. Reproduced on master with race detector (2/5 failures). Rewrite the test to deliver the data exclusively through the disk-read path: no AddToBuffer, no ForceFlush. The test waits until the reader has issued at least one no-op disk read, then atomically flips a "dataReady" flag. The reader's next iteration through readFromDiskFn returns the entry. This deterministically exercises the retry-loop behavior the test was originally written to protect, and removes the in-memory delivery race entirely.	2026-04-11 23:12:54 -07:00
os-pradipbabar	9cae95d749	fix(filer): prevent data corruption during graceful shutdown (#9037 ) * fix: wait for in-flight uploads to complete before filer shutdown Prevents data corruption when SIGTERM is received during active uploads. The filer now waits for all in-flight operations to complete before calling the underlying shutdown logic. This affects all deployment types (Kubernetes, Docker, systemd) and fixes corruption issues during rolling updates, certificate rotation, and manual restarts. Changes: - Add FilerServer.Shutdown() method with upload wait logic - Update grace.OnInterrupt hook to use new shutdown method Fixes data corruption reported by production users during pod restarts. * fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete * fix: address review comments on graceful shutdown - Add 10s timeout to gRPC GracefulStop to prevent indefinite blocking from long-lived streams (falls back to Stop on timeout) - Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within Kubernetes default 30s termination grace period - Move fs.Shutdown() (database close) after Serve() returns instead of a separate hook to eliminate race where main goroutine exits before the shutdown hook runs * fix: shut down all HTTP servers before filer database close Address remaining review comments: - Shut down auxiliary HTTP servers (Unix socket, local listener) during graceful shutdown so they can't serve write traffic after the main server stops - Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it completes before os.Exit(0), fixing the race between the grace goroutine and the main goroutine - Use sync.Once to ensure fs.Shutdown() runs exactly once regardless of whether shutdown is signal-driven or context-driven (MiniCluster) --------- Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-04-11 21:18:22 -07:00
Chris Lu	b37bbf541a	feat(master): drain pending size before marking volume readonly (#9036 ) * feat(master): drain pending size before marking volume readonly When vacuum, volume move, or EC encoding marks a volume readonly, in-flight assigned bytes may still be pending. This adds a drain step: immediately remove from writable list (stop new assigns), then wait for pending to decay below 4MB or 30s timeout. - Add volumeSizeTracking struct consolidating effectiveSize, reportedSize, and compactRevision into a single map - Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable, DrainAndSetVolumeReadOnly to VolumeLayout - UpdateVolumeSize detects compaction via compactRevision change and resets effectiveSize instead of decaying - Wire drain into vacuum (topology_vacuum.go) and volume mark readonly (master_grpc_server_volume.go) * fix: use 2MB pending size drain threshold * fix: check crowded state on initial UpdateVolumeSize registration * fix: respect context cancellation in drain, relax test timing - DrainAndSetVolumeReadOnly now accepts context.Context and returns early on cancellation (for gRPC handler timeout/cancel) - waitForPendingDrain uses select on ctx.Done instead of time.Sleep - Increase concurrent heartbeat test timeout from 10s to 15s for CI * fix: use time-based dedup so decay runs even when reported size is unchanged The value-based dedup (same reportedSize + compactRevision = skip) prevented decay from running when pending bytes existed but no writes had landed on disk yet. The reported size stayed the same across heartbeats, so the excess never decayed. Fix: dedup replicas within the same heartbeat cycle using a 2-second time window instead of comparing values. This allows decay to run once per heartbeat cycle even when the reported size is unchanged. Also confirmed finding 1 (draining re-add race) is a false positive: - Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes - Move/EC: readonlyVolumes flag prevents re-adding during drain * fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up to 30s waiting for pending bytes to decay. In integration tests (and real clusters during EC encoding), this caused timeouts because multiple volumes are marked readonly sequentially and heartbeats may not arrive fast enough to decay pending within the drain window. Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops new assigns) and only logs a warning if pending bytes remain. The drain wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs inside the master's own goroutine pool. Remove DrainAndSetVolumeReadOnly as it's no longer used. * fix: relax test timing, rename test, add post-condition assert * test: add vacuum integration tests with CI workflow Full-cluster integration test for vacuum, modeled on the EC integration tests. Starts a real master + 2 volume servers, uploads data, deletes entries to create garbage, runs volume.vacuum via shell command, and verifies garbage cleanup and data integrity. Test flow: 1. Start cluster (master + 2 volume servers) 2. Upload 10 files to create volume with data 3. Delete 5 files to create ~50% garbage 4. Verify garbage ratio > 10% 5. Run volume.vacuum command 6. Verify garbage cleaned up 7. Verify remaining 5 files are still accessible CI workflow runs on push/PR to master with 15-minute timeout. Log collection on failure via artifact upload. * fix: use 500KB files and delete 75% to exceed vacuum garbage threshold * fix: add shell lock before vacuum command, fix compilation error * fix: strengthen vacuum integration test assertions - waitForServer: use net.DialTimeout instead of grpc.NewClient for real TCP readiness check - verify_garbage_before_vacuum: t.Fatal instead of warning when no garbage detected - verify_cleanup_after_vacuum: t.Fatal if no server reported the volume or cleanup wasn't verified - verify_remaining_data: read actual file contents via HTTP and compare byte-for-byte against original uploaded payloads * fix: use http.Client with timeout and close body before retry	2026-04-11 18:29:11 -07:00
Chris Lu	10b0bdce02	feat: pass expected_data_size from clients for size-aware assignment (#9032 ) * feat: pass expected_data_size from clients for size-aware assignment Add expected_data_size field to AssignRequest (master proto) and AssignVolumeRequest (filer proto) so clients can hint how large the data will be. The master uses this instead of the 1MB default when tracking pending volume sizes for weighted assignment. - Add expected_data_size to master.proto AssignRequest - Add expected_data_size to filer.proto AssignVolumeRequest - Wire through filer AssignVolume handler - Wire through HTTP submit handler (uses actual upload size) - Add ExpectedDataSize to VolumeAssignRequest in operation package - Topology.PickForWrite accepts optional expectedDataSize parameter * fix: guard integer conversions in expected_data_size path - common.go: clamp OriginalDataSize to non-negative before uint64 cast - topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast * fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize - HTTP /dir/assign now parses optional "dataSize" query parameter and passes it to PickForWrite instead of hardcoded 0 - Add test assertion for PickForWrite with non-zero expectedDataSize	2026-04-11 11:30:47 -07:00
Chris Lu	e648c76bcf	go fmt	2026-04-10 17:31:14 -07:00
Chris Lu	6f036c7015	fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock (#8998 ) * fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock When fastResume is active (single-master + resumeState + non-empty log), the raft server becomes leader within ~1ms. DoJoinCommand then enters the leaderLoop's processCommand path, which calls setCommitIndex to commit all pending entries. The goraft setCommitIndex implementation returns early when it encounters a JoinCommand entry (to recalculate quorum), which can prevent the new entry's event channel from being notified — leaving DoJoinCommand blocked forever. Each restart appends a new raft:join entry to the log, while the conf file's commitIndex (only persisted on AddPeer) lags behind. After 3-4 restarts the uncommitted range contains old JoinCommand entries that trigger the early return before the new entry is reached. Fix: skip DoJoinCommand when the raft log already has entries (the server was already joined in a previous run). The fastResume mechanism handles leader election independently. * fix(master): handle Hashicorp Raft in HasExistingState Add Hashicorp Raft support to HasExistingState by checking AppliedIndex, consistent with how other RaftServer methods handle both raft implementations. * fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft AppliedIndex() reflects in-memory FSM state which starts at 0 before log replay completes. LastIndex() reads from persisted stable storage, correctly mirroring the non-Hashicorp IsLogEmpty() check.	2026-04-08 21:08:50 -07:00
Lars Lehtonen	8edadf7f4a	chore(weed/server): prune unused unexported struct fields (#8980 )	2026-04-07 21:24:30 -07:00
Chris Lu	940eed0bd3	fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972 ) * fix(ec): generate .ecx before EC shards to prevent data inconsistency In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER the EC shards were generated from .dat. If any write occurred between these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses the read-only check), the .ecx would contain entries pointing to data that doesn't exist in the EC shards, causing "shard too short" and "size mismatch" errors on subsequent reads and scrubs. Fix by generating .ecx FIRST, then snapshotting datFileSize, then encoding EC shards. If a write sneaks in after .ecx generation, the EC shards contain more data than .ecx references — which is harmless (the extra data is simply not indexed). Also snapshot datFileSize before EC encoding to ensure the .vif reflects the same .dat state that .ecx was generated from. Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the race condition by appending data between EC encoding and .ecx generation. * fix: pass actual offset to ReadBytes, improve test quality - Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve correct error metrics and error messages within ReadBytes - Handle Stat() error in assembleFromIntervalsAllowError - Rename TestEcConsistency_DatFileGrowsDuringEncoding to TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size encoding, not concurrent growth) - Update test comment to clarify it reproduces the old buggy sequence - Fix verification loop to advance by readSize for full data coverage * fix(ec): add dat/idx consistency check in worker EC encoding The erasure_coding worker copies .dat and .idx as separate network transfers. If a write lands on the source between these copies, the .idx may have entries pointing past the end of .dat, leading to EC volumes with .ecx entries that reference non-existent shard data. Add verifyDatIdxConsistency() that walks the .idx and verifies no entry's offset+size exceeds the .dat file size. This fails the EC task early with a clear error instead of silently producing corrupt EC volumes. * test(ec): add integration test verifying .ecx/.ecd consistency TestEcIndexConsistencyAfterEncode uploads multiple needles of varying sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then reads every needle back via the EC read path and verifies payload correctness. This catches any inconsistency between .ecx index entries and EC shard data. * fix(test): account for needle overhead in test volume fixture WriteTestVolumeFiles created a .dat of exactly datSize bytes but the .idx entry claimed a needle of that same size. GetActualSize adds header + checksum + timestamp overhead, so the consistency check correctly rejects this as the needle extends past the .dat file. Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry is consistent with the .dat contents. * fix(test): remove flaky shard ID assertion in EC scrub test When shard 0 is truncated on disk after mount, the volume server may detect corruption via parity mismatches (shards 10-13) rather than a direct read failure on shard 0, depending on OS caching/mmap behavior. Replace the brittle shard-0-specific check with a volume ID validation. * fix(test): close upload response bodies and tighten file count assertion Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd leaks during test execution. Also tighten TotalFiles check from >= 1 to == 1 since ecSetup uploads exactly one file.	2026-04-07 19:05:36 -07:00
Chris Lu	a4753b6a3b	S3: delay empty folder cleanup to prevent Spark write failures (#8970 ) * S3: delay empty folder cleanup to prevent Spark write failures (#8963) Empty folders were being cleaned up within seconds, causing Apache Spark (s3a) writes to fail when temporary directories like _temporary/0/task_xxx/ were briefly empty. - Increase default cleanup delay from 5s to 2 minutes - Only process queue items that have individually aged past the delay (previously the entire queue was drained once any item triggered) - Make the delay configurable via filer.toml: [filer.options] s3.empty_folder_cleanup_delay = "2m" * test: increase cleanup wait timeout to match 2m delay The empty folder cleanup delay was increased to 2 minutes, so the Spark integration test needs to wait longer for temporary directories to disappear. * fix: eagerly clean parent directories after empty folder deletion After deleting an empty folder, immediately try to clean its parent rather than relying on cascading metadata events that each re-enter the 2-minute delay queue. This prevents multi-minute waits when cleaning nested temporary directory trees (e.g. Spark's _temporary hierarchy with 3+ levels would take 6m+ vs near-instant). Fixes the CI failure where lingering _temporary parent directories were not cleaned within the test's 3-minute timeout.	2026-04-07 13:20:59 -07:00
Chris Lu	4efe0acaf5	fix(master): fast resume state and default resumeState to true (#8925 ) * fix(master): fast resume state and default resumeState to true When resumeState is enabled in single-master mode, the raft server had existing log entries so the self-join path couldn't promote to leader. The server waited the full election timeout (10-20s) before self-electing. Fix by temporarily setting election timeout to 1ms before Start() when in single-master + resumeState mode with existing log, then restoring the original timeout after leader election. This makes resume near-instant. Also change the default for resumeState from false to true across all CLI commands (master, mini, server) so state is preserved by default. * fix(master): prevent fastResume goroutine from hanging forever Use defer to guarantee election timeout is always restored, and bound the polling loop with a timeout so it cannot spin indefinitely if leader election never succeeds. * fix(master): use ticker instead of time.After in fastResume polling loop	2026-04-04 14:15:56 -07:00
Chris Lu	896114d330	fix(admin): fix master leader link showing incorrect port in Admin UI (#8924 ) fix(admin): use gRPC address for current server in RaftListClusterServers The old Raft implementation was returning the HTTP address (ms.option.Master) for the current server, while peers used gRPC addresses (peer.ConnectionString). The Admin UI's GetClusterMasters() converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress (port - 10000), which produced a negative port (-667) for the current server since its address was already in HTTP format (port 9333). Use ToGrpcAddress() for consistency with both HashicorpRaft (which stores gRPC addresses) and old Raft peers. Fixes #8921	2026-04-04 11:50:43 -07:00
Chris Lu	d1823d3784	fix(s3): include static identities in listing operations (#8903 ) * fix(s3): include static identities in listing operations Static identities loaded from -s3.config file were only stored in the S3 API server's in-memory state. Listing operations (s3.configure shell command, aws iam list-users) queried the credential manager which only returned dynamic identities from the backend store. Register static identities with the credential manager after loading so they are included in LoadConfiguration and ListUsers results, and filtered out before SaveConfiguration to avoid persisting them to the dynamic store. Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896 * fix: avoid mutating caller's config and defensive copies - SaveConfiguration: use shallow struct copy instead of mutating the caller's config.Identities field - SetStaticIdentities: skip nil entries to avoid panics - GetStaticIdentities: defensively copy PolicyNames slice to avoid aliasing the original * fix: filter nil static identities and sync on config reload - SetStaticIdentities: filter nil entries from the stored slice (not just from staticNames) to prevent panics in LoadConfiguration/ListUsers - Extract updateCredentialManagerStaticIdentities helper and call it from both startup and the grace.OnReload handler so the credential manager's static snapshot stays current after config file reloads * fix: add mutex for static identity fields and fix ListUsers for store callers - Add sync.RWMutex to protect staticIdentities/staticNames against concurrent reads during config reload - Revert CredentialManager.ListUsers to return only store users, since internal callers (e.g. DeletePolicy) look up each user in the store and fail on non-existent static entries - Merge static usernames in the filer gRPC ListUsers handler instead, via the new GetStaticUsernames method - Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was failing because DeletePolicy iterated static users that don't exist in the store * fix: show static identities in admin UI and weed shell The admin UI and weed shell s3.configure command query the filer's credential manager via gRPC, which is a separate instance from the S3 server's credential manager. Static identities were only registered on the S3 server's credential manager, so they never appeared in the filer's responses. - Add CredentialManager.LoadS3ConfigFile to parse a static S3 config file and register its identities - Add FilerOptions.s3ConfigFile so the filer can load the same static config that the S3 server uses - Wire s3ConfigFile through in weed mini and weed server modes - Merge static usernames in filer gRPC ListUsers handler - Add CredentialManager.GetStaticUsernames helper - Add sync.RWMutex to protect concurrent access to static identity fields - Avoid importing weed/filer from weed/credential (which pulled in filer store init() registrations and broke test isolation) - Add docker/compose/s3_static_users_example.json * fix(admin): make static users read-only in admin UI Static users loaded from the -s3.config file should not be editable or deletable through the admin UI since they are managed via the config file. - Add IsStatic field to ObjectStoreUser, set from credential manager - Hide edit, delete, and access key buttons for static users in the users table template - Show a "static" badge next to static user names - Return 403 Forbidden from UpdateUser and DeleteUser API handlers when the target user is a static identity * fix(admin): show details for static users GetObjectStoreUserDetails called credentialManager.GetUser which only queries the dynamic store. For static users this returned ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup fails. * fix(admin): load static S3 identities in admin server The admin server has its own credential manager (gRPC store) which is a separate instance from the S3 server's and filer's. It had no static identity data, so IsStaticIdentity returned false (edit/delete buttons shown) and GetStaticIdentity returned nil (details page failed). Pass the -s3.config file path through to the admin server and call LoadS3ConfigFile on its credential manager, matching the approach used for the filer. * fix: use protobuf is_static field instead of passing config file path The previous approach passed -s3.config file path to every component (filer, admin). This is wrong because the admin server should not need to know about S3 config files. Instead, add an is_static field to the Identity protobuf message. The field is set when static identities are serialized (in GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads configuration via GetConfiguration automatically sees which identities are static, without needing the config file. - Add is_static field (tag 8) to iam_pb.Identity proto message - Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile - Admin GetObjectStoreUsers reads identity.IsStatic from proto - Admin IsStaticUser helper loads config via gRPC to check the flag - Filer GetUser gRPC handler falls back to GetStaticIdentity - Remove s3ConfigFile from AdminOptions and NewAdminServer signature	2026-04-03 20:01:28 -07:00
Chris Lu	0798b274dd	feat(s3): add concurrent chunk prefetch for large file downloads (#8917 ) * feat(s3): add concurrent chunk prefetch for large file downloads Add a pipe-based prefetch pipeline that overlaps chunk fetching with response writing during S3 GetObject, SSE downloads, and filer proxy. While chunk N streams to the HTTP response, fetch goroutines for the next K chunks establish HTTP connections to volume servers ahead of time, eliminating the RTT gap between sequential chunk fetches. Uses io.Pipe for minimal memory overhead (~1MB per download regardless of chunk size, vs buffering entire chunks). Also increases the streaming read buffer from 64KB to 256KB to reduce syscall overhead. Benchmark results (64KB chunks, prefetch=4): - 0ms latency: 1058 → 2362 MB/s (2.2× faster) - 5ms latency: 11.0 → 41.7 MB/s (3.8× faster) - 10ms latency: 5.9 → 23.3 MB/s (4.0× faster) - 20ms latency: 3.1 → 12.1 MB/s (3.9× faster) * fix: address review feedback for prefetch pipeline - Fix data race: use chunkPipeResult (pointer) on channel to avoid copying struct while fetch goroutines write to it. Confirmed clean with -race detector. - Remove concurrent map write: retryWithCacheInvalidation no longer updates fileId2Url map. Producer only reads it; consumer never writes. - Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure. - Add local cancellable context so consumer errors (client disconnect) immediately stop the producer and all in-flight fetch goroutines. fix(test): remove dead code and add Range header support in test server - Remove unused allData variable in makeChunksAndServer - Add Range header handling to createTestServer for partial chunk read coverage (206 Partial Content, 416 Range Not Satisfiable) * fix: correct retry condition and goroutine leak in prefetch pipeline - Fix retry condition: use result.fetchErr/result.written instead of copied to decide cache-invalidation retry. The old condition wrongly triggered retry when the fetch succeeded but the response writer failed on the first write (copied==0 despite fetcher having data). Now matches the sequential path (stream.go:197) which checks whether the fetcher itself wrote zero bytes. - Fix goroutine leak: when the producer's send to the results channel is interrupted by context cancellation, the fetch goroutine was already launched but the result was never sent to the channel. The drain loop couldn't handle it. Now waits on result.done before returning so every fetch goroutine is properly awaited.	2026-04-03 19:57:30 -07:00
Chris Lu	995dfc4d5d	chore: remove ~50k lines of unreachable dead code (#8913 ) * chore: remove unreachable dead code across the codebase Remove ~50,000 lines of unreachable code identified by static analysis. Major removals: - weed/filer/redis_lua: entire unused Redis Lua filer store implementation - weed/wdclient/net2, resource_pool: unused connection/resource pool packages - weed/plugin/worker/lifecycle: unused lifecycle plugin worker - weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy, multipart IAM, key rotation, and various SSE helper functions - weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions - weed/mq/offset: unused SQL storage and migration code - weed/worker: unused registry, task, and monitoring functions - weed/query: unused SQL engine, parquet scanner, and type functions - weed/shell: unused EC proportional rebalance functions - weed/storage/erasure_coding/distribution: unused distribution analysis functions - Individual unreachable functions removed from 150+ files across admin, credential, filer, iam, kms, mount, mq, operation, pb, s3api, server, shell, storage, topology, and util packages * fix(s3): reset shared memory store in IAM test to prevent flaky failure TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because the MemoryStore credential backend is a singleton registered via init(). Earlier tests that create anonymous identities pollute the shared store, causing LookupAnonymous() to unexpectedly return true. Fix by calling Reset() on the memory store before the test runs. * style: run gofmt on changed files * fix: restore KMS functions used by integration tests * fix(plugin): prevent panic on send to closed worker session channel The Plugin.sendToWorker method could panic with "send on closed channel" when a worker disconnected while a message was being sent. The race was between streamSession.close() closing the outgoing channel and sendToWorker writing to it concurrently. Add a done channel to streamSession that is closed before the outgoing channel, and check it in sendToWorker's select to safely detect closed sessions without panicking.	2026-04-03 16:04:27 -07:00
Chris Lu	af68449a26	Process .ecj deletions during EC decode and vacuum decoded volume (#8863 ) * Process .ecj deletions during EC decode and vacuum decoded volume (#8798) When decoding EC volumes back to normal volumes, deletions recorded in the .ecj journal were not being applied before computing the dat file size or checking for live needles. This caused the decoded volume to include data for deleted files and could produce false positives in the all-deleted check. - Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first - Vacuum the decoded volume after mounting in ec.decode to compact out deleted needle data from the .dat file - Add integration tests for decoding with non-empty .ecj files * storage: add offline volume compaction helper Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: compact decoded volumes before deleting shards Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ec: address PR review comments - Fall back to data directory for .ecx when idx directory lacks it - Make compaction failure non-fatal during EC decode - Remove misleading "buffer: 10%" from space check error message * ec: collect .ecj from all shard locations during decode Each server's .ecj only contains deletions for needles whose data resides in shards held by that server. Previously, sources with no new data shards to contribute were skipped entirely, losing their .ecj deletion entries. Now .ecj is always appended from every shard location so RebuildEcxFile sees the full set of deletions. * ec: add integration tests for .ecj collection during decode TestEcDecodePreservesDeletedNeedles: verifies that needles deleted via VolumeEcBlobDelete are excluded from the decoded volume. TestEcDecodeCollectsEcjFromPeer: regression test for the fix in collectEcShards. Deletes a needle only on a peer server that holds no new data shards, then verifies the deletion survives decode via .ecj collection. * ec: address review nits in decode and tests - Remove double error wrapping in mountDecodedVolume - Check VolumeUnmount error in peer ecj test - Assert 404 specifically for deleted needles, fail on 5xx --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-01 01:15:26 -07:00
Chris Lu	75a6a34528	dlm: resilient distributed locks via consistent hashing + backup replication (#8860 ) * dlm: replace modulo hashing with consistent hash ring Introduce HashRing with virtual nodes (CRC32-based consistent hashing) to replace the modulo-based hashKeyToServer. When a filer node is removed, only keys that hashed to that node are remapped to the next server on the ring, leaving all other mappings stable. This is the foundation for backup replication — the successor on the ring is always the natural takeover node. * dlm: add Generation and IsBackup fields to Lock Lock now carries IsBackup (whether this node holds the lock as a backup replica) and Generation (a monotonic fencing token that increments on each fresh acquisition, stays the same on renewal). Add helper methods: AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock. * dlm: add ReplicateLock RPC and generation/is_backup proto fields Add generation field to LockResponse for fencing tokens. Add generation and is_backup fields to Lock message. Add ReplicateLock RPC for primary-to-backup lock replication. Add ReplicateLockRequest/ReplicateLockResponse messages. * dlm: add async backup replication to DistributedLockManager Route lock/unlock via consistent hash ring's GetPrimaryAndBackup(). After a successful lock or unlock on the primary, asynchronously replicate the operation to the backup server via ReplicateFunc callback. Single-server deployments skip replication. * dlm: add ReplicateLock handler and backup-aware topology changes Add ReplicateLock gRPC handler for primary-to-backup replication. Revise OnDlmChangeSnapshot to handle three cases on topology change: - Promote backup locks when this node becomes primary - Demote primary locks when this node becomes backup - Transfer locks when this node is neither primary nor backup Wire up SetupDlmReplication during filer server initialization. * dlm: expose generation fencing token in lock client LiveLock now captures the generation from LockResponse and exposes it via Generation() method. Consumers can use this as a fencing token to detect stale lock holders. * dlm: update empty folder cleaner to use consistent hash ring Replace local modulo-based hashKeyToServer with LockRing.GetPrimary() which uses the shared consistent hash ring for folder ownership. * dlm: add unit tests for consistent hash ring Test basic operations, consistency on server removal (only keys from removed server move), backup-is-successor property (backup becomes new primary when primary is removed), and key distribution balance. * dlm: add integration tests for lock replication failure scenarios Test cases: - Primary crash with backup promotion (backup has valid token) - Backup crash with primary continuing - Both primary and backup crash (lock lost, re-acquirable) - Rolling restart across all nodes - Generation fencing token increments on new acquisition - Replication failure (primary still works independently) - Unlock replicates deletion to backup - Lock survives server addition (topology change) - Consistent hashing minimal disruption (only removed server's keys move) * dlm: address PR review findings 1. Causal replication ordering: Add per-lock sequence number (Seq) that increments on every mutation. Backup rejects incoming mutations with seq <= current seq, preventing stale async replications from overwriting newer state. Unlock replication also carries seq and is rejected if stale. 2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to the new primary first and only demotes to backup after a successful TransferLocks RPC. If the transfer fails, the lock stays as primary on this node. 3. SetSnapshot candidateServers leak: Replace the candidateServers map entirely instead of appending, so removed servers don't linger. 4. TransferLocks preserves Generation and Seq: InsertLock now accepts generation and seq parameters. After accepting a transferred lock, the receiving node re-replicates to its backup. 5. Rolling restart test: Add re-replication step after promotion and assert survivedCount > 0. Add TestDLM_StaleReplicationRejected. 6. Mixed-version upgrade note: Add comment on HashRing documenting that all filer nodes must be upgraded together. * dlm: serve renewals locally during transfer window on node join When a new node joins and steals hash ranges from surviving nodes, there's a window between ring update and lock transfer where the client gets redirected to a node that doesn't have the lock yet. Fix: if the ring says primary != self but we still hold the lock locally (non-backup, matching token), serve the renewal/unlock here rather than redirecting. The lock will be transferred by OnDlmChangeSnapshot, and subsequent requests will go to the new primary once the transfer completes. Add tests: - TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption when a node drops and a new one joins (14/100 surviving-node locks disrupted, all handled by transfer logic) - TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on old primary during the transfer window * dlm: master-managed lock ring with stabilization batching The master now owns the lock ring membership. Instead of filers independently reacting to individual ClusterNodeUpdate add/remove events, the master: 1. Tracks filer membership in LockRingManager 2. Batches rapid changes with a 1-second stabilization timer (e.g., a node drop + join within 1 second → single ring update) 3. Broadcasts the complete ring snapshot atomically via the new LockRingUpdate message in KeepConnectedResponse Filers receive the ring as a complete snapshot and apply it via SetSnapshot, ensuring all filers converge to the same ring state without intermediate churn. This eliminates the double-churn problem where a rapid drop+join would fire two separate ring mutations, each triggering lock transfers and disrupting ownership on surviving nodes. * dlm: track ring version, reject stale updates, remove dead code SetSnapshot now takes a version parameter from the master. Stale updates (version < current) are rejected, preventing reordered messages from overwriting a newer ring state. Version 0 is always accepted for bootstrap. Remove AddServer/RemoveServer from LockRing — the ring is now exclusively managed by the master via SetSnapshot. Remove the candidateServers map that was only used by those methods. * dlm: fix SelectLocks data race, advance generation on backup insert - SelectLocks: change RLock to Lock since the function deletes map entries, which is a write operation and causes a data race under RLock. - InsertBackupLock: advance nextGeneration to at least the incoming generation so that after failover promotion, new lock acquisitions get a generation strictly greater than any replicated lock. - Bump replication failure log from V(1) to Warningf for production visibility. * dlm: fix SetSnapshot race, test reliability, timer edge cases - SetSnapshot: hold LockRing lock through both version update and Ring.SetServers() so they're atomic. Prevents a concurrent caller from seeing the new version but applying stale servers. - Transfer window test: search for a key that actually moves primary when filer4 joins, instead of relying on a fixed key that may not. - renewLock redirect: pass the existing token to the new primary instead of empty string, so redirected renewals work correctly. - scheduleBroadcast: check timer.Stop() return value. If the timer already fired, the callback picks up latest state. - FlushPending: only broadcast if timer.Stop() returns true (timer was still pending). If false, the callback is already running. - Fix test comment: "idempotent" → "accepted, state-changing". * dlm: use wall-clock nanoseconds for lock ring version The lock ring version was an in-memory counter that reset to 0 on master restart. A filer that had seen version 5 would reject version 1 from the restarted master. Fix: use time.Now().UnixNano() as the version. This survives master restarts without persistence — the restarted master produces a version greater than any pre-restart value. * dlm: treat expired lock owners as missing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: reject stale lock transfers Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: order replication by generation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * dlm: bootstrap lock ring on reconnect Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-30 23:29:56 -07:00
Chris Lu	4705d8b82b	Fix stale admin lock metric when lock expires and is reacquired (#8859 ) * Fix stale admin lock metric when lock expires and is reacquired (#8857) When a lock expired without an explicit unlock and a different client acquired it, the old client's metric was never cleared, causing multiple clients to appear as simultaneously holding the lock. * Use DeleteLabelValues instead of Set(0) to remove stale metric series Avoids cardinality explosion from accumulated stale series when client names are dynamic.	2026-03-30 18:51:38 -07:00
Chris Lu	ced2236cc6	Adjust rename events metadata format (#8854 ) * rename metadata events * fix subscription filter to use NewEntry.Name for rename path matching The server-side subscription filter constructed the new path using OldEntry.Name instead of NewEntry.Name when checking if a rename event's destination matches the subscriber's path prefix. This could cause events to be incorrectly filtered when a rename changes the file name. * fix bucket events to handle rename of bucket directories onBucketEvents only checked IsCreate and IsDelete. A bucket directory rename via AtomicRenameEntry now emits a single rename event (both OldEntry and NewEntry non-nil), which matched neither check. Handle IsRename by deleting the old bucket and creating the new one. * fix replicator to handle rename events across directory boundaries Two issues fixed: 1. The replicator filtered events by checking if the key (old path) was under the source directory. Rename events now use the old path as key, so renames from outside into the watched directory were silently dropped. Now both old and new paths are checked, and cross-boundary renames are converted to create or delete. 2. NewParentPath was passed to the sink without remapping to the sink's target directory structure, causing the sink to write entries at the wrong location. Now NewParentPath is remapped alongside the key. * fix filer sync to handle rename events crossing directory boundaries The early directory-prefix filter only checked resp.Directory (old parent). Rename events now carry the old parent as Directory, so renames from outside the source path into it were dropped before reaching the existing cross-boundary handling logic. Check both old and new directories against sourcePath and excludePaths so the downstream old-key/new-key logic can properly convert these to create or delete operations. * fix metadata event path matching * fix metadata event consumers for rename targets * Fix replication rename target keys Logical rename events now reach replication sinks with distinct source and target paths.\n\nHandle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too.\n\nAdd focused tests covering non-filer renames, filer rename updates, and the fallback path.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix filer sync rename path scoping Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync.\n\nAlso apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates.\n\nAdd focused tests for boundary matching and rename classification.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix replicator directory boundary checks Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree.\n\nThis prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix.\n\nAdd focused tests for boundary matching and rename classification across sibling/excluded directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix etc-remote rename-out handling Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote.\n\nThis prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths.\n\nAdd focused tests for update/removal classification and sibling-prefix handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Defer rename events until commit Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully.\n\nThis prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit.\n\nAlso serialize notification.Queue swaps in rename tests and add failure-path coverage.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Skip descendant rename target lookups Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent.\n\nThe recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tighten rename review tests Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix HardLinkId format verb in InsertEntryKnownAbsent error HardLinkId is a byte slice. %d prints each byte as a decimal number which is not useful for an identifier. Use %x to match the log line two lines above. * only skip descendant target lookup when source and dest use same store moveFolderSubEntries unconditionally passed skipTargetLookup=true for every descendant. This is safe when all paths resolve to the same underlying store, but with path-specific store configuration a child's destination may map to a different backend that already holds an entry at that path. Use FilerStoreWrapper.SameActualStore to check per-child and fall back to the full CreateEntry path when stores differ. * add nil and create edge-case tests for metadata event scope helpers * extract pathIsEqualOrUnder into util.IsEqualOrUnder Identical implementations existed in both replication/replicator.go and command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the existing FullPath.IsUnder) and remove the duplicates. * use MetadataEventTargetDirectory for new-side directory in filer sync The new-side directory checks and sourceNewKey computation used message.NewParentPath directly. If NewParentPath were empty (legacy events, older filer versions during rolling upgrades), sourceNewKey would be wrong (/filename instead of /dir/filename) and the UpdateEntry parent path rewrite would panic on slice bounds. Derive targetDir once from MetadataEventTargetDirectory, which falls back to resp.Directory when NewParentPath is empty, and use it consistently for all new-side checks and the sink parent path.	2026-03-30 18:25:11 -07:00
Chris Lu	e5ad5e8d4a	fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume (#8836 ) * fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume The gRPC AssignVolume path was applying the filer's default DiskType to the request before calling detectStorageOption. This caused the default to shadow any disk type configured via a filer location-prefix rule, diverging from the HTTP write path which applies the default only when no rule matches. Extract resolveAssignStorageOption to apply the filer default disk type after detectStorageOption, so location-prefix rules take precedence. * fix(filer): apply default disk type after location-prefix resolution in TUS upload path Same class of bug as the gRPC AssignVolume fix: the TUS tusWriteData handler called detectStorageOption0 but never applied the filer's default DiskType when no location-prefix rule matched. This made TUS uploads ignore the -disk flag entirely.	2026-03-29 14:18:24 -07:00
Chris Lu	c2c58419b8	filer.sync: send log file chunk fids to clients for direct volume server reads (#8792 ) * filer.sync: send log file chunk fids to clients for direct volume server reads Instead of the server reading persisted log files from volume servers, parsing entries, and streaming them over gRPC (serial bottleneck), clients that opt in via client_supports_metadata_chunks receive log file chunk references (fids) and read directly from volume servers in parallel. New proto messages: - LogFileChunkRef: chunk fids + timestamp + filer ID for one log file - SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in - SubscribeMetadataResponse.log_file_refs: server sends refs during backlog Server changes: - CollectLogFileRefs: lists log files and returns chunk refs without any volume server I/O (metadata-only operation) - SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs during persisted log phase, then falls back to normal streaming for in-memory events Client changes: - ReadLogFileRefs: reads log files from volume servers, parses entries, filters by path prefix, invokes processEventFn - MetadataFollowOption.LogFileReaderFn: factory for chunk readers, enables metadata chunks when non-nil - Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate refs then process them at the disk→memory transition Backward compatible: old clients don't set the flag, get existing behavior. Ref: #8771 * filer.sync: merge entries across filers in timestamp order on client side ReadLogFileRefs now groups refs by filer ID and merges entries from multiple filers using a min-heap priority queue — the same algorithm the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue. This ensures events are processed in correct timestamp order even when log files from different filers have interleaved timestamps. Single-filer case takes the fast path (no heap allocation). * filer.sync: integration tests for direct-read metadata chunks Three test categories: 1. Merge correctness (TestReadLogFileRefsMergeOrder): Verifies entries from 3 filers are delivered in strict timestamp order, matching the server-side OrderedLogVisitor guarantee. 2. Path filtering (TestReadLogFileRefsPathFilter): Verifies client-side path prefix filtering works correctly. 3. Throughput comparison (TestDirectReadVsServerSideThroughput): 3 filers × 7 files × 300 events = 6300 events, 2ms per file read: server-side: 6300 events 218ms 28,873 events/sec direct-read: 6300 events 51ms 123,566 events/sec (4.3x) parallel: 6300 events 17ms 378,628 events/sec (13.1x) Direct-read eliminates gRPC send overhead per event (4.3x). Parallel per-filer reading eliminates serial file I/O (13.1x). * filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs ReadLogFileRefs now has two levels of I/O overlap: 1. Cross-filer parallelism: one goroutine per filer reads its files concurrently. Entries feed into per-filer channels, merged by the main goroutine via min-heap (same ordering guarantee as the server's OrderedLogVisitor). 2. Within-filer prefetching: while the current file's entries are being consumed by the merge heap, the next file is already being read from the volume server in a background goroutine. Single-filer fast path avoids the heap and channels. Test results (3 filers × 7 files × 300 events, 2ms per file read): server-side sequential: 6300 events 212ms 29,760 events/sec parallel + prefetch: 6300 events 36ms 177,443 events/sec Speedup: 6.0x * filer.sync: address all review comments on metadata chunks PR Critical fixes: - sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream. Ref messages have TsNs=0 and were being incorrectly batched into the Events field by the adaptive batching logic, corrupting ref delivery. - readLogFileEntries: use io.ReadFull instead of reader.Read to prevent partial reads from corrupting size values or protobuf data. - Error handling: only skip chunk-not-found errors (matching server-side isChunkNotFoundError). Other I/O or decode failures are propagated so the follower can retry. High-priority fixes: - CollectLogFileRefs: remove incorrect +24h padding from stopTime. The extra day caused unnecessary log file refs to be collected. - Path filtering: ReadLogFileRefs now accepts PathFilter struct with PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses util.Join for path construction (avoids "//foo" on root). Excludes /.system/log/ internal entries. Matches server-side eachEventNotificationFn filtering logic. Medium-priority fixes: - CollectLogFileRefs: accept context.Context, propagate to ListDirectoryEntries calls for cancellation support. - NewChunkStreamReaderFromLookup: accept context.Context, propagate to doNewChunkStreamReader. Test fixes: - Check error returns from ReadLogFileRefs in all test call sites. --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-27 11:01:29 -07:00
Chris Lu	d97660d0cd	filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791 ) * filer.sync: pipelined subscription with adaptive batching for faster catch-up The SubscribeMetadata pipeline was fully serial: reading a log entry from a volume server, unmarshaling, filtering, and calling stream.Send() all happened one-at-a-time. stream.Send() blocked the entire pipeline until the client acknowledged each event, limiting throughput to ~80 events/sec regardless of the -concurrency setting. Three server-side optimizations that stack: 1. Pipelined sender: decouple stream.Send() from the read loop via a buffered channel (1024 messages). A dedicated goroutine handles gRPC delivery while the reader continues processing the next events. 2. Adaptive batching: when event timestamps are >2min behind wall clock (backlog catch-up), drain multiple events from the channel and pack them into a single stream.Send() using a new `repeated events` field on SubscribeMetadataResponse. When events are recent (real-time), send one-by-one for low latency. Old clients ignore the new field (backward compatible). 3. Persisted log readahead: run the OrderedLogVisitor in a background goroutine so volume server I/O for the next log file overlaps with event processing and gRPC delivery. 4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling in SubscribeMetadata with notification-driven wake-up using the MetaLogBuffer subscriber mechanism, reducing real-time latency from ~1127ms to sub-millisecond. Combined, these create a 3-stage pipeline: [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send] Test results (simulated backlog with 50µs gRPC latency per Send): direct (old): 2100 events 2100 sends 168ms 12,512 events/sec pipelined+batched: 2100 events 14 sends 40ms 52,856 events/sec Speedup: 4.2x single-stream throughput Ref: #8771 * filer.sync: require client opt-in for batch event delivery Add ClientSupportsBatching field to SubscribeMetadataRequest. The server only packs events into the Events batch field when the client explicitly sets this flag to true. Old clients (Java SDK, third-party) that don't set the flag get one-event-per-Send, preserving backward compatibility. All Go callers (FollowMetadata, MetaAggregator) set the flag to true since their recv loops already unpack batched events. * filer.sync: clear batch Events field after Send to release references Prevents the envelope message from holding references to the rest of the batch after gRPC serialization, allowing the GC to collect them sooner. * filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe - pipelinedSender.Send: add case <-s.done to unblock when sender goroutine exits (fixes deadlock when errCh was already consumed by a prior Send). - pipelinedSender.reportErr: remove for-range drain on sendCh that could block indefinitely. Send() now detects exit via s.done instead. - SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the gap-detected-no-memory-data path with event-driven listenersCond.Wait(), consistent with the rest of the subscription paths. - Add TestPipelinedSenderErrorPropagation: verifies error surfaces via Send and Close when the underlying stream fails. - Replace goto with labeled break in test simulatePipeline. * filer.sync: check error returns in test code - direct_send: check slowStream.Send error return - pipelined_batched_send: check sender.Close error return - simulatePipeline: return error from sender.Close, propagate to callers --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-26 23:55:42 -07:00
Chris Lu	3a3fff1399	Fix TUS chunked upload and resume failures (#8783 ) (#8786 ) * Fix TUS chunked upload and resume failures caused by request context cancellation (#8783) The filer's TCP connections use a 10-second inactivity timeout (net_timeout.go). After the TUS PATCH request body is fully consumed, internal operations (assigning file IDs via gRPC to the master, uploading data to volume servers, completing uploads) do not generate any activity on the client connection, so the inactivity timer fires and Go's HTTP server cancels the request context. This caused HTTP 500 errors on PATCH requests where body reading + internal processing exceeded the timeout. Fix by using context.WithoutCancel in TUS create and patch handlers, matching the existing pattern used by assignNewFileInfo. This ensures internal operations complete regardless of client connection state. Fixes seaweedfs/seaweedfs#8783 * Add comment to tusCreateHandler explaining context.WithoutCancel rationale * Run TUS integration tests on all PRs, not just TUS file changes The previous path filter meant these tests only ran when TUS-specific files changed. This allowed regressions from changes to shared infrastructure (net_timeout.go, upload paths, gRPC) to go undetected — which is exactly how the context cancellation bug in #8783 was missed. Matches the pattern used by s3-go-tests.yml.	2026-03-26 14:06:21 -07:00
Lisandro Pin	e5cf2d2a19	Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. (#8360 ) * Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only. Also exposes this option in the shell `volume.scrub` command. * Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`. `417051bb` slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`, so calling them for invalid volume IDs will actually yield that error, instead of checking maintnenance mode first.	2026-03-26 10:20:57 -07:00
Chris Lu	94bfa2b340	mount: stream all filer mutations over single ordered gRPC stream (#8770 ) * filer: add StreamMutateEntry bidi streaming RPC Add a bidirectional streaming RPC that carries all filer mutation types (create, update, delete, rename) over a single ordered stream. This eliminates per-request connection overhead for pipelined operations and guarantees mutation ordering within a stream. The server handler delegates each request to the existing unary handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy stream adapter for rename operations to reuse StreamRenameEntry logic. The is_last field signals completion for multi-response operations (rename sends multiple events per request; create/update/delete always send exactly one response with is_last=true). * mount: add streaming mutation multiplexer (streamMutateMux) Implement a client-side multiplexer that routes all filer mutation RPCs (create, update, delete, rename) over a single bidirectional gRPC stream. Multiple goroutines submit requests through a send channel; a dedicated sendLoop serializes them on the stream; a recvLoop dispatches responses to waiting callers via per-request channels. Key features: - Lazy stream opening on first use - Automatic reconnection on stream failure - Permanent fallback to unary RPCs if filer returns Unimplemented - Monotonic request_id for response correlation - Multi-response support for rename operations (is_last signaling) The mux is initialized on WFS and closed during unmount cleanup. No call sites use it yet — wiring comes in subsequent commits. * mount: route CreateEntry and UpdateEntry through streaming mux Wire all CreateEntry call sites to use wfs.streamCreateEntry() which routes through the StreamMutateEntry stream when available, falling back to unary RPCs otherwise. Also wire Link's UpdateEntry calls through wfs.streamUpdateEntry(). Updated call sites: - flushMetadataToFiler (file flush after write) - Mkdir (directory creation) - Symlink (symbolic link creation) - createRegularFile non-deferred path (Mknod) - flushFileMetadata (periodic metadata flush) - Link (hard link: update source + create link + rollback) * mount: route UpdateEntry and DeleteEntry through streaming mux Wire remaining mutation call sites through the streaming mux: - saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry - Unlink → streamDeleteEntry (replaces RemoveWithResponse) - Rmdir → streamDeleteEntry (replaces RemoveWithResponse) All filer mutations except Rename now go through StreamMutateEntry when the filer supports it, with automatic unary RPC fallback. * mount: route Rename through streaming mux Wire Rename to use streamMutate.Rename() when available, with fallback to the existing StreamRenameEntry unary stream. The streaming mux sends rename as a StreamRenameEntryRequest oneof variant. The server processes it through the existing rename logic and sends multiple StreamRenameEntryResponse events (one per moved entry), with is_last=true on the final response. All filer mutations now go through a single ordered stream. * mount: fix stream mux connection ownership WithGrpcClient(streamingMode=true) closes the gRPC connection when the callback returns, destroying the stream. Own the connection directly via pb.GrpcDial so it stays alive for the stream's lifetime. Close it explicitly in recvLoop on stream failure and in Close on shutdown. * mount: fix rename failure for deferred-create files Three fixes for rename operations over the streaming mux: 1. lookupEntry: fall back to local metadata store when filer returns "not found" for entries in uncached directories. Files created with deferFilerCreate=true exist only in the local leveldb store until flushed; lookupEntry skipped the local store when the parent directory had never been readdir'd, causing rename to fail with ENOENT. 2. Rename: wait for pending async flushes and force synchronous flush of dirty metadata before sending rename to the filer. Covers the writebackCache case where close() defers the flush to a background worker that may not complete before rename fires. 3. StreamMutateEntry: propagate rename errors from server to client. Add error/errno fields to StreamMutateEntryResponse so the mount can map filer errors to correct FUSE status codes instead of silently returning OK. Also fix the existing Rename error handler which could return fuse.OK on unrecognized errors. * mount: fix streaming mux error handling, sendLoop lifecycle, and fallback Address PR review comments: 1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for create/update/delete errors, not just rename. Previously update errors were silently dropped and create/delete errors were only in nested response fields that the client didn't check. 2. Client: check nested error fields in CreateEntry (ErrorCode, Error) and DeleteEntry (Error) responses, matching CreateEntryWithResponse behavior. 3. Fix sendLoop lifecycle: give each stream generation a stopSend channel. recvLoop closes it on error to stop the paired sendLoop. Previously a reconnect left the old sendLoop draining sendCh, breaking ordering. 4. Transparent fallback: stream helpers and doRename fall back to unary RPCs on transport errors (ErrStreamTransport), including the first Unimplemented from ensureStream. Previously the first call failed instead of degrading. 5. Filer rotation in openStream: try all filer addresses on dial failure, matching WithFilerClient behavior. Stop early on Unimplemented. 6. Pass metadata-bearing context to StreamMutateEntry RPC call so sw-client-id header is actually sent. 7. Gate lookupEntry local-cache fallback on open dirty handle or pending async flush to avoid resurrecting deleted/renamed entries. 8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil). 9. Use string matching for rename error-to-errno mapping in the mount to stay portable across Linux/macOS (numeric errno values differ). * mount: make failAllPending idempotent with delete-before-close Change failAllPending to collect pending entries into a local slice (deleting from the sync.Map first) before closing channels. This prevents double-close panics if called concurrently. Also remove the unused err parameter. * mount: add stream generation tracking and teardownStream Introduce a generation counter on streamMutateMux that increments each time a new stream is created. Requests carry the generation they were enqueued for so sendLoop can reject stale requests after reconnect. Add teardownStream(gen) which is idempotent (only acts when gen matches current generation and stream is non-nil). Both sendLoop and recvLoop call it on error, replacing the inline cleanup in recvLoop. sendLoop now actively triggers teardown on send errors instead of silently exiting. ensureStream waits for the prior generation's recvDone before creating a new stream, ensuring all old pending waiters are failed before reconnect. recvLoop now takes the stream, generation, and recvDone channel as parameters to avoid accessing shared fields without the lock. * mount: harden Close to prevent races with teardownStream Nil out stream, cancel, and grpcConn under the lock so that any concurrent teardownStream call from recvLoop/sendLoop becomes a no-op. Call failAllPending before closing sendCh to unblock waiters promptly. Guard recvDone with a nil check for the case where Close is called before any stream was ever opened. * mount: make errCh receive ctx-aware in doUnary and Rename Replace the blocking <-sendReq.errCh with a select that also observes ctx.Done(). If sendLoop exits via stopSend without consuming a buffered request, the caller now returns ctx.Err() instead of blocking forever. The buffered errCh (capacity 1) ensures late acknowledgements from sendLoop don't block the sender. * mount: fix sendLoop/Close race and recvLoop/teardown pending channel race Three related fixes: 1. Stop closing sendCh in Close(). Closing the shared producer channel races with callers who passed ensureStream() but haven't sent yet, causing send-on-closed-channel panics. sendCh is now left open; ensureStream checks m.closed to reject new callers. 2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh() on exit so buffered requests get an ErrStreamTransport on their errCh instead of blocking forever. Close() drains again for any stragglers enqueued between sendLoop's drain and the final shutdown. 3. Move failAllPending from teardownStream into recvLoop's defer. teardownStream (called from sendLoop on send error) was closing pending response channels while recvLoop could be between pending.Load and the channel send — a send-on-closed-channel panic. recvLoop is now the sole closer of pending channels, eliminating the race. Close() waits on recvDone (with cancel() to guarantee Recv unblocks) so pending cleanup always completes. * filer/mount: add debug logging for hardlink lifecycle Add V(0) logging at every point where a HardLinkId is created, stored, read, or deleted to trace orphaned hardlink references. Logging covers: - gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId - FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId - handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count - setHardLink: KvPut with blob size - maybeReadHardLink: V(1) on read attempt and successful decode - DeleteHardLink: counter decrement/deletion events - Mount Link(): when NewHardLinkId is generated and link is created This helps diagnose how a git pack .rev file ended up with a HardLinkId during a clone (no hard links should be involved). * test: add git clone/pull integration test for FUSE mount Shell script that exercises git operations on a SeaweedFS mount: 1. Creates a bare repo on the mount 2. Clones locally, makes 3 commits, pushes to mount 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (files, content, commit hashes) 5. Pushes 2 more commits with renames and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real changes 8. Verifies file content, renames, deletes after pull 9. Checks git log integrity and clean status 27 assertions covering file existence, content, commit hashes, file counts, renames, deletes, and git status. Run against any existing mount: bash test-git-on-mount.sh /path/to/mount * test: add git clone/pull FUSE integration test to CI suite Add TestGitOperations to the existing fuse_integration test framework. The test exercises git's full file operation surface on the mount: 1. Creates a bare repo on the mount (acts as remote) 2. Clones locally, makes 3 commits (files, bulk data, renames), pushes 3. Clones from mount bare repo into an on-mount working dir 4. Verifies clone integrity (content, commit hash, file count) 5. Pushes 2 more commits with new files, renames, and deletes 6. Checks out an older revision on the mount clone 7. Returns to branch and pulls with real fast-forward changes 8. Verifies post-pull state: content, renames, deletes, file counts 9. Checks git log integrity (5 commits) and clean status Runs automatically in the existing fuse-integration.yml CI workflow. * mount: fix permission check with uid/gid mapping The permission checks in createRegularFile() and Access() compared the caller's local uid/gid against the entry's filer-side uid/gid without applying the uid/gid mapper. With -map.uid 501:0, a directory created as uid 0 on the filer would not match the local caller uid 501, causing hasAccess() to fall through to "other" permission bits and reject write access (0755 → other has r-x, no w). Fix: map entry uid/gid from filer-space to local-space before the hasAccess() call so both sides are in the same namespace. This fixes rsync -a failing with "Permission denied" on mkstempat when using uid/gid mapping. * mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to restore local uid/gid, but `outputPbEntry` writes the kernel response before the function returns — so the kernel received filer-side uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent child operations (mkdir, create) because the caller uid (501) doesn't match the directory owner (0), and "other" bits (0755 → r-x) lack write permission. Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal before outputPbEntry, so the kernel gets local uid/gid. Also add nil guards for UidGidMapper in Access and createRegularFile to prevent panics in tests that don't configure a mapper. This fixes rsync -a "Permission denied" on mkpathat for nested directories when using uid/gid mapping. * mount: fix Link outputting filer-side uid/gid to kernel, add nil guards Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink: the kernel received filer-side uid/gid because the defer hadn't run yet when outputPbEntry wrote the response. Also add nil guards for UidGidMapper in Access and createRegularFile so tests without a mapper don't panic. Audit of all outputPbEntry/outputFilerEntry call sites: - Mkdir: fixed in prior commit (explicit map before output) - Symlink: fixed in prior commit (explicit map before output) - Link: fixed here (explicit map before output) - Create (existing file): entry from maybeLoadEntry (already mapped) - Create (deferred): entry has local uid/gid (never mapped to filer) - Create (non-deferred): createRegularFile defer runs before return - Mknod: createRegularFile defer runs before return - Lookup: entry from lookupEntry (already mapped) - GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped) - readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped) - saveEntry: no kernel output - flushMetadataToFiler: no kernel output - flushFileMetadata: no kernel output * test: fix git test for same-filesystem FUSE clone When both the bare repo and working clone live on the same FUSE mount, git's local transport uses hardlinks and cross-repo stat calls that fail on FUSE. Fix: - Use --no-local on clone to disable local transport optimizations - Use reset --hard instead of checkout to stay on branch - Use fetch + reset --hard origin/<branch> instead of git pull to avoid local transport stat failures during fetch * adjust logging * test: use plain git clone/pull to exercise real FUSE behavior Remove --no-local and fetch+reset workarounds. The test should use the same git commands users run (clone, reset --hard, pull) so it reveals real FUSE issues rather than hiding them. * test: enable V(1) logging for filer/mount and collect logs on failure - Run filer and mount with -v=1 so hardlink lifecycle logs (V(0): create/delete/insert, V(1): read attempts) are captured - On test failure, automatically dump last 16KB of all process logs (master, volume, filer, mount) to test output - Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload - Update CI workflow to upload SeaweedFS process logs alongside test output * mount: clone entry for filer flush to prevent uid/gid race flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which returns the file handle's live proto entry pointer, then mutated it in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable) could observe filer-side uid/gid (e.g., 0:0) on the file handle entry and return it to the kernel. The kernel caches these attributes, so subsequent opens by the local user (uid 501) fail with EACCES. Fix: proto.Clone the entry before mapping uid/gid for the filer request. The file handle's live entry is never mutated, so concurrent Lookup always sees local uid/gid. This fixes the intermittent "Permission denied" on .git/FETCH_HEAD after the first git pull on a mount with uid/gid mapping. * mount: add debug logging for stale lock file investigation Add V(0) logging to trace the HEAD.lock recreation issue: - Create: log when O_EXCL fails (file already exists) with uid/gid/mode - completeAsyncFlush: log resolved path, saved path, dirtyMetadata, isDeleted at entry to trace whether async flush fires after rename - flushMetadataToFiler: log the dir/name/fullpath being flushed This will show whether the async flush is recreating the lock file after git renames HEAD.lock → HEAD. * mount: prevent async flush from recreating renamed .lock files When git renames HEAD.lock → HEAD, the async flush from the prior close() can run AFTER the rename and re-insert HEAD.lock into the meta cache via its CreateEntryRequest response event. The next git pull then sees HEAD.lock and fails with "File exists". Fix: add isRenamed flag on FileHandle, set by Rename before waiting for the pending async flush. The async flush checks this flag and skips the metadata flush for renamed files (same pattern as isDeleted for unlinked files). The data pages still flush normally. The Rename handler flushes deferred metadata synchronously (Case 1) before setting isRenamed, ensuring the entry exists on the filer for the rename to proceed. For already-released handles (Case 2), the entry was created by a prior flush. * mount: also mark renamed inodes via entry.Attributes.Inode fallback When GetInode fails (Forget already removed the inode mapping), the Rename handler couldn't find the pending async flush to set isRenamed. The async flush then recreated the .lock file on the filer. Fix: fall back to oldEntry.Attributes.Inode to find the pending async flush when the inode-to-path mapping is gone. Also extract MarkInodeRenamed into a method on FileHandleToInode for clarity. * mount: skip async metadata flush when saved path no longer maps to inode The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock because neither GetInode nor oldEntry.Attributes.Inode could find the inode (Forget already evicted the mapping, and the entry's stored inode was 0). Add a direct check in completeAsyncFlush: before flushing metadata, verify that the saved path still maps to this inode in the inode-to-path table. If the path was renamed or removed (inode mismatch or not found), skip the metadata flush to avoid recreating a stale entry. This catches all rename cases regardless of whether the Rename handler could set the isRenamed flag. * mount: wait for pending async flush in Unlink before filer delete Unlink was deleting the filer entry first, then marking the draining async-flush handle as deleted. The async flush worker could race between these two operations and recreate the just-unlinked entry on the filer. This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to persist after git pull, breaking subsequent git operations. Move the isDeleted marking and add waitForPendingAsyncFlush() before the filer delete so any in-flight flush completes first. Even if the worker raced past the isDeleted check, the wait ensures it finishes before the filer delete cleans up any recreated entry. * mount: reduce async flush and metadata flush log verbosity Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for every file close with writebackCache and are too noisy for normal use. * filer: reduce hardlink debug log verbosity from V(0) to V(4) HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and filer_grpc_server fire on every hardlinked file operation (git pack files use hardlinks extensively) and produce excessive noise. * mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check - weedfs_link.go: hardlink creation logs V(0) → V(4) - weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1) - empty_folder_cleaner.go: "not empty" check log V(0) → V(4) * filer: handle missing hardlink KV as expected, not error A "kv: not found" on hardlink read is normal when the link blob was already cleaned up but a stale entry still references it. Log at V(1) for not-found; keep Error level for actual KV failures. * test: add waitForDir before git pull in FUSE git operations test After git reset --hard, the FUSE mount's metadata cache may need a moment to settle on slow CI. The git pull subprocess (unpack-objects) could fail to stat the working directory. Poll for up to 5s. * Update git_operations_test.go * wait * test: simplify FUSE test framework to use weed mini Replace the 4-process setup (master + volume + filer + mount) with 2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies startup, reduces port allocation, and is faster on CI. * test: fix mini flag -admin → -admin.ui	2026-03-25 20:06:34 -07:00
Chris Lu	0b3867dca3	filer: add structured error codes to CreateEntryResponse (#8767 ) * filer: add FilerError enum and error_code field to CreateEntryResponse Add a machine-readable error code alongside the existing string error field. This follows the precedent set by PublishMessageResponse in the MQ broker proto. The string field is kept for human readability and backward compatibility. Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE, EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS. * filer: add sentinel errors and error code mapping in filer_pb Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in the filer_pb package so both the filer and consumers can reference them without circular imports. Add FilerErrorToSentinel() to map proto error codes to sentinels, and update CreateEntryWithResponse() to check error_code first, falling back to the string-based path for backward compatibility with old servers. * filer: return wrapped sentinel errors and set proto error codes Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using %w). This preserves errors.Is() traversal on the server side. In the gRPC CreateEntry handler, map sentinel errors to the corresponding FilerError proto codes using errors.Is(), setting both resp.Error (string, for backward compat) and resp.ErrorCode (enum). * S3: use errors.Is() with filer sentinels instead of string matching Replace fragile string-based error matching in filerErrorToS3Error and other S3 API consumers with errors.Is() checks against filer_pb sentinel errors. This works because the updated CreateEntryWithResponse helper reconstructs sentinel errors from the proto FilerError code. Update iceberg stage_create and metadata_files to check resp.ErrorCode instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is() for the already-exists check. String matching is retained only for non-filer errors (gRPC transport errors, checksum validation) that don't go through CreateEntryResponse. * filer: remove backward-compat string fallbacks for error codes Clients and servers are always deployed together, so there is no need for backward-compatibility fallback paths that parse resp.Error strings when resp.ErrorCode is unset. Simplify all consumers to rely solely on the structured error code. * iceberg: ensure unknown non-OK error codes are not silently ignored When FilerErrorToSentinel returns nil for an unrecognized error code, return an error including the code and message rather than falling through to return nil. * filer: fix redundant error message and restore error wrapping in helper Use request path instead of resp.Error in the sentinel error format string to avoid duplicating the sentinel message (e.g. "entry already exists: entry already exists"). Restore %w wrapping with errors.New() in the fallback paths so callers can use errors.Is()/errors.As(). * filer: promote file to directory on path conflict instead of erroring S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object) to coexist because S3 has a flat key space. When ensureParentDirectoryEntry finds a parent path that is a file instead of a directory, promote it to a directory by setting ModeDir while preserving the original content and chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry type-change guard. This fixes the S3 compatibility test failures where creating overlapping keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.	2026-03-24 17:08:22 -07:00
Chris Lu	c31e6b4684	Use filer-side copy for mounted whole-file copy_file_range (#8747 ) * Optimize mounted whole-file copy_file_range * Address mounted copy review feedback * Harden mounted copy fast path --------- Co-authored-by: Copilot <copilot@github.com>	2026-03-23 18:35:15 -07:00
Chris Lu	6bf654c25c	fix: keep metadata subscriptions progressing (#8730 ) (#8746 ) * fix: keep metadata subscriptions progressing (#8730) * test: cancel slow metadata writers with parent context * filer: ignore missing persisted log chunks	2026-03-23 15:26:54 -07:00
Chris Lu	15f4a97029	fix: improve raft leader election reliability and failover speed (#8692 ) * fix: clear raft vote state file on non-resume startup The seaweedfs/raft library v1.1.7 added a persistent `state` file for currentTerm and votedFor. When RaftResumeState=false (the default), the log, conf, and snapshot directories are cleared but this state file was not. On repeated restarts, different masters accumulate divergent terms, causing AppendEntries rejections and preventing leader election. Fixes #8690 * fix: recover TopologyId from snapshot before clearing raft state When RaftResumeState=false clears log/conf/snapshot, the TopologyId (used for license validation) was lost. Now extract it from the latest snapshot before cleanup and restore it on the topology. Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared recoverTopologyIdFromState helper in raft_common.go. * fix: stagger multi-master bootstrap delay by peer index Previously all masters used a fixed 1500ms delay before the bootstrap check. Now the delay is proportional to the peer's sorted index with randomization (matching the hashicorp raft path), giving the designated bootstrap node (peer 0) a head start while later peers wait for gRPC servers to be ready. Also adds diagnostic logging showing why DoJoinCommand was or wasn't called, making leader election issues easier to diagnose from logs. * fix: skip unreachable masters during leader reconnection When a master leader goes down, non-leader masters still redirect clients to the stale leader address. The masterClient would follow these redirects, fail, and retry — wasting round-trips each cycle. Now tryAllMasters tracks which masters failed within a cycle and skips redirects pointing to them, reducing log spam and connection overhead during leader failover. * fix: take snapshot after TopologyId generation for recovery After generating a new TopologyId on the leader, immediately take a raft snapshot so the ID can be recovered from the snapshot on future restarts with RaftResumeState=false. Without this, short-lived clusters would lose the TopologyId on restart since no automatic snapshot had been taken yet. * test: add multi-master raft failover integration tests Integration test framework and 5 test scenarios for 3-node master clusters: - TestLeaderConsistencyAcrossNodes: all nodes agree on leader and TopologyId - TestLeaderDownAndRecoverQuickly: leader stops, new leader elected, old leader rejoins as follower - TestLeaderDownSlowRecover: leader gone for extended period, cluster continues with 2/3 quorum - TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered when both restart - TestAllMastersDownAndRestart: full cluster restart, leader elected, all nodes agree on TopologyId * fix: address PR review comments - peerIndex: return -1 (not 0) when self not found, add warning log - recoverTopologyIdFromSnapshot: defer dir.Close() - tests: check GetTopologyId errors instead of discarding them * fix: address review comments on failover tests - Assert no leader after quorum loss (was only logging) - Verify follower cs.Leader matches expected leader via ServerAddress.ToHttpAddress() comparison - Check GetTopologyId error in TestTwoMastersDownAndRestart	2026-03-18 23:28:07 -07:00
Jayshan Raghunandan	1f1eac4f08	feat: improve aio support for admin/volume ingress and fix UI links (#8679 ) * feat: improve allInOne mode support for admin/volume ingress and fix master UI links - Add allInOne support to admin ingress template, matching the pattern used by filer and s3 ingress templates (or-based enablement with ternary service name selection) - Add allInOne support to volume ingress template, which previously required volume.enabled even when the volume server runs within the allInOne pod - Expose admin ports in allInOne deployment and service when allInOne.admin.enabled is set - Add allInOne.admin config section to values.yaml (enabled by default, ports inherit from admin.) - Fix legacy master UI templates (master.html, masterNewRaft.html) to prefer PublicUrl over internal Url when linking to volume server UI. The new admin UI already handles this correctly. fix: revert admin allInOne changes and fix PublicUrl in admin dashboard The admin binary (`weed admin`) is a separate process that cannot run inside `weed server` (allInOne mode). Revert the admin-related allInOne helm chart changes that caused 503 errors on admin ingress. Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to node.Id (internal pod address) instead of the actual public URL. Add public_url field to DataNodeInfo proto message so the topology gRPC response carries the public URL set via -volume.publicUrl flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use HTTP /dir/status to populate PublicUrl in admin dashboard The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL. Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it in both GetClusterTopology and GetClusterVolumeServers code paths. Also reverts the unnecessary proto field additions from the previous commit and cleans up a stray blank line in all-in-one-service.yml. * fix: apply PublicUrl link fix to masterNewRaft.html Match the same conditional logic already applied to master.html — prefer PublicUrl when set and different from Url. * fix: add HTTP timeout and status check to fetchPublicUrlMap Use a 5s-timeout client instead of http.DefaultClient to prevent blocking indefinitely when the master is unresponsive. Also check the HTTP status code before attempting to parse the response body. * fix: fall back to node address when PublicUrl is empty Prevents blank links in the admin dashboard when PublicUrl is not configured, such as in standalone or mixed-version clusters. * fix: log io.ReadAll error in fetchPublicUrlMap --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Chris Lu <chris.lu@gmail.com>	2026-03-18 13:20:55 -07:00
Chris Lu	81369b8a83	improve: large file sync throughput for remote.cache and filer.sync (#8676 ) * improve large file sync throughput for remote.cache and filer.sync Three main throughput improvements: 1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file instead of always starting at 5MB. A 500MB file now uses ~16MB chunks (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk overhead (volume assign, gRPC call, needle write) by 3x. 2. Configurable concurrency at every layer: - remote.cache chunk concurrency: -chunkConcurrency flag (default 8) - remote.cache S3 download concurrency: -downloadConcurrency flag (default raised from 1 to 5 per chunk) - filer.sync chunk concurrency: -chunkConcurrency flag (default 32) 3. S3 multipart download concurrency raised from 1 to 5: the S3 manager downloader was using Concurrency=1, serializing all part downloads within each chunk. This alone can 5x per-chunk download speed. The concurrency values flow through the gRPC request chain: shell command → CacheRemoteObjectToLocalClusterRequest → FetchAndWriteNeedleRequest → S3 downloader Zero values in the request mean "use server defaults", maintaining full backward compatibility with existing callers. Ref #8481 * fix: use full maxMB for chunk size cap and remove loop guard Address review feedback: - Use full maxMB instead of maxMB/2 for maxChunkSize to avoid unnecessarily limiting chunk size for very large files. - Remove chunkSize < maxChunkSize guard from the safety loop so it can always grow past maxChunkSize when needed to stay under 1000 chunks (e.g., extremely large files with small maxMB). * address review feedback: help text, validation, naming, docs - Fix help text for -chunkConcurrency and -downloadConcurrency flags to say "0 = server default" instead of advertising specific numeric defaults that could drift from the server implementation. - Validate chunkConcurrency and downloadConcurrency are within int32 range before narrowing, returning a user-facing error if out of range. - Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions. - Add doc comment to SetChunkConcurrency noting it must be called during initialization before replication goroutines start. - Replace doubling loop in chunk size safety check with direct ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap. * address Copilot review: clamp concurrency, fix chunk count, clarify proto docs - Use ceiling division for chunk count check to avoid overcounting when file size is an exact multiple of chunk size. - Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024 at filer, max 64 at volume server) to prevent excessive goroutines. - Always use ReadFileWithConcurrency when the client supports it, falling back to the implementation's default when value is 0. - Clarify proto comments that download_concurrency only applies when the remote storage client supports it (currently S3). - Include specific server defaults in help text (e.g., "0 = server default 8") so users see the actual values in -h output. * fix data race on executionErr and use %w for error wrapping - Protect concurrent writes to executionErr in remote.cache worker goroutines with a sync.Mutex to eliminate the data race. - Use %w instead of %v in volume_grpc_remote.go error formatting to preserve the error chain for errors.Is/errors.As callers.	2026-03-17 16:49:56 -07:00
Chris Lu	f4073107cb	fix: clean up orphaned needles on remote.cache partial download failure (#8675 ) When remote.cache downloads a file in parallel chunks and a gRPC connection drops mid-transfer, chunks already written to volume servers were not cleaned up. Since the filer metadata was never updated, these needles became orphaned — invisible to volume.vacuum and never referenced by the filer. On subsequent cache cycles the file was still treated as uncached, creating more orphans each attempt. Call DeleteUncommittedChunks on the download-error path, matching the cleanup already present for the metadata-update-failure path. Fixes #8481	2026-03-17 13:47:54 -07:00
Chris Lu	acea36a181	filer: add conditional update preconditions (#8647 ) * filer: add conditional update preconditions * iceberg: tighten metadata CAS preconditions	2026-03-16 12:33:32 -07:00
Chris Lu	8cde3d4486	Add data file compaction to iceberg maintenance (Phase 2) (#8503 ) * Add iceberg_maintenance plugin worker handler (Phase 1) Implement automated Iceberg table maintenance as a new plugin worker job type. The handler scans S3 table buckets for tables needing maintenance and executes operations in the correct Iceberg order: expire snapshots, remove orphan files, and rewrite manifests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Add data file compaction to iceberg maintenance handler (Phase 2) Implement bin-packing compaction for small Parquet data files: - Enumerate data files from manifests, group by partition - Merge small files using parquet-go (read rows, write merged output) - Create new manifest with ADDED/DELETED/EXISTING entries - Commit new snapshot with compaction metadata Add 'compact' operation to maintenance order (runs before expire_snapshots), configurable via target_file_size_bytes and min_input_files thresholds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Fix memory exhaustion in mergeParquetFiles by processing files sequentially Previously all source Parquet files were loaded into memory simultaneously, risking OOM when a compaction bin contained many small files. Now each file is loaded, its rows are streamed into the output writer, and its data is released before the next file is loaded — keeping peak memory proportional to one input file plus the output buffer. * Validate bucket/namespace/table names against path traversal Reject names containing '..', '/', or '\' in Execute to prevent directory traversal via crafted job parameters. * Add filer address failover in iceberg maintenance handler Try each filer address from cluster context in order instead of only using the first one. This improves resilience when the primary filer is temporarily unreachable. * Add separate MinManifestsToRewrite config for manifest rewrite threshold The rewrite_manifests operation was reusing MinInputFiles (meant for compaction bin file counts) as its manifest count threshold. Add a dedicated MinManifestsToRewrite field with its own config UI section and default value (5) so the two thresholds can be tuned independently. * Fix risky mtime fallback in orphan removal that could delete new files When entry.Attributes is nil, mtime defaulted to Unix epoch (1970), which would always be older than the safety threshold, causing the file to be treated as eligible for deletion. Skip entries with nil Attributes instead, matching the safer logic in operations.go. * Fix undefined function references in iceberg_maintenance_handler.go Use the exported function names (ShouldSkipDetectionByInterval, BuildDetectorActivity, BuildExecutorActivity) matching their definitions in vacuum_handler.go. * Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage The IcebergMaintenanceHandler and its compaction code in the parent pluginworker package duplicated the logic already present in the iceberg/ subpackage (which self-registers via init()). The old code lacked stale-plan guards, proper path normalization, CAS-based xattr updates, and error-returning parseOperations. Since the registry pattern (default "all") makes the old handler unreachable, remove it entirely. All functionality is provided by iceberg.Handler with the reviewed improvements. * Fix MinManifestsToRewrite clamping to match UI minimum of 2 The clamp reset values below 2 to the default of 5, contradicting the UI's advertised MinValue of 2. Clamp to 2 instead. * Sort entries by size descending in splitOversizedBin for better packing Entries were processed in insertion order which is non-deterministic from map iteration. Sorting largest-first before the splitting loop improves bin packing efficiency by filling bins more evenly. * Add context cancellation check to drainReader loop The row-streaming loop in drainReader did not check ctx between iterations, making long compaction merges uncancellable. Check ctx.Done() at the top of each iteration. * Fix splitOversizedBin to always respect targetSize limit The minFiles check in the split condition allowed bins to grow past targetSize when they had fewer than minFiles entries, defeating the OOM protection. Now bins always split at targetSize, and a trailing runt with fewer than minFiles entries is merged into the previous bin. * Add integration tests for iceberg table maintenance plugin worker Tests start a real weed mini cluster, create S3 buckets and Iceberg table metadata via filer gRPC, then exercise the iceberg.Handler operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against the live filer. A full maintenance cycle test runs all operations in sequence and verifies metadata consistency. Also adds exported method wrappers (testing_api.go) so the integration test package can call the unexported handler methods. * Fix splitOversizedBin dropping files and add source path to drainReader errors The runt-merge step could leave leading bins with fewer than minFiles entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop the first 80-byte file). Replace the filter-based approach with an iterative merge that folds any sub-minFiles bin into its smallest neighbor, preserving all eligible files. Also add the source file path to drainReader error messages so callers can identify which Parquet file caused a read/write failure. * Harden integration test error handling - s3put: fail immediately on HTTP 4xx/5xx instead of logging and continuing - lookupEntry: distinguish NotFound (return nil) from unexpected RPC errors (fail the test) - writeOrphan and orphan creation in FullMaintenanceCycle: check CreateEntryResponse.Error in addition to the RPC error * go fmt --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-15 11:27:42 -07:00
Chris Lu	c4d642b8aa	fix(ec): gather shards from all disk locations before rebuild (#8633 ) * fix(ec): gather shards from all disk locations before rebuild (#8631) Fix "too few shards given" error during ec.rebuild on multi-disk volume servers. The root cause has two parts: 1. VolumeEcShardsRebuild only looked at a single disk location for shard files. On multi-disk servers, the existing local shards could be on one disk while copied shards were placed on another, causing the rebuild to see fewer shards than actually available. 2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 && len(vs.store.Locations) > 0) that was always true, making the FindFreeLocation fallback dead code. This meant copies always went to Locations[0] regardless of where existing shards were. Changes: - VolumeEcShardsRebuild now finds the location with the most shards, then gathers shard files from other locations via hard links (or symlinks for cross-device) before rebuilding. Gathered files are cleaned up after rebuild. - VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0 (explicitly set). Otherwise, it prefers the location that already has the EC volume, falling back to HDD then any free location. - generateMissingEcFiles now logs shard counts and provides a clear error message when not enough shards are found, instead of passing through to the opaque reedsolomon "too few shards given" error. * fix(ec): update test to match skip behavior for unrepairable volumes The test expected an error for volumes with insufficient shards, but commit `5acb4578a` changed unrepairable volumes to be skipped with a log message instead of returning an error. Update the test to verify the skip behavior and log output. * fix(ec): address PR review comments - Add comment clarifying DiskId=0 means "not specified" (protobuf default), callers must use DiskId >= 1 to target a specific disk. - Log warnings on cleanup failures for gathered shard links. * fix(ec): read shard files from other disks directly instead of linking Replace the hard link / symlink gathering approach with passing additional search directories into RebuildEcFiles. The rebuild function now opens shard files directly from whichever disk they live on, avoiding filesystem link operations and cleanup. RebuildEcFiles and RebuildEcFilesWithContext gain a variadic additionalDirs parameter (backward compatible with existing callers). * fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment * fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations - generateMissingEcFiles: two-pass approach — first discover present/missing shards and check reconstructability, only then create output files. This avoids leaving behind empty truncated shard files when there are too few shards to rebuild. - VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations. A location with an .ecx file but no shard files (all shards on other disks) is now a valid rebuild candidate instead of being silently skipped. * fix(ec): select ecx-only location as rebuildLocation when none chosen yet When rebuildLocation is nil and a location has hasEcx=true but existingShardCount=0 (all shards on other disks), the condition 0 > 0 was false so it was never promoted to rebuildLocation. Add rebuildLocation == nil to the predicate so the first location with an .ecx file is always selected as a candidate.	2026-03-14 20:59:47 -07:00

1 2 3 4 5 ...

1859 Commits