1859 Commits

Author SHA1 Message Date
Lisandro Pin
93247d6de4 Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.

Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

* Export REST `file_{read,write}_failures` metrics on volume servers.
2026-04-24 11:45:21 -07:00
Chris Lu
3d39324bc1 fix(nfs): make Linux mount -t nfs work without client workaround (#9199) (#9201)
* fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199)

The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and
does not register with portmap. Linux mount.nfs queries portmap on port 111
first, so the plain `mount -t nfs host:/export /mnt` form failed with
"portmap query failed" / "requested NFS version or transport protocol is
not supported" against a default `weed nfs` deployment.

- Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with
  TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and
  proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses.
  Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port.

- New CLI flag `-portmap.bind` (empty, disabled by default) to opt into
  the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE
  and must not collide with a system rpcbind.

- Extended `weed nfs -h` help with the two supported ways to mount from
  Linux (client-side portmap bypass, or server-side `-portmap.bind`).

- Startup log now prints a copy-pasteable mount command tailored to
  whether portmap is enabled.

Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP
round-trip against the real listener.

Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0`
the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock
host:/export /mnt`) now succeeds and both read and write work.

* fix(nfs): harden portmap responder per review feedback (#9201)

Addresses three review findings on the portmap responder:

- parseRPCCall: validate opaque_auth length against the record limit
  before applying the XDR 4-byte padding, so a near-uint32-max authLen
  can no longer overflow (authLen + 3) and bypass the bounds check.
  (gemini-code-assist)

- serveTCP/Close: track live TCP connections and evict them on Close()
  so shutdown does not block on idle clients waiting for the read
  deadline to trip. serveTCP also no longer tears the listener down on
  a non-fatal Accept error (e.g. EMFILE); it logs and retries after a
  small back-off. Replaces the atomic.Bool closed flag with a
  mutex-guarded one so closed, conns, and the shutdown transition stay
  consistent. (coderabbit, minor)

- handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s
  in-flight) so a peer that opens the privileged port 111 and stalls
  cannot pin a goroutine indefinitely. (coderabbit, major)

Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP
connection idle and asserts Close() returns within 2s (well under the
30s idle deadline) and that the client sees the eviction.

All existing tests still pass, including under -race.

* fix(nfs): keep portmap UDP responder alive on transient read errors (#9201)

- serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and
  continue instead of returning. Matches how serveTCP now treats
  non-fatal Accept errors so a transient network blip doesn't take
  UDP portmap down until restart. (coderabbit)

- Rename portmapAcceptBackoff -> portmapRetryBackoff now that both
  paths use it.

- pmapProcDump: fix the pre-allocation capacity to match the actual
  encoding (20 bytes per entry + 4-byte terminator), replacing the
  old over-estimate of 24 per entry. No behavior change; just
  documents intent. (coderabbit nit)

* docs(nfs): clarify encodeAcceptedReply body semantics (#9201)

The prior comment said body is "nil when the accept_stat is itself an
error", which was misleading: the PROG_MISMATCH branch already passes
an 8-byte mismatch_info body. Rewrite to enumerate which error
accept_stat values omit the body and call out PROG_MISMATCH as the
exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit)

* fix(nfs): make portmap retry backoff interruptible by Close() (#9201)

serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a
non-fatal listener error. If Close() races in during that sleep, the
goroutine can't be interrupted, so Close() has to wait out the
remaining backoff before wg.Wait() returns.

Add a done channel that Close() closes once, and replace both
time.Sleep calls with a select on ps.done + time.After. The window
was tiny in practice but the select makes shutdown strictly bounded
by Close()'s own work. (coderabbit nit)
2026-04-23 13:53:53 -07:00
steve.wei
1a7ab2ea82 fix(upload): keep Content-MD5 on 204 unchanged writes (#9198)
Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved.
2026-04-23 10:59:59 -07:00
Chris Lu
592d6d6021 fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) (#9193)
* fix(filer/remote): keep re-cache work alive past caller cancellation (#9174)

For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot
finish before the S3 gateway's initial cache wait elapses. When it
does, the gRPC ctx cancellation cascades into the filer's chunk
downloads, the error path calls DeleteUncommittedChunks on every chunk
already written, and the next retry starts over. boto3 splitting the
GET into concurrent ranges (or any client tear-down on first failure)
shortens the window between retries, so the loop never converges.

Detach the caller's ctx with context.WithoutCancel before invoking
the singleflight work so the download runs to completion regardless
of client cancellations. Subsequent waiters — via the in-flight
singleflight, or a fresh retry landing after completion — observe the
cached entry and stream normally.

Same detach pattern is used in filer_server_handlers_write.go:53 and
volume_server_handlers_write.go:51.

* simplify rationale comment

* switch to DoChan so handler can return on caller cancel

Do keeps the handler goroutine blocked for the full detached download
even after the client is gone. DoChan lets the handler select on
ctx.Done() and exit immediately; the singleflight goroutine continues
on bgCtx and the next request either joins it or finds the entry
cached.
2026-04-22 17:56:15 -07:00
Chris Lu
f438cc3544 fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard

ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.

Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.

* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard

ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.

The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.

Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
  attempts an overwrite, asserts the error response and that the
  on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
  contract that a first write to a target still succeeds.

* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard

Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
2026-04-22 16:47:01 -07:00
Lisandro Pin
fff243d463 Export gRPC file_{read,write}_failures metrics on volume servers. (#9177)
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-22 11:22:21 -07:00
Chris Lu
c4e1885053 fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug

The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.

Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.

* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement

Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).

Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
  VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
  shard; sendShardFileToDestination forwards that disk id. Metadata
  files (ecx/ecj/vif) inherit the disk of the first data shard
  targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
  validation; disk_id=0 (unset) falls back to the same
  auto-selection pattern as VolumeEcShardsCopy (prefer disk that
  already has shards for this volume, then any HDD with free space,
  then any location with free space).

Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
  disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
  disk_id rejection path.

* fix(volume-rust): honor disk_id in ReceiveFile for EC shards

Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).

* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow

On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.

* test(ec): fail invalid-disk_id test on transport error

Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.

* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test

Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.

* fix(test): preserve baseDir/volume path for single-dir clusters

StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
2026-04-22 10:30:13 -07:00
Chris Lu
7f67995c24 chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was
gated behind -mount.p2p (default true). Idle cost is negligible — a
tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the
surface area.

Removes the flag from weed filer, weed server (-filer.mount.p2p), and
weed mini, and always constructs the registry in NewFilerServer. Also
drops the now-dead nil guards in MountRegister/MountList/sweeper and
the TestMountRegister_DisabledIsNoOp case.
2026-04-21 23:00:11 -07:00
Chris Lu
c40db5a52d perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)
* perf(filer): parallelize StreamMutateEntry with path-keyed scheduler

The server handler processed one mutation at a time per stream, capping a
mount's aggregate throughput at ~1/filer_store_service_time regardless of
client concurrency (see issue #9138). With 12 rclone processes this showed
as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via
unary CreateEntry.

Replace the serial for-loop with a per-request goroutine admitted by a
path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor
(weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind
taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier
and descendant-barrier rules. Cross-path mutations run in parallel; same-
path mutations serialize on arrival order; recursive delete and directory
rename act as subtree barriers; directory attribute bumps stay non-barrier
so they do not serialize file writes under them.

Correctness and safety:
- Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource
  use from a single noisy mount.
- syncStream wrapper serializes stream.Send across worker goroutines (gRPC
  Send is not concurrent-safe).
- Handler waits on in-flight workers before returning on recv EOF/error so
  no worker writes to a torn-down stream.
- First fatal Send error from any worker propagates as the handler's
  return, causing the stream to tear down.

Benchmark (2 ms simulated filer-store service delay, 12 client workers):
  serial    : 440 QPS
  sem only  : 4902 QPS (unsafe — reorders same-path ops)
  scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct)

The sem-only number shows the upper bound of raw parallelism; the
scheduler matches it on distinct paths (the realistic 12-rclone case) and
correctly falls back to serial when the workload demands ordering. Peak
concurrent mutations at the handler equals client worker count on the
distinct-path workload and pins to 1 on the same-path workload, as the
scheduler intends.

* perf(filer): decouple StreamMutateEntry admission from receive loop

The previous StreamMutateEntry handler called sched.Admit directly in the
Recv loop. A single request conflicting on path /hot would head-of-line
block stream.Recv, so later requests targeting unrelated paths could not
be received or admitted until /hot drained — cross-path parallelism then
depended on request ordering instead of being a property of the scheduler.

Spawn the worker goroutine immediately on Recv and move sched.Admit into
that goroutine. A new streamMutatePendingLimit (1024) caps total per-
stream outstanding goroutines (pending + active) so a client flooding a
conflicted path cannot explode goroutine count without bound.

Addresses #9171 review comment (coderabbitai, Major).

* fix(filer): reply with EINVAL on unknown StreamMutateEntry request type

Returning nil when req.Request is a future oneof variant or a malformed
request left the client's per-RequestId waiter blocked forever, because
no response was ever sent for that id. Reply with IsLast=true and EINVAL
so the waiter completes with a well-formed error.

Addresses #9171 review comments (gemini-code-assist, coderabbitai).

* fix(filer): make classifyMutation crash-free and correct for deletes

Two issues addressed together because they share one function:

1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without
   a nil guard; an empty create_request / update_request / rename_request
   from a misbehaving client crashed the scheduler. Guard each oneof
   variant and fall back to a "/" barrier; the handler then sends EINVAL
   via the unknown-request path.

2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry-
   Request does not carry IsDirectory, so the previous kindMutateFile
   classification for non-recursive deletes did not conflict with an in-
   flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same
   path — a race in scheduler terms. Classify every delete as
   kindMutateBarrierDir regardless of IsRecursive. The incremental cost
   of a descendant-wait for a non-recursive delete of a non-empty dir is
   negligible since that call fails at the store anyway.

Adds classifyMutation tests for malformed create/update, empty oneof,
and updates the delete-non-recursive case to the new expected kind.

Addresses #9171 review comments (coderabbitai Critical, Major).

* fix(filer): route renameStreamProxy.SendMsg through the wrapping Send

The default pass-through SendMsg on renameStreamProxy bypassed the
syncStream mutex and the StreamMutateEntryResponse wrapping: anything
the rename helpers happened to push via SendMsg would have been emitted
on the wire as the wrong protobuf type and could interleave with other
workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry
Recv loop and could steal unrelated mutation requests.

Route SendMsg through the wrapping Send (rejecting other payload types)
and fail RecvMsg explicitly — the rename logic is a strictly server-push
stream and never calls RecvMsg, so loud failure is safer than silent
stealing.

Addresses #9171 review comment (coderabbitai, Major).

* test(filer): run exactly ops in stream-mutate workloads

perGoroutine := ops / concurrency silently truncated the total when the
values were not divisible — e.g. 2400 ops with 64 workers actually ran
2368 and with 256 workers ran 2304, making the logged "ops per run"
inaccurate and introducing measurement noise that varied across the
concurrency sweep.

Introduce opsForWorker(g, concurrency, ops) which distributes the
remainder to the first (ops % concurrency) workers so the three
workloads (unary, stream sync, stream async) each dispatch exactly
`ops` operations. No changes to the timing methodology.

Addresses #9171 review comment (coderabbitai, Minor).

* fix(filer): enforce per-path FIFO admission in mutateScheduler

sync.Cond.Broadcast wakes every waiter; the first to re-acquire the
mutex wins, so two conflicting same-path admissions could be reordered
by the Go runtime even though they arrived serially on the stream. A
single stream is supposed to carry ordered mutations — the PR's original
#8770 claim — so admission must be FIFO per path.

Replace the single cond with a per-path FIFO queue. Each Admit enqueues
a waiter on every path it touches (primary, and on rename the secondary
too) and blocks on a ready channel. tryPromoteLocked admits any waiter
that is at the head of every queue it joined, passes pathConflictsLocked
against the active-state indexes, and is under concurrencyLimit. Done
removes the heads and re-runs tryPromoteLocked so waiters freed by the
completion move in arrival order.

Side effect: two non-barrier directory updates on the same path now
serialize instead of overlapping. filer.sync's MetadataProcessor
intentionally allows them to overlap because its events come from a
committed log where last-writer-wins coalescing is safe; streamed
mutations carry client operations whose order matters, so we drop that
optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier
release) and TestAdmitSamePathNonBarrierSerializes to cover both.

Also refreshed the kindMutateFile doc comment that still referenced the
pre-#1ecf805f5 "non-recursive delete" classification.

Addresses #9171 review comments (coderabbitai Critical, Minor).

* test(filer): make TestAdmitSamePathFIFO deterministic without sleeps

The previous arrival-ordering sync (send to `started` before calling
Admit, plus a 1 ms sleep) relied on the goroutine actually entering
Admit and reaching the per-path queue during that sleep. Under -race on
a loaded CI that is a real flake source, which is ironic for a test
whose job is catching non-deterministic wake-ups.

Observe the scheduler's own pathQueue length between spawns instead —
waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected
number of waiters (1 barrier holder + i+1 file waiters) is enqueued.
That's the exact event the test wants to synchronise on, so there is no
fudge factor. Verified by `go test -race -count=5`.

Addresses #9171 review comment (coderabbitai, Minor).
2026-04-21 11:25:09 -07:00
Chris Lu
141413ad76 fix(tests): make tests pass on 32-bit architectures (#9168) (#9170)
Two separate failures reported on 32-bit builds (void-linux 4.21):

- weed/server: errorStreamImpl.count (and the same pattern in slowStream
  plus local totalEventsSent/totalSends) was a bare int64 sitting after
  smaller fields, so on 386/ARMv7/mips32 it landed at a 4-byte-aligned
  offset and atomic.AddInt64 panicked with "unaligned 64-bit atomic
  operation". Switched the counters to atomic.Int64, which Go guarantees
  is 8-byte aligned on every architecture.

- weed/plugin/worker/iceberg: three equality-delete tests fail on 32-bit
  because the upstream github.com/apache/iceberg-go declares
  manifestEntry.EqualityIDs as *[]int while the Iceberg Avro schema
  defines equality_ids as long, and hamba/avro refuses to map Go int
  onto Avro long when int is 32-bit. Not fixable in seaweedfs, so guard
  the affected tests with a t.Skip() when unsafe.Sizeof(int) < 8 until
  the upstream type is changed to []int32/[]int64.
2026-04-20 22:48:01 -07:00
Chris Lu
e24a443b17 peer chunk sharing 2/8: filer mount registry (#9131)
* proto: define MountRegister/MountList and MountPeer service

Adds the wire types for peer chunk sharing between weed mount clients:

* filer.proto: MountRegister / MountList RPCs so each mount can heartbeat
  its peer-serve address into a filer-hosted registry, and refresh the
  list of peers. Tiny payload; the filer stores only O(fleet_size) state.

* mount_peer.proto (new): ChunkAnnounce / ChunkLookup RPCs for the
  mount-to-mount chunk directory. Each fid's directory entry lives on
  an HRW-assigned mount; announces and lookups route to that mount.

No behavior yet — later PRs wire the RPCs into the filer and mount.
See design-weed-mount-peer-chunk-sharing.md for the full design.

* filer: add mount-server registry behind -peer.registry.enable

Implements tier 1 of the peer chunk sharing design: an in-memory registry
of live weed mount servers, keyed by peer address, refreshed by
MountRegister heartbeats and served by MountList.

* weed/filer/peer_registry.go: thread-safe map with TTL eviction; lazy
  sweep on List plus a background sweeper goroutine for bounded memory.

* weed/server/filer_grpc_server_peer.go: MountRegister / MountList RPC
  handlers. When -peer.registry.enable is false (the default), both RPCs
  are silent no-ops so probing older filers is harmless.

* -peer.registry.enable flag on weed filer; FilerOption.PeerRegistryEnabled
  wires it through.

Phase 1 is single-filer (no cross-filer replication of the registry);
mounts that fail over to another filer will re-register on the next
heartbeat, so the registry self-heals within one TTL cycle.

Part of the peer-chunk-sharing design; no behavior change at runtime
until a later PR enables the flag on both filer and mount.

* filer: nil-safe peerRegistryEnable + registry hardening

Addresses review feedback on PR #9131.

* Fix: nil pointer deref in the mini cluster. FilerOptions instances
  constructed outside weed/command/filer.go (e.g. miniFilerOptions in
  mini.go) do not populate peerRegistryEnable, so dereferencing the
  pointer panics at Filer startup. Use the same
  `nil && deref` idiom already used for distributedLock / writebackCache.

* Hardening (gemini review): registry now enforces three invariants:
  - empty peer_addr is silently rejected (no client-controlled sentinel
    mass-inserts)
  - TTL is capped at 1 hour so a runaway client cannot pin entries
  - new-entry count is capped at 10000 to bound memory; renewals of
    existing entries are always honored, so a full registry still
    heartbeats its existing members correctly

Covered by new unit tests.

* filer: rename -peer.registry.enable flag to -mount.p2p

Per review feedback: the old name "peer.registry.enable" leaked
the implementation ("registry") into the CLI surface. "mount.p2p"
is shorter and describes what it actually controls — whether this
filer participates in mount-to-mount peer chunk sharing.

Flag renames (all three keep default=true, idle cost is near-zero):
  -peer.registry.enable        ->  -mount.p2p         (weed filer)
  -filer.peer.registry.enable  ->  -filer.mount.p2p   (weed mini, weed server)

Internal variable names (mountPeerRegistryEnable, MountPeerRegistry)
keep their longer form — they describe the component, not the knob.

* filer: MountList returns DataCenter + List uses RLock

Two review follow-ups on the mount peer registry:

* weed/server/filer_grpc_server_mount_peer.go: MountList was dropping
  the DataCenter on the wire. The whole point of carrying DC separately
  from Rack is letting the mount-side fetcher re-rank peers by the
  two-level locality hierarchy (same-rack > same-DC > cross-DC); without
  DC in the response every remote peer collapsed to "unknown locality."

* weed/filer/mount_peer_registry.go: List() was taking a write lock so
  it could lazy-delete expired entries inline. But MountList is a
  read-heavy RPC hit on every mount's 30 s refresh loop, and Sweep is
  already wired as the sole reclamation path (same pattern as the
  mount-side PeerDirectory). Switch List to RLock + filter, let Sweep
  do the map mutation, so concurrent MountList callers don't serialize
  on each other.

Test updated to reflect the new contract (List no longer mutates the
map; Sweep is what drops expired entries).
2026-04-18 20:03:23 -07:00
Lisandro Pin
6bcacedda9 Export master_disconnections metrics on volume servers. (#9104)
This allows to track connection issues and master failovers in real time via Prometheus metrics.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-17 15:15:26 -07:00
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Chris Lu
08d9193fe1 [nfs] Add NFS (#9067)
* add filer inode foundation for nfs

* nfs command skeleton

* add filer inode index foundation for nfs

* make nfs inode index hardlink aware

* add nfs filehandle and inode lookup plumbing

* add read-only nfs frontend foundation

* add nfs namespace mutation support

* add chunk-backed nfs write path

* add nfs protocol integration tests

* add stale handle nfs coverage

* complete nfs hardlink and failover coverage

* add nfs export access controls

* add nfs metadata cache invalidation

* fix nfs chunk read lookup routing

* fix nfs review findings and rename regression

* address pr 9067 review comments

- filer_inode: fail fast if the snowflake sequencer cannot start, and let
  operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
  to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
  a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
  after CommitTransaction so a rolled-back rename does not strand live
  metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
  filer.path, so the experimental server does not expose the entire
  filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
  layout rather than RFC 1813 LINK3args

* mount: pre-allocate inode in Mkdir and Symlink

Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.

createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.

* sequence: mask snowflake node id on int→uint32 conversion

CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.

* add test/nfs integration suite

Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.

Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.

Framework notes:

- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
  -port.grpc explicitly so the default port+10000 convention cannot
  overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
  starting the NFS server — the NFS server's ensureIndexedEntry check
  requires the export root to exist with a real entry, which filer.Root
  does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
  not try to re-dial via portmapper (which concatenates ":111" onto the
  address).

* ci: add NFS integration test workflow

Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.

* nfs: use append for buffer growth in Write and Truncate

The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.

* filer: honor caller cancellation in collectInodeIndexEntries

Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.

* nfs: skip filer read on open when O_TRUNC is set

openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.

* nfs: allow Seek on O_APPEND files and document buffered write cap

Two related cleanups on filesystem.go:

- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
  Seek error ("append-only file descriptors may only seek to EOF")
  prevented read-and-write workloads that legitimately reposition the
  read cursor. Write already snaps the offset to EOF before persisting
  (see seaweedFile Write), so Seek can unconditionally accept any
  offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
  per-file ceiling, the memory footprint it implies, and that the real
  fix for larger whole-file rewrites is streaming / multi-chunk support.

Both changes flagged by gemini-code-assist on PR 9067.

* nfs: guard offset before casting to int in Write

CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.

* nfs: compute Write endOffset in int64 to satisfy CodeQL

The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.

* ci: drop emojis from nfs-tests workflow summary

Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.

* nfs: annotate remaining DEV_PLAN TODOs with status

Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:

- Reuse chunk cache and mutation stream helpers without FUSE deps:
  checked off — the NFS server imports weed/filer.ReaderCache and
  weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
  as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
  annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
  go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
  "Current Blockers" note.

* test/nfs: share port+readiness helpers with test/testutil

Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.

Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.

* nfs: stream writes to volume servers instead of buffering the whole file

Before this change the NFS write path held the full contents of every
writable open in memory:

  - OpenFile(write) called loadWritableContent which read the existing
    file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
  - each Write() extended content in-place
  - Close() uploaded the whole buffer as a single chunk via
    persistContent + AssignVolume

The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:

  - openFile(write) no longer loads the existing content at all; it
    only issues an UpdateEntry when O_TRUNC is set *and* the file is
    non-empty (so a fresh create+trunc is still zero-RPC)
  - Write() streams the caller's bytes straight to a volume server via
    one AssignVolume + one chunk upload, then atomically appends the
    resulting chunk to the filer entry through mutateEntry. Any
    previously inlined entry.Content is migrated to a chunk in the same
    update so the chunk list becomes the authoritative representation.
  - Truncate() becomes a direct mutateEntry (drop chunks past the new
    size, clip inline content, update FileSize) instead of resizing an
    in-memory buffer.
  - Close() is a no-op because everything was flushed inline.

The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.

Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).

* filer: extract shared gateway upload helper for NFS and WebDAV

Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.

- NFS's saveDataAsChunk is now a thin adapter that assembles the
  GatewayChunkUploadRequest from server options and calls the helper.
  The chunkUploader interface keeps working for test injection because
  the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
  local operation.NewUploader call plus the AssignVolume/UploadOption
  scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
  features that do not fit the shared helper (a pre-allocated file-id
  pool used to skip AssignVolume entirely, and a chunkCache
  write-through at offset 0 so future reads hit the mount's local
  cache), both of which are mount-specific.

Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.

* nfs: remove DESIGN.md and DEV_PLAN.md

The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
2026-04-14 20:48:24 -07:00
Lisandro Pin
67a2810d2d Export start_time_seconds metrics on both master & volume servers. (#9046)
These are to be used to track uptimes.

See https://github.com/seaweedfs/seaweedfs/issues/8535 for details.

Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
2026-04-13 09:34:08 -07:00
Chris Lu
edf7d2a074 fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039)
* fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035)

Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to
readPersistedLogBufferPosition, causing LoopProcessLogData to call
ReadPersistedLogBuffer on every 250ms health-check tick when a
subscriber encounters ResumeFromDiskError.  Each call creates an
OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a
readahead goroutine with a 1024-element channel, finds no data, and
returns — 4 times per second even on an idle filer.

This is redundant because SubscribeLocalMetadata already manages disk
reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs
tracking in the outer loop.

Set ReadFromDiskFn back to nil for LocalMetaLogBuffer.  When
LoopProcessLogData encounters ResumeFromDiskError with nil
ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the
caller (SubscribeLocalMetadata), which blocks efficiently on
listenersCond.Wait() instead of polling.

* fix(filer): add gap detection for slow consumers after disk-read stall

When a slow consumer falls behind and LoopProcessLogData returns
ResumeFromDiskError with no flush or read-position progress, there may
be a gap between persisted data and in-memory data (e.g. writes stopped
while consumer was still catching up). Without this, the consumer would
block on listenersCond.Wait() forever.

Skip forward to the earliest in-memory time to resume progress, matching
the gap-handling pattern already used in the shouldReadFromDisk path.

* fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall

The gap-detection block added in the previous commit skips lastReadTime
forward to GetEarliestTime() and continues the outer loop.  On the next
iteration, shouldReadFromDisk becomes true (currentReadTsNs >
lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the
existing gap handler at the top of the loop runs its own gap check.
That check uses readInMemoryLogErr == ResumeFromDiskError as the entry
condition — but readInMemoryLogErr is still the stale error from two
iterations ago.  GetEarliestTime() now equals lastReadTime.Time (we
already advanced to it), so earliestTime.After(lastReadTime.Time) is
false and the handler falls into listenersCond.Wait() — stuck.

Clear readInMemoryLogErr at the gap-skip point, matching the existing
pattern at the earlier gap handler that already clears it for the same
reason.

* fix(log_buffer): GetEarliestTime must include sealed prev buffers

GetEarliestTime previously returned only logBuffer.startTime (the active
buffer's first timestamp).  That is narrower than ReadFromBuffer's
tsMemory, which is the min across active + prev buffers.  Callers using
GetEarliestTime for gap detection after ResumeFromDiskError (the
SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in
the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time
that was *newer* than the real earliest in-memory data.

Impact in SubscribeLocalMetadata's slow-consumer path:
  - tsMemory = earliest prev buffer time (T_prev)
  - GetEarliestTime() = active startTime (T_active, later than T_prev)
  - Consumer position = T1, with T_prev < T1 < T_active
  - ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory)
  - Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true
  - Skip forward to T_active -- silently drops the prev-buffer data
  - And when T_active happens to equal the stuck position, gap detect
    evaluates false, and the subscriber stalls on listenersCond.Wait()

This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing
failure in CI where the consumer stalled at 10220/20000 after writing
stopped -- the buffer still had data in prev[0..3], but gap detection
was comparing against the active buffer's startTime.

Fix: scan all sealed prev buffers under RLock, return the true minimum
startTime.  Matches the min-of-buffers logic in ReadFromBuffer.

* test(log_buffer): make DiskReadRetry test deterministic

The previous test added the message via AddToBuffer + ForceFlush and
relied on a race: the second disk read had to happen before the data
was delivered through the in-memory path.  Under the race detector or
on a slow CI runner, the reader is woken by AddToBuffer's notification,
finds the data in the active buffer or its prev slot, and returns after
exactly one disk read — failing the >= 2 disk reads assertion even
though the loop behaved correctly.

Reproduced on master with race detector (2/5 failures).

Rewrite the test to deliver the data exclusively through the disk-read
path: no AddToBuffer, no ForceFlush.  The test waits until the reader
has issued at least one no-op disk read, then atomically flips a
"dataReady" flag.  The reader's next iteration through readFromDiskFn
returns the entry.  This deterministically exercises the retry-loop
behavior the test was originally written to protect, and removes the
in-memory delivery race entirely.
2026-04-11 23:12:54 -07:00
os-pradipbabar
9cae95d749 fix(filer): prevent data corruption during graceful shutdown (#9037)
* fix: wait for in-flight uploads to complete before filer shutdown

Prevents data corruption when SIGTERM is received during active uploads.
The filer now waits for all in-flight operations to complete before
calling the underlying shutdown logic.

This affects all deployment types (Kubernetes, Docker, systemd) and
fixes corruption issues during rolling updates, certificate rotation,
and manual restarts.

Changes:
- Add FilerServer.Shutdown() method with upload wait logic
- Update grace.OnInterrupt hook to use new shutdown method

Fixes data corruption reported by production users during pod restarts.

* fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete

* fix: address review comments on graceful shutdown

- Add 10s timeout to gRPC GracefulStop to prevent indefinite blocking
  from long-lived streams (falls back to Stop on timeout)
- Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within
  Kubernetes default 30s termination grace period
- Move fs.Shutdown() (database close) after Serve() returns instead
  of a separate hook to eliminate race where main goroutine exits
  before the shutdown hook runs

* fix: shut down all HTTP servers before filer database close

Address remaining review comments:
- Shut down auxiliary HTTP servers (Unix socket, local listener) during
  graceful shutdown so they can't serve write traffic after the main
  server stops
- Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it
  completes before os.Exit(0), fixing the race between the grace
  goroutine and the main goroutine
- Use sync.Once to ensure fs.Shutdown() runs exactly once regardless
  of whether shutdown is signal-driven or context-driven (MiniCluster)

---------

Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-11 21:18:22 -07:00
Chris Lu
b37bbf541a feat(master): drain pending size before marking volume readonly (#9036)
* feat(master): drain pending size before marking volume readonly

When vacuum, volume move, or EC encoding marks a volume readonly,
in-flight assigned bytes may still be pending. This adds a drain step:
immediately remove from writable list (stop new assigns), then wait
for pending to decay below 4MB or 30s timeout.

- Add volumeSizeTracking struct consolidating effectiveSize,
  reportedSize, and compactRevision into a single map
- Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable,
  DrainAndSetVolumeReadOnly to VolumeLayout
- UpdateVolumeSize detects compaction via compactRevision change and
  resets effectiveSize instead of decaying
- Wire drain into vacuum (topology_vacuum.go) and volume mark readonly
  (master_grpc_server_volume.go)

* fix: use 2MB pending size drain threshold

* fix: check crowded state on initial UpdateVolumeSize registration

* fix: respect context cancellation in drain, relax test timing

- DrainAndSetVolumeReadOnly now accepts context.Context and returns
  early on cancellation (for gRPC handler timeout/cancel)
- waitForPendingDrain uses select on ctx.Done instead of time.Sleep
- Increase concurrent heartbeat test timeout from 10s to 15s for CI

* fix: use time-based dedup so decay runs even when reported size is unchanged

The value-based dedup (same reportedSize + compactRevision = skip) prevented
decay from running when pending bytes existed but no writes had landed on
disk yet. The reported size stayed the same across heartbeats, so the excess
never decayed.

Fix: dedup replicas within the same heartbeat cycle using a 2-second time
window instead of comparing values. This allows decay to run once per
heartbeat cycle even when the reported size is unchanged.

Also confirmed finding 1 (draining re-add race) is a false positive:
- Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes
- Move/EC: readonlyVolumes flag prevents re-adding during drain

* fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout

The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up
to 30s waiting for pending bytes to decay. In integration tests (and
real clusters during EC encoding), this caused timeouts because multiple
volumes are marked readonly sequentially and heartbeats may not arrive
fast enough to decay pending within the drain window.

Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops
new assigns) and only logs a warning if pending bytes remain. The drain
wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs
inside the master's own goroutine pool.

Remove DrainAndSetVolumeReadOnly as it's no longer used.

* fix: relax test timing, rename test, add post-condition assert

* test: add vacuum integration tests with CI workflow

Full-cluster integration test for vacuum, modeled on the EC integration
tests. Starts a real master + 2 volume servers, uploads data, deletes
entries to create garbage, runs volume.vacuum via shell command, and
verifies garbage cleanup and data integrity.

Test flow:
1. Start cluster (master + 2 volume servers)
2. Upload 10 files to create volume with data
3. Delete 5 files to create ~50% garbage
4. Verify garbage ratio > 10%
5. Run volume.vacuum command
6. Verify garbage cleaned up
7. Verify remaining 5 files are still accessible

CI workflow runs on push/PR to master with 15-minute timeout.
Log collection on failure via artifact upload.

* fix: use 500KB files and delete 75% to exceed vacuum garbage threshold

* fix: add shell lock before vacuum command, fix compilation error

* fix: strengthen vacuum integration test assertions

- waitForServer: use net.DialTimeout instead of grpc.NewClient for
  real TCP readiness check
- verify_garbage_before_vacuum: t.Fatal instead of warning when no
  garbage detected
- verify_cleanup_after_vacuum: t.Fatal if no server reported the
  volume or cleanup wasn't verified
- verify_remaining_data: read actual file contents via HTTP and
  compare byte-for-byte against original uploaded payloads

* fix: use http.Client with timeout and close body before retry
2026-04-11 18:29:11 -07:00
Chris Lu
10b0bdce02 feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment

Add expected_data_size field to AssignRequest (master proto) and
AssignVolumeRequest (filer proto) so clients can hint how large the
data will be. The master uses this instead of the 1MB default when
tracking pending volume sizes for weighted assignment.

- Add expected_data_size to master.proto AssignRequest
- Add expected_data_size to filer.proto AssignVolumeRequest
- Wire through filer AssignVolume handler
- Wire through HTTP submit handler (uses actual upload size)
- Add ExpectedDataSize to VolumeAssignRequest in operation package
- Topology.PickForWrite accepts optional expectedDataSize parameter

* fix: guard integer conversions in expected_data_size path

- common.go: clamp OriginalDataSize to non-negative before uint64 cast
- topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast

* fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize

- HTTP /dir/assign now parses optional "dataSize" query parameter
  and passes it to PickForWrite instead of hardcoded 0
- Add test assertion for PickForWrite with non-zero expectedDataSize
2026-04-11 11:30:47 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
6f036c7015 fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock (#8998)
* fix(master): skip redundant DoJoinCommand on resumeState to prevent deadlock

When fastResume is active (single-master + resumeState + non-empty log),
the raft server becomes leader within ~1ms. DoJoinCommand then enters
the leaderLoop's processCommand path, which calls setCommitIndex to
commit all pending entries. The goraft setCommitIndex implementation
returns early when it encounters a JoinCommand entry (to recalculate
quorum), which can prevent the new entry's event channel from being
notified — leaving DoJoinCommand blocked forever.

Each restart appends a new raft:join entry to the log, while the conf
file's commitIndex (only persisted on AddPeer) lags behind. After 3-4
restarts the uncommitted range contains old JoinCommand entries that
trigger the early return before the new entry is reached.

Fix: skip DoJoinCommand when the raft log already has entries (the
server was already joined in a previous run). The fastResume mechanism
handles leader election independently.

* fix(master): handle Hashicorp Raft in HasExistingState

Add Hashicorp Raft support to HasExistingState by checking
AppliedIndex, consistent with how other RaftServer methods
handle both raft implementations.

* fix(master): use LastIndex() instead of AppliedIndex() for Hashicorp Raft

AppliedIndex() reflects in-memory FSM state which starts at 0 before
log replay completes. LastIndex() reads from persisted stable storage,
correctly mirroring the non-Hashicorp IsLogEmpty() check.
2026-04-08 21:08:50 -07:00
Lars Lehtonen
8edadf7f4a chore(weed/server): prune unused unexported struct fields (#8980) 2026-04-07 21:24:30 -07:00
Chris Lu
940eed0bd3 fix(ec): generate .ecx before EC shards to prevent data inconsistency (#8972)
* fix(ec): generate .ecx before EC shards to prevent data inconsistency

In VolumeEcShardsGenerate, the .ecx index was generated from .idx AFTER
the EC shards were generated from .dat. If any write occurred between
these two steps (e.g. WriteNeedleBlob during replica sync, which bypasses
the read-only check), the .ecx would contain entries pointing to data
that doesn't exist in the EC shards, causing "shard too short" and
"size mismatch" errors on subsequent reads and scrubs.

Fix by generating .ecx FIRST, then snapshotting datFileSize, then
encoding EC shards. If a write sneaks in after .ecx generation, the
EC shards contain more data than .ecx references — which is harmless
(the extra data is simply not indexed).

Also snapshot datFileSize before EC encoding to ensure the .vif
reflects the same .dat state that .ecx was generated from.

Add TestEcConsistency_WritesBetweenEncodeAndEcx that reproduces the
race condition by appending data between EC encoding and .ecx generation.

* fix: pass actual offset to ReadBytes, improve test quality

- Pass offset.ToActualOffset() to ReadBytes instead of 0 to preserve
  correct error metrics and error messages within ReadBytes
- Handle Stat() error in assembleFromIntervalsAllowError
- Rename TestEcConsistency_DatFileGrowsDuringEncoding to
  TestEcConsistency_ExactLargeRowEncoding (test verifies fixed-size
  encoding, not concurrent growth)
- Update test comment to clarify it reproduces the old buggy sequence
- Fix verification loop to advance by readSize for full data coverage

* fix(ec): add dat/idx consistency check in worker EC encoding

The erasure_coding worker copies .dat and .idx as separate network
transfers. If a write lands on the source between these copies, the
.idx may have entries pointing past the end of .dat, leading to EC
volumes with .ecx entries that reference non-existent shard data.

Add verifyDatIdxConsistency() that walks the .idx and verifies no
entry's offset+size exceeds the .dat file size. This fails the EC
task early with a clear error instead of silently producing corrupt
EC volumes.

* test(ec): add integration test verifying .ecx/.ecd consistency

TestEcIndexConsistencyAfterEncode uploads multiple needles of varying
sizes (14B to 256KB), EC-encodes the volume, mounts data shards, then
reads every needle back via the EC read path and verifies payload
correctness. This catches any inconsistency between .ecx index entries
and EC shard data.

* fix(test): account for needle overhead in test volume fixture

WriteTestVolumeFiles created a .dat of exactly datSize bytes but the
.idx entry claimed a needle of that same size. GetActualSize adds
header + checksum + timestamp overhead, so the consistency check
correctly rejects this as the needle extends past the .dat file.

Fix by sizing the .dat to GetActualSize(datSize) so the .idx entry
is consistent with the .dat contents.

* fix(test): remove flaky shard ID assertion in EC scrub test

When shard 0 is truncated on disk after mount, the volume server may
detect corruption via parity mismatches (shards 10-13) rather than a
direct read failure on shard 0, depending on OS caching/mmap behavior.
Replace the brittle shard-0-specific check with a volume ID validation.

* fix(test): close upload response bodies and tighten file count assertion

Wrap UploadBytes calls with ReadAllAndClose to prevent connection/fd
leaks during test execution. Also tighten TotalFiles check from >= 1
to == 1 since ecSetup uploads exactly one file.
2026-04-07 19:05:36 -07:00
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
Chris Lu
4efe0acaf5 fix(master): fast resume state and default resumeState to true (#8925)
* fix(master): fast resume state and default resumeState to true

When resumeState is enabled in single-master mode, the raft server had
existing log entries so the self-join path couldn't promote to leader.
The server waited the full election timeout (10-20s) before self-electing.

Fix by temporarily setting election timeout to 1ms before Start() when
in single-master + resumeState mode with existing log, then restoring
the original timeout after leader election. This makes resume near-instant.

Also change the default for resumeState from false to true across all
CLI commands (master, mini, server) so state is preserved by default.

* fix(master): prevent fastResume goroutine from hanging forever

Use defer to guarantee election timeout is always restored, and bound
the polling loop with a timeout so it cannot spin indefinitely if
leader election never succeeds.

* fix(master): use ticker instead of time.After in fastResume polling loop
2026-04-04 14:15:56 -07:00
Chris Lu
896114d330 fix(admin): fix master leader link showing incorrect port in Admin UI (#8924)
fix(admin): use gRPC address for current server in RaftListClusterServers

The old Raft implementation was returning the HTTP address
(ms.option.Master) for the current server, while peers used gRPC
addresses (peer.ConnectionString). The Admin UI's GetClusterMasters()
converts all addresses from gRPC to HTTP via GrpcAddressToServerAddress
(port - 10000), which produced a negative port (-667) for the current
server since its address was already in HTTP format (port 9333).

Use ToGrpcAddress() for consistency with both HashicorpRaft (which
stores gRPC addresses) and old Raft peers.

Fixes #8921
2026-04-04 11:50:43 -07:00
Chris Lu
d1823d3784 fix(s3): include static identities in listing operations (#8903)
* fix(s3): include static identities in listing operations

Static identities loaded from -s3.config file were only stored in the
S3 API server's in-memory state. Listing operations (s3.configure shell
command, aws iam list-users) queried the credential manager which only
returned dynamic identities from the backend store.

Register static identities with the credential manager after loading
so they are included in LoadConfiguration and ListUsers results, and
filtered out before SaveConfiguration to avoid persisting them to the
dynamic store.

Fixes https://github.com/seaweedfs/seaweedfs/discussions/8896

* fix: avoid mutating caller's config and defensive copies

- SaveConfiguration: use shallow struct copy instead of mutating the
  caller's config.Identities field
- SetStaticIdentities: skip nil entries to avoid panics
- GetStaticIdentities: defensively copy PolicyNames slice to avoid
  aliasing the original

* fix: filter nil static identities and sync on config reload

- SetStaticIdentities: filter nil entries from the stored slice (not
  just from staticNames) to prevent panics in LoadConfiguration/ListUsers
- Extract updateCredentialManagerStaticIdentities helper and call it
  from both startup and the grace.OnReload handler so the credential
  manager's static snapshot stays current after config file reloads

* fix: add mutex for static identity fields and fix ListUsers for store callers

- Add sync.RWMutex to protect staticIdentities/staticNames against
  concurrent reads during config reload
- Revert CredentialManager.ListUsers to return only store users, since
  internal callers (e.g. DeletePolicy) look up each user in the store
  and fail on non-existent static entries
- Merge static usernames in the filer gRPC ListUsers handler instead,
  via the new GetStaticUsernames method
- Fix CI: TestIAMPolicyManagement/managed_policy_crud_lifecycle was
  failing because DeletePolicy iterated static users that don't exist
  in the store

* fix: show static identities in admin UI and weed shell

The admin UI and weed shell s3.configure command query the filer's
credential manager via gRPC, which is a separate instance from the S3
server's credential manager. Static identities were only registered
on the S3 server's credential manager, so they never appeared in the
filer's responses.

- Add CredentialManager.LoadS3ConfigFile to parse a static S3 config
  file and register its identities
- Add FilerOptions.s3ConfigFile so the filer can load the same static
  config that the S3 server uses
- Wire s3ConfigFile through in weed mini and weed server modes
- Merge static usernames in filer gRPC ListUsers handler
- Add CredentialManager.GetStaticUsernames helper
- Add sync.RWMutex to protect concurrent access to static identity
  fields
- Avoid importing weed/filer from weed/credential (which pulled in
  filer store init() registrations and broke test isolation)
- Add docker/compose/s3_static_users_example.json

* fix(admin): make static users read-only in admin UI

Static users loaded from the -s3.config file should not be editable
or deletable through the admin UI since they are managed via the
config file.

- Add IsStatic field to ObjectStoreUser, set from credential manager
- Hide edit, delete, and access key buttons for static users in the
  users table template
- Show a "static" badge next to static user names
- Return 403 Forbidden from UpdateUser and DeleteUser API handlers
  when the target user is a static identity

* fix(admin): show details for static users

GetObjectStoreUserDetails called credentialManager.GetUser which only
queries the dynamic store. For static users this returned
ErrUserNotFound. Fall back to GetStaticIdentity when the store lookup
fails.

* fix(admin): load static S3 identities in admin server

The admin server has its own credential manager (gRPC store) which is
a separate instance from the S3 server's and filer's. It had no static
identity data, so IsStaticIdentity returned false (edit/delete buttons
shown) and GetStaticIdentity returned nil (details page failed).

Pass the -s3.config file path through to the admin server and call
LoadS3ConfigFile on its credential manager, matching the approach
used for the filer.

* fix: use protobuf is_static field instead of passing config file path

The previous approach passed -s3.config file path to every component
(filer, admin). This is wrong because the admin server should not need
to know about S3 config files.

Instead, add an is_static field to the Identity protobuf message.
The field is set when static identities are serialized (in
GetStaticIdentities and LoadS3ConfigFile). Any gRPC client that loads
configuration via GetConfiguration automatically sees which identities
are static, without needing the config file.

- Add is_static field (tag 8) to iam_pb.Identity proto message
- Set IsStatic=true in GetStaticIdentities and LoadS3ConfigFile
- Admin GetObjectStoreUsers reads identity.IsStatic from proto
- Admin IsStaticUser helper loads config via gRPC to check the flag
- Filer GetUser gRPC handler falls back to GetStaticIdentity
- Remove s3ConfigFile from AdminOptions and NewAdminServer signature
2026-04-03 20:01:28 -07:00
Chris Lu
0798b274dd feat(s3): add concurrent chunk prefetch for large file downloads (#8917)
* feat(s3): add concurrent chunk prefetch for large file downloads

Add a pipe-based prefetch pipeline that overlaps chunk fetching with
response writing during S3 GetObject, SSE downloads, and filer proxy.

While chunk N streams to the HTTP response, fetch goroutines for the
next K chunks establish HTTP connections to volume servers ahead of
time, eliminating the RTT gap between sequential chunk fetches.

Uses io.Pipe for minimal memory overhead (~1MB per download regardless
of chunk size, vs buffering entire chunks). Also increases the
streaming read buffer from 64KB to 256KB to reduce syscall overhead.

Benchmark results (64KB chunks, prefetch=4):
- 0ms latency:  1058 → 2362 MB/s (2.2× faster)
- 5ms latency:  11.0 → 41.7 MB/s (3.8× faster)
- 10ms latency: 5.9  → 23.3 MB/s (4.0× faster)
- 20ms latency: 3.1  → 12.1 MB/s (3.9× faster)

* fix: address review feedback for prefetch pipeline

- Fix data race: use *chunkPipeResult (pointer) on channel to avoid
  copying struct while fetch goroutines write to it. Confirmed clean
  with -race detector.
- Remove concurrent map write: retryWithCacheInvalidation no longer
  updates fileId2Url map. Producer only reads it; consumer never writes.
- Use mem.Allocate/mem.Free for copy buffer to reduce GC pressure.
- Add local cancellable context so consumer errors (client disconnect)
  immediately stop the producer and all in-flight fetch goroutines.

* fix(test): remove dead code and add Range header support in test server

- Remove unused allData variable in makeChunksAndServer
- Add Range header handling to createTestServer for partial chunk
  read coverage (206 Partial Content, 416 Range Not Satisfiable)

* fix: correct retry condition and goroutine leak in prefetch pipeline

- Fix retry condition: use result.fetchErr/result.written instead of
  copied to decide cache-invalidation retry. The old condition wrongly
  triggered retry when the fetch succeeded but the response writer
  failed on the first write (copied==0 despite fetcher having data).
  Now matches the sequential path (stream.go:197) which checks whether
  the fetcher itself wrote zero bytes.

- Fix goroutine leak: when the producer's send to the results channel
  is interrupted by context cancellation, the fetch goroutine was
  already launched but the result was never sent to the channel. The
  drain loop couldn't handle it. Now waits on result.done before
  returning so every fetch goroutine is properly awaited.
2026-04-03 19:57:30 -07:00
Chris Lu
995dfc4d5d chore: remove ~50k lines of unreachable dead code (#8913)
* chore: remove unreachable dead code across the codebase

Remove ~50,000 lines of unreachable code identified by static analysis.

Major removals:
- weed/filer/redis_lua: entire unused Redis Lua filer store implementation
- weed/wdclient/net2, resource_pool: unused connection/resource pool packages
- weed/plugin/worker/lifecycle: unused lifecycle plugin worker
- weed/s3api: unused S3 policy templates, presigned URL IAM, streaming copy,
  multipart IAM, key rotation, and various SSE helper functions
- weed/mq/kafka: unused partition mapping, compression, schema, and protocol functions
- weed/mq/offset: unused SQL storage and migration code
- weed/worker: unused registry, task, and monitoring functions
- weed/query: unused SQL engine, parquet scanner, and type functions
- weed/shell: unused EC proportional rebalance functions
- weed/storage/erasure_coding/distribution: unused distribution analysis functions
- Individual unreachable functions removed from 150+ files across admin,
  credential, filer, iam, kms, mount, mq, operation, pb, s3api, server,
  shell, storage, topology, and util packages

* fix(s3): reset shared memory store in IAM test to prevent flaky failure

TestLoadIAMManagerFromConfig_EmptyConfigWithFallbackKey was flaky because
the MemoryStore credential backend is a singleton registered via init().
Earlier tests that create anonymous identities pollute the shared store,
causing LookupAnonymous() to unexpectedly return true.

Fix by calling Reset() on the memory store before the test runs.

* style: run gofmt on changed files

* fix: restore KMS functions used by integration tests

* fix(plugin): prevent panic on send to closed worker session channel

The Plugin.sendToWorker method could panic with "send on closed channel"
when a worker disconnected while a message was being sent. The race was
between streamSession.close() closing the outgoing channel and sendToWorker
writing to it concurrently.

Add a done channel to streamSession that is closed before the outgoing
channel, and check it in sendToWorker's select to safely detect closed
sessions without panicking.
2026-04-03 16:04:27 -07:00
Chris Lu
af68449a26 Process .ecj deletions during EC decode and vacuum decoded volume (#8863)
* Process .ecj deletions during EC decode and vacuum decoded volume (#8798)

When decoding EC volumes back to normal volumes, deletions recorded in
the .ecj journal were not being applied before computing the dat file
size or checking for live needles. This caused the decoded volume to
include data for deleted files and could produce false positives in the
all-deleted check.

- Call RebuildEcxFile before HasLiveNeedles/FindDatFileSize in
  VolumeEcShardsToVolume so .ecj deletions are merged into .ecx first
- Vacuum the decoded volume after mounting in ec.decode to compact out
  deleted needle data from the .dat file
- Add integration tests for decoding with non-empty .ecj files

* storage: add offline volume compaction helper

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: compact decoded volumes before deleting shards

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* ec: address PR review comments

- Fall back to data directory for .ecx when idx directory lacks it
- Make compaction failure non-fatal during EC decode
- Remove misleading "buffer: 10%" from space check error message

* ec: collect .ecj from all shard locations during decode

Each server's .ecj only contains deletions for needles whose data
resides in shards held by that server. Previously, sources with no
new data shards to contribute were skipped entirely, losing their
.ecj deletion entries. Now .ecj is always appended from every shard
location so RebuildEcxFile sees the full set of deletions.

* ec: add integration tests for .ecj collection during decode

TestEcDecodePreservesDeletedNeedles: verifies that needles deleted
via VolumeEcBlobDelete are excluded from the decoded volume.

TestEcDecodeCollectsEcjFromPeer: regression test for the fix in
collectEcShards. Deletes a needle only on a peer server that holds
no new data shards, then verifies the deletion survives decode via
.ecj collection.

* ec: address review nits in decode and tests

- Remove double error wrapping in mountDecodedVolume
- Check VolumeUnmount error in peer ecj test
- Assert 404 specifically for deleted needles, fail on 5xx

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-01 01:15:26 -07:00
Chris Lu
75a6a34528 dlm: resilient distributed locks via consistent hashing + backup replication (#8860)
* dlm: replace modulo hashing with consistent hash ring

Introduce HashRing with virtual nodes (CRC32-based consistent hashing)
to replace the modulo-based hashKeyToServer. When a filer node is
removed, only keys that hashed to that node are remapped to the next
server on the ring, leaving all other mappings stable. This is the
foundation for backup replication — the successor on the ring is
always the natural takeover node.

* dlm: add Generation and IsBackup fields to Lock

Lock now carries IsBackup (whether this node holds the lock as a backup
replica) and Generation (a monotonic fencing token that increments on
each fresh acquisition, stays the same on renewal). Add helper methods:
AllLocks, PromoteLock, DemoteLock, InsertBackupLock, RemoveLock, GetLock.

* dlm: add ReplicateLock RPC and generation/is_backup proto fields

Add generation field to LockResponse for fencing tokens.
Add generation and is_backup fields to Lock message.
Add ReplicateLock RPC for primary-to-backup lock replication.
Add ReplicateLockRequest/ReplicateLockResponse messages.

* dlm: add async backup replication to DistributedLockManager

Route lock/unlock via consistent hash ring's GetPrimaryAndBackup().
After a successful lock or unlock on the primary, asynchronously
replicate the operation to the backup server via ReplicateFunc
callback. Single-server deployments skip replication.

* dlm: add ReplicateLock handler and backup-aware topology changes

Add ReplicateLock gRPC handler for primary-to-backup replication.
Revise OnDlmChangeSnapshot to handle three cases on topology change:
- Promote backup locks when this node becomes primary
- Demote primary locks when this node becomes backup
- Transfer locks when this node is neither primary nor backup
Wire up SetupDlmReplication during filer server initialization.

* dlm: expose generation fencing token in lock client

LiveLock now captures the generation from LockResponse and exposes it
via Generation() method. Consumers can use this as a fencing token to
detect stale lock holders.

* dlm: update empty folder cleaner to use consistent hash ring

Replace local modulo-based hashKeyToServer with LockRing.GetPrimary()
which uses the shared consistent hash ring for folder ownership.

* dlm: add unit tests for consistent hash ring

Test basic operations, consistency on server removal (only keys from
removed server move), backup-is-successor property (backup becomes
new primary when primary is removed), and key distribution balance.

* dlm: add integration tests for lock replication failure scenarios

Test cases:
- Primary crash with backup promotion (backup has valid token)
- Backup crash with primary continuing
- Both primary and backup crash (lock lost, re-acquirable)
- Rolling restart across all nodes
- Generation fencing token increments on new acquisition
- Replication failure (primary still works independently)
- Unlock replicates deletion to backup
- Lock survives server addition (topology change)
- Consistent hashing minimal disruption (only removed server's keys move)

* dlm: address PR review findings

1. Causal replication ordering: Add per-lock sequence number (Seq) that
   increments on every mutation. Backup rejects incoming mutations with
   seq <= current seq, preventing stale async replications from
   overwriting newer state. Unlock replication also carries seq and is
   rejected if stale.

2. Demote-after-handoff: OnDlmChangeSnapshot now transfers the lock to
   the new primary first and only demotes to backup after a successful
   TransferLocks RPC. If the transfer fails, the lock stays as primary
   on this node.

3. SetSnapshot candidateServers leak: Replace the candidateServers map
   entirely instead of appending, so removed servers don't linger.

4. TransferLocks preserves Generation and Seq: InsertLock now accepts
   generation and seq parameters. After accepting a transferred lock,
   the receiving node re-replicates to its backup.

5. Rolling restart test: Add re-replication step after promotion and
   assert survivedCount > 0. Add TestDLM_StaleReplicationRejected.

6. Mixed-version upgrade note: Add comment on HashRing documenting that
   all filer nodes must be upgraded together.

* dlm: serve renewals locally during transfer window on node join

When a new node joins and steals hash ranges from surviving nodes,
there's a window between ring update and lock transfer where the
client gets redirected to a node that doesn't have the lock yet.

Fix: if the ring says primary != self but we still hold the lock
locally (non-backup, matching token), serve the renewal/unlock here
rather than redirecting. The lock will be transferred by
OnDlmChangeSnapshot, and subsequent requests will go to the new
primary once the transfer completes.

Add tests:
- TestDLM_NodeDropAndJoin_OwnershipDisruption: measures disruption
  when a node drops and a new one joins (14/100 surviving-node locks
  disrupted, all handled by transfer logic)
- TestDLM_RenewalDuringTransferWindow: verifies renewal succeeds on
  old primary during the transfer window

* dlm: master-managed lock ring with stabilization batching

The master now owns the lock ring membership. Instead of filers
independently reacting to individual ClusterNodeUpdate add/remove
events, the master:

1. Tracks filer membership in LockRingManager
2. Batches rapid changes with a 1-second stabilization timer
   (e.g., a node drop + join within 1 second → single ring update)
3. Broadcasts the complete ring snapshot atomically via the new
   LockRingUpdate message in KeepConnectedResponse

Filers receive the ring as a complete snapshot and apply it via
SetSnapshot, ensuring all filers converge to the same ring state
without intermediate churn.

This eliminates the double-churn problem where a rapid drop+join
would fire two separate ring mutations, each triggering lock
transfers and disrupting ownership on surviving nodes.

* dlm: track ring version, reject stale updates, remove dead code

SetSnapshot now takes a version parameter from the master. Stale
updates (version < current) are rejected, preventing reordered
messages from overwriting a newer ring state. Version 0 is always
accepted for bootstrap.

Remove AddServer/RemoveServer from LockRing — the ring is now
exclusively managed by the master via SetSnapshot. Remove the
candidateServers map that was only used by those methods.

* dlm: fix SelectLocks data race, advance generation on backup insert

- SelectLocks: change RLock to Lock since the function deletes map
  entries, which is a write operation and causes a data race under RLock.
- InsertBackupLock: advance nextGeneration to at least the incoming
  generation so that after failover promotion, new lock acquisitions
  get a generation strictly greater than any replicated lock.
- Bump replication failure log from V(1) to Warningf for production
  visibility.

* dlm: fix SetSnapshot race, test reliability, timer edge cases

- SetSnapshot: hold LockRing lock through both version update and
  Ring.SetServers() so they're atomic. Prevents a concurrent caller
  from seeing the new version but applying stale servers.
- Transfer window test: search for a key that actually moves primary
  when filer4 joins, instead of relying on a fixed key that may not.
- renewLock redirect: pass the existing token to the new primary
  instead of empty string, so redirected renewals work correctly.
- scheduleBroadcast: check timer.Stop() return value. If the timer
  already fired, the callback picks up latest state.
- FlushPending: only broadcast if timer.Stop() returns true (timer
  was still pending). If false, the callback is already running.
- Fix test comment: "idempotent" → "accepted, state-changing".

* dlm: use wall-clock nanoseconds for lock ring version

The lock ring version was an in-memory counter that reset to 0 on
master restart. A filer that had seen version 5 would reject version 1
from the restarted master.

Fix: use time.Now().UnixNano() as the version. This survives master
restarts without persistence — the restarted master produces a
version greater than any pre-restart value.

* dlm: treat expired lock owners as missing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: reject stale lock transfers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: order replication by generation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* dlm: bootstrap lock ring on reconnect

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-03-30 23:29:56 -07:00
Chris Lu
4705d8b82b Fix stale admin lock metric when lock expires and is reacquired (#8859)
* Fix stale admin lock metric when lock expires and is reacquired (#8857)

When a lock expired without an explicit unlock and a different client
acquired it, the old client's metric was never cleared, causing
multiple clients to appear as simultaneously holding the lock.

* Use DeleteLabelValues instead of Set(0) to remove stale metric series

Avoids cardinality explosion from accumulated stale series when
client names are dynamic.
2026-03-30 18:51:38 -07:00
Chris Lu
ced2236cc6 Adjust rename events metadata format (#8854)
* rename metadata events

* fix subscription filter to use NewEntry.Name for rename path matching

The server-side subscription filter constructed the new path using
OldEntry.Name instead of NewEntry.Name when checking if a rename
event's destination matches the subscriber's path prefix. This could
cause events to be incorrectly filtered when a rename changes the
file name.

* fix bucket events to handle rename of bucket directories

onBucketEvents only checked IsCreate and IsDelete. A bucket directory
rename via AtomicRenameEntry now emits a single rename event (both
OldEntry and NewEntry non-nil), which matched neither check. Handle
IsRename by deleting the old bucket and creating the new one.

* fix replicator to handle rename events across directory boundaries

Two issues fixed:

1. The replicator filtered events by checking if the key (old path)
   was under the source directory. Rename events now use the old path
   as key, so renames from outside into the watched directory were
   silently dropped. Now both old and new paths are checked, and
   cross-boundary renames are converted to create or delete.

2. NewParentPath was passed to the sink without remapping to the
   sink's target directory structure, causing the sink to write
   entries at the wrong location. Now NewParentPath is remapped
   alongside the key.

* fix filer sync to handle rename events crossing directory boundaries

The early directory-prefix filter only checked resp.Directory (old
parent). Rename events now carry the old parent as Directory, so
renames from outside the source path into it were dropped before
reaching the existing cross-boundary handling logic. Check both old
and new directories against sourcePath and excludePaths so the
downstream old-key/new-key logic can properly convert these to
create or delete operations.

* fix metadata event path matching

* fix metadata event consumers for rename targets

* Fix replication rename target keys

Logical rename events now reach replication sinks with distinct source and target paths.\n\nHandle non-filer sinks as delete-plus-create on the translated target key, and make the rename fallback path create at the translated target key too.\n\nAdd focused tests covering non-filer renames, filer rename updates, and the fallback path.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix filer sync rename path scoping

Use directory-boundary matching instead of raw prefix checks when classifying source and target paths during filer sync.\n\nAlso apply excludePaths per side so renames across excluded boundaries downgrade cleanly to create/delete instead of being misclassified as in-scope updates.\n\nAdd focused tests for boundary matching and rename classification.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix replicator directory boundary checks

Use directory-boundary matching instead of raw prefix checks when deciding whether a source or target path is inside the watched tree or an excluded subtree.\n\nThis prevents sibling paths such as /foo and /foobar from being misclassified during rename handling, and preserves the earlier rename-target-key fix.\n\nAdd focused tests for boundary matching and rename classification across sibling/excluded directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix etc-remote rename-out handling

Use boundary-safe source/target directory membership when classifying metadata events under DirectoryEtcRemote.\n\nThis prevents rename-out events from being processed as config updates, while still treating them as removals where appropriate for the remote sync and remote gateway command paths.\n\nAdd focused tests for update/removal classification and sibling-prefix handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Defer rename events until commit

Queue logical rename metadata events during atomic and streaming renames and publish them only after the transaction commits successfully.\n\nThis prevents subscribers from seeing delete or logical rename events for operations that later fail during delete or commit.\n\nAlso serialize notification.Queue swaps in rename tests and add failure-path coverage.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Skip descendant rename target lookups

Avoid redundant target lookups during recursive directory renames once the destination subtree is known absent.\n\nThe recursive move path now inserts known-absent descendants directly, and the test harness exercises prefixed directory listing so the optimization is covered by a directory rename regression test.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Tighten rename review tests

Return filer_pb.ErrNotFound from the bucket tracking store test stub so it follows the FilerStore contract, and add a webhook filter case for same-name renames across parent directories.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix HardLinkId format verb in InsertEntryKnownAbsent error

HardLinkId is a byte slice. %d prints each byte as a decimal number
which is not useful for an identifier. Use %x to match the log line
two lines above.

* only skip descendant target lookup when source and dest use same store

moveFolderSubEntries unconditionally passed skipTargetLookup=true for
every descendant. This is safe when all paths resolve to the same
underlying store, but with path-specific store configuration a child's
destination may map to a different backend that already holds an entry
at that path. Use FilerStoreWrapper.SameActualStore to check per-child
and fall back to the full CreateEntry path when stores differ.

* add nil and create edge-case tests for metadata event scope helpers

* extract pathIsEqualOrUnder into util.IsEqualOrUnder

Identical implementations existed in both replication/replicator.go and
command/filer_sync.go. Move to util.IsEqualOrUnder (alongside the
existing FullPath.IsUnder) and remove the duplicates.

* use MetadataEventTargetDirectory for new-side directory in filer sync

The new-side directory checks and sourceNewKey computation used
message.NewParentPath directly. If NewParentPath were empty (legacy
events, older filer versions during rolling upgrades), sourceNewKey
would be wrong (/filename instead of /dir/filename) and the
UpdateEntry parent path rewrite would panic on slice bounds.

Derive targetDir once from MetadataEventTargetDirectory, which falls
back to resp.Directory when NewParentPath is empty, and use it
consistently for all new-side checks and the sink parent path.
2026-03-30 18:25:11 -07:00
Chris Lu
e5ad5e8d4a fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume (#8836)
* fix(filer): apply default disk type after location-prefix resolution in gRPC AssignVolume

The gRPC AssignVolume path was applying the filer's default DiskType to
the request before calling detectStorageOption. This caused the default
to shadow any disk type configured via a filer location-prefix rule,
diverging from the HTTP write path which applies the default only when
no rule matches.

Extract resolveAssignStorageOption to apply the filer default disk type
after detectStorageOption, so location-prefix rules take precedence.

* fix(filer): apply default disk type after location-prefix resolution in TUS upload path

Same class of bug as the gRPC AssignVolume fix: the TUS tusWriteData
handler called detectStorageOption0 but never applied the filer's
default DiskType when no location-prefix rule matched. This made TUS
uploads ignore the -disk flag entirely.
2026-03-29 14:18:24 -07:00
Chris Lu
c2c58419b8 filer.sync: send log file chunk fids to clients for direct volume server reads (#8792)
* filer.sync: send log file chunk fids to clients for direct volume server reads

Instead of the server reading persisted log files from volume servers, parsing
entries, and streaming them over gRPC (serial bottleneck), clients that opt in
via client_supports_metadata_chunks receive log file chunk references (fids)
and read directly from volume servers in parallel.

New proto messages:
- LogFileChunkRef: chunk fids + timestamp + filer ID for one log file
- SubscribeMetadataRequest.client_supports_metadata_chunks: client opt-in
- SubscribeMetadataResponse.log_file_refs: server sends refs during backlog

Server changes:
- CollectLogFileRefs: lists log files and returns chunk refs without any
  volume server I/O (metadata-only operation)
- SubscribeMetadata/SubscribeLocalMetadata: when client opts in, sends refs
  during persisted log phase, then falls back to normal streaming for
  in-memory events

Client changes:
- ReadLogFileRefs: reads log files from volume servers, parses entries,
  filters by path prefix, invokes processEventFn
- MetadataFollowOption.LogFileReaderFn: factory for chunk readers,
  enables metadata chunks when non-nil
- Both filer_pb_tail.go and meta_aggregator.go recv loops accumulate
  refs then process them at the disk→memory transition

Backward compatible: old clients don't set the flag, get existing behavior.

Ref: #8771

* filer.sync: merge entries across filers in timestamp order on client side

ReadLogFileRefs now groups refs by filer ID and merges entries from
multiple filers using a min-heap priority queue — the same algorithm
the server uses in OrderedLogVisitor + LogEntryItemPriorityQueue.

This ensures events are processed in correct timestamp order even when
log files from different filers have interleaved timestamps. Single-filer
case takes the fast path (no heap allocation).

* filer.sync: integration tests for direct-read metadata chunks

Three test categories:

1. Merge correctness (TestReadLogFileRefsMergeOrder):
   Verifies entries from 3 filers are delivered in strict timestamp order,
   matching the server-side OrderedLogVisitor guarantee.

2. Path filtering (TestReadLogFileRefsPathFilter):
   Verifies client-side path prefix filtering works correctly.

3. Throughput comparison (TestDirectReadVsServerSideThroughput):
   3 filers × 7 files × 300 events = 6300 events, 2ms per file read:

   server-side:  6300 events  218ms   28,873 events/sec
   direct-read:  6300 events   51ms  123,566 events/sec  (4.3x)
   parallel:     6300 events   17ms  378,628 events/sec  (13.1x)

   Direct-read eliminates gRPC send overhead per event (4.3x).
   Parallel per-filer reading eliminates serial file I/O (13.1x).

* filer.sync: parallel per-filer reads with prefetching in ReadLogFileRefs

ReadLogFileRefs now has two levels of I/O overlap:

1. Cross-filer parallelism: one goroutine per filer reads its files
   concurrently. Entries feed into per-filer channels, merged by the
   main goroutine via min-heap (same ordering guarantee as the server's
   OrderedLogVisitor).

2. Within-filer prefetching: while the current file's entries are being
   consumed by the merge heap, the next file is already being read from
   the volume server in a background goroutine.

Single-filer fast path avoids the heap and channels.

Test results (3 filers × 7 files × 300 events, 2ms per file read):

  server-side sequential:  6300 events  212ms   29,760 events/sec
  parallel + prefetch:     6300 events   36ms  177,443 events/sec
  Speedup: 6.0x

* filer.sync: address all review comments on metadata chunks PR

Critical fixes:
- sendLogFileRefs: bypass pipelinedSender, send directly on gRPC stream.
  Ref messages have TsNs=0 and were being incorrectly batched into the
  Events field by the adaptive batching logic, corrupting ref delivery.
- readLogFileEntries: use io.ReadFull instead of reader.Read to prevent
  partial reads from corrupting size values or protobuf data.
- Error handling: only skip chunk-not-found errors (matching server-side
  isChunkNotFoundError). Other I/O or decode failures are propagated so
  the follower can retry.

High-priority fixes:
- CollectLogFileRefs: remove incorrect +24h padding from stopTime. The
  extra day caused unnecessary log file refs to be collected.
- Path filtering: ReadLogFileRefs now accepts PathFilter struct with
  PathPrefix, AdditionalPathPrefixes, and DirectoriesToWatch. Uses
  util.Join for path construction (avoids "//foo" on root). Excludes
  /.system/log/ internal entries. Matches server-side
  eachEventNotificationFn filtering logic.

Medium-priority fixes:
- CollectLogFileRefs: accept context.Context, propagate to
  ListDirectoryEntries calls for cancellation support.
- NewChunkStreamReaderFromLookup: accept context.Context, propagate to
  doNewChunkStreamReader.

Test fixes:
- Check error returns from ReadLogFileRefs in all test call sites.

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-27 11:01:29 -07:00
Chris Lu
d97660d0cd filer.sync: pipelined subscription with adaptive batching for faster catch-up (#8791)
* filer.sync: pipelined subscription with adaptive batching for faster catch-up

The SubscribeMetadata pipeline was fully serial: reading a log entry from a
volume server, unmarshaling, filtering, and calling stream.Send() all happened
one-at-a-time. stream.Send() blocked the entire pipeline until the client
acknowledged each event, limiting throughput to ~80 events/sec regardless of
the -concurrency setting.

Three server-side optimizations that stack:

1. Pipelined sender: decouple stream.Send() from the read loop via a buffered
   channel (1024 messages). A dedicated goroutine handles gRPC delivery while
   the reader continues processing the next events.

2. Adaptive batching: when event timestamps are >2min behind wall clock
   (backlog catch-up), drain multiple events from the channel and pack them
   into a single stream.Send() using a new `repeated events` field on
   SubscribeMetadataResponse. When events are recent (real-time), send
   one-by-one for low latency. Old clients ignore the new field (backward
   compatible).

3. Persisted log readahead: run the OrderedLogVisitor in a background
   goroutine so volume server I/O for the next log file overlaps with event
   processing and gRPC delivery.

4. Event-driven aggregated subscription: replace time.Sleep(1127ms) polling
   in SubscribeMetadata with notification-driven wake-up using the
   MetaLogBuffer subscriber mechanism, reducing real-time latency from
   ~1127ms to sub-millisecond.

Combined, these create a 3-stage pipeline:
  [Volume I/O → readahead buffer] → [Filter → send buffer] → [gRPC Send]

Test results (simulated backlog with 50µs gRPC latency per Send):
  direct (old):        2100 events  2100 sends  168ms   12,512 events/sec
  pipelined+batched:   2100 events    14 sends   40ms   52,856 events/sec
  Speedup: 4.2x single-stream throughput

Ref: #8771

* filer.sync: require client opt-in for batch event delivery

Add ClientSupportsBatching field to SubscribeMetadataRequest. The server
only packs events into the Events batch field when the client explicitly
sets this flag to true. Old clients (Java SDK, third-party) that don't
set the flag get one-event-per-Send, preserving backward compatibility.

All Go callers (FollowMetadata, MetaAggregator) set the flag to true
since their recv loops already unpack batched events.

* filer.sync: clear batch Events field after Send to release references

Prevents the envelope message from holding references to the rest of the
batch after gRPC serialization, allowing the GC to collect them sooner.

* filer.sync: fix Send deadlock, add error propagation test, event-driven local subscribe

- pipelinedSender.Send: add case <-s.done to unblock when sender goroutine
  exits (fixes deadlock when errCh was already consumed by a prior Send).
- pipelinedSender.reportErr: remove for-range drain on sendCh that could
  block indefinitely. Send() now detects exit via s.done instead.
- SubscribeLocalMetadata: replace remaining time.Sleep(1127ms) in the
  gap-detected-no-memory-data path with event-driven listenersCond.Wait(),
  consistent with the rest of the subscription paths.
- Add TestPipelinedSenderErrorPropagation: verifies error surfaces via
  Send and Close when the underlying stream fails.
- Replace goto with labeled break in test simulatePipeline.

* filer.sync: check error returns in test code

- direct_send: check slowStream.Send error return
- pipelined_batched_send: check sender.Close error return
- simulatePipeline: return error from sender.Close, propagate to callers

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-26 23:55:42 -07:00
Chris Lu
3a3fff1399 Fix TUS chunked upload and resume failures (#8783) (#8786)
* Fix TUS chunked upload and resume failures caused by request context cancellation (#8783)

The filer's TCP connections use a 10-second inactivity timeout (net_timeout.go).
After the TUS PATCH request body is fully consumed, internal operations (assigning
file IDs via gRPC to the master, uploading data to volume servers, completing uploads)
do not generate any activity on the client connection, so the inactivity timer fires
and Go's HTTP server cancels the request context. This caused HTTP 500 errors on
PATCH requests where body reading + internal processing exceeded the timeout.

Fix by using context.WithoutCancel in TUS create and patch handlers, matching the
existing pattern used by assignNewFileInfo. This ensures internal operations complete
regardless of client connection state.

Fixes seaweedfs/seaweedfs#8783

* Add comment to tusCreateHandler explaining context.WithoutCancel rationale

* Run TUS integration tests on all PRs, not just TUS file changes

The previous path filter meant these tests only ran when TUS-specific files
changed. This allowed regressions from changes to shared infrastructure
(net_timeout.go, upload paths, gRPC) to go undetected — which is exactly
how the context cancellation bug in #8783 was missed. Matches the pattern
used by s3-go-tests.yml.
2026-03-26 14:06:21 -07:00
Lisandro Pin
e5cf2d2a19 Give the ScrubVolume() RPC an option to flag found broken volumes as read-only. (#8360)
* Give the `ScrubVolume()` RPC an option to flag found broken volumes as read-only.

Also exposes this option in the shell `volume.scrub` command.

* Remove redundant test in `TestVolumeMarkReadonlyWritableErrorPaths`.

417051bb slightly rearranges the logic for `VolumeMarkReadonly()` and `VolumeMarkWritable()`,
so calling them for invalid volume IDs will actually yield that error, instead of checking
maintnenance mode first.
2026-03-26 10:20:57 -07:00
Chris Lu
94bfa2b340 mount: stream all filer mutations over single ordered gRPC stream (#8770)
* filer: add StreamMutateEntry bidi streaming RPC

Add a bidirectional streaming RPC that carries all filer mutation
types (create, update, delete, rename) over a single ordered stream.
This eliminates per-request connection overhead for pipelined
operations and guarantees mutation ordering within a stream.

The server handler delegates each request to the existing unary
handlers (CreateEntry, UpdateEntry, DeleteEntry) and uses a proxy
stream adapter for rename operations to reuse StreamRenameEntry logic.

The is_last field signals completion for multi-response operations
(rename sends multiple events per request; create/update/delete
always send exactly one response with is_last=true).

* mount: add streaming mutation multiplexer (streamMutateMux)

Implement a client-side multiplexer that routes all filer mutation RPCs
(create, update, delete, rename) over a single bidirectional gRPC
stream. Multiple goroutines submit requests through a send channel;
a dedicated sendLoop serializes them on the stream; a recvLoop
dispatches responses to waiting callers via per-request channels.

Key features:
- Lazy stream opening on first use
- Automatic reconnection on stream failure
- Permanent fallback to unary RPCs if filer returns Unimplemented
- Monotonic request_id for response correlation
- Multi-response support for rename operations (is_last signaling)

The mux is initialized on WFS and closed during unmount cleanup.
No call sites use it yet — wiring comes in subsequent commits.

* mount: route CreateEntry and UpdateEntry through streaming mux

Wire all CreateEntry call sites to use wfs.streamCreateEntry() which
routes through the StreamMutateEntry stream when available, falling
back to unary RPCs otherwise. Also wire Link's UpdateEntry calls
through wfs.streamUpdateEntry().

Updated call sites:
- flushMetadataToFiler (file flush after write)
- Mkdir (directory creation)
- Symlink (symbolic link creation)
- createRegularFile non-deferred path (Mknod)
- flushFileMetadata (periodic metadata flush)
- Link (hard link: update source + create link + rollback)

* mount: route UpdateEntry and DeleteEntry through streaming mux

Wire remaining mutation call sites through the streaming mux:

- saveEntry (Setattr/chmod/chown/utimes) → streamUpdateEntry
- Unlink → streamDeleteEntry (replaces RemoveWithResponse)
- Rmdir → streamDeleteEntry (replaces RemoveWithResponse)

All filer mutations except Rename now go through StreamMutateEntry
when the filer supports it, with automatic unary RPC fallback.

* mount: route Rename through streaming mux

Wire Rename to use streamMutate.Rename() when available, with
fallback to the existing StreamRenameEntry unary stream.

The streaming mux sends rename as a StreamRenameEntryRequest oneof
variant. The server processes it through the existing rename logic
and sends multiple StreamRenameEntryResponse events (one per moved
entry), with is_last=true on the final response.

All filer mutations now go through a single ordered stream.

* mount: fix stream mux connection ownership

WithGrpcClient(streamingMode=true) closes the gRPC connection when
the callback returns, destroying the stream. Own the connection
directly via pb.GrpcDial so it stays alive for the stream's lifetime.
Close it explicitly in recvLoop on stream failure and in Close on
shutdown.

* mount: fix rename failure for deferred-create files

Three fixes for rename operations over the streaming mux:

1. lookupEntry: fall back to local metadata store when filer returns
   "not found" for entries in uncached directories. Files created with
   deferFilerCreate=true exist only in the local leveldb store until
   flushed; lookupEntry skipped the local store when the parent
   directory had never been readdir'd, causing rename to fail with
   ENOENT.

2. Rename: wait for pending async flushes and force synchronous flush
   of dirty metadata before sending rename to the filer. Covers the
   writebackCache case where close() defers the flush to a background
   worker that may not complete before rename fires.

3. StreamMutateEntry: propagate rename errors from server to client.
   Add error/errno fields to StreamMutateEntryResponse so the mount
   can map filer errors to correct FUSE status codes instead of
   silently returning OK. Also fix the existing Rename error handler
   which could return fuse.OK on unrecognized errors.

* mount: fix streaming mux error handling, sendLoop lifecycle, and fallback

Address PR review comments:

1. Server: populate top-level Error/Errno on StreamMutateEntryResponse for
   create/update/delete errors, not just rename. Previously update errors
   were silently dropped and create/delete errors were only in nested
   response fields that the client didn't check.

2. Client: check nested error fields in CreateEntry (ErrorCode, Error)
   and DeleteEntry (Error) responses, matching CreateEntryWithResponse
   behavior.

3. Fix sendLoop lifecycle: give each stream generation a stopSend channel.
   recvLoop closes it on error to stop the paired sendLoop. Previously a
   reconnect left the old sendLoop draining sendCh, breaking ordering.

4. Transparent fallback: stream helpers and doRename fall back to unary
   RPCs on transport errors (ErrStreamTransport), including the first
   Unimplemented from ensureStream. Previously the first call failed
   instead of degrading.

5. Filer rotation in openStream: try all filer addresses on dial failure,
   matching WithFilerClient behavior. Stop early on Unimplemented.

6. Pass metadata-bearing context to StreamMutateEntry RPC call so
   sw-client-id header is actually sent.

7. Gate lookupEntry local-cache fallback on open dirty handle or pending
   async flush to avoid resurrecting deleted/renamed entries.

8. Remove dead code in flushFileMetadata (err=nil followed by if err!=nil).

9. Use string matching for rename error-to-errno mapping in the mount to
   stay portable across Linux/macOS (numeric errno values differ).

* mount: make failAllPending idempotent with delete-before-close

Change failAllPending to collect pending entries into a local slice
(deleting from the sync.Map first) before closing channels. This
prevents double-close panics if called concurrently. Also remove the
unused err parameter.

* mount: add stream generation tracking and teardownStream

Introduce a generation counter on streamMutateMux that increments each
time a new stream is created. Requests carry the generation they were
enqueued for so sendLoop can reject stale requests after reconnect.

Add teardownStream(gen) which is idempotent (only acts when gen matches
current generation and stream is non-nil). Both sendLoop and recvLoop
call it on error, replacing the inline cleanup in recvLoop. sendLoop now
actively triggers teardown on send errors instead of silently exiting.

ensureStream waits for the prior generation's recvDone before creating a
new stream, ensuring all old pending waiters are failed before reconnect.
recvLoop now takes the stream, generation, and recvDone channel as
parameters to avoid accessing shared fields without the lock.

* mount: harden Close to prevent races with teardownStream

Nil out stream, cancel, and grpcConn under the lock so that any
concurrent teardownStream call from recvLoop/sendLoop becomes a no-op.
Call failAllPending before closing sendCh to unblock waiters promptly.
Guard recvDone with a nil check for the case where Close is called
before any stream was ever opened.

* mount: make errCh receive ctx-aware in doUnary and Rename

Replace the blocking <-sendReq.errCh with a select that also observes
ctx.Done(). If sendLoop exits via stopSend without consuming a buffered
request, the caller now returns ctx.Err() instead of blocking forever.
The buffered errCh (capacity 1) ensures late acknowledgements from
sendLoop don't block the sender.

* mount: fix sendLoop/Close race and recvLoop/teardown pending channel race

Three related fixes:

1. Stop closing sendCh in Close(). Closing the shared producer channel
   races with callers who passed ensureStream() but haven't sent yet,
   causing send-on-closed-channel panics. sendCh is now left open;
   ensureStream checks m.closed to reject new callers.

2. Drain buffered sendCh items on shutdown. sendLoop defers drainSendCh()
   on exit so buffered requests get an ErrStreamTransport on their errCh
   instead of blocking forever. Close() drains again for any stragglers
   enqueued between sendLoop's drain and the final shutdown.

3. Move failAllPending from teardownStream into recvLoop's defer.
   teardownStream (called from sendLoop on send error) was closing
   pending response channels while recvLoop could be between
   pending.Load and the channel send — a send-on-closed-channel panic.
   recvLoop is now the sole closer of pending channels, eliminating
   the race. Close() waits on recvDone (with cancel() to guarantee
   Recv unblocks) so pending cleanup always completes.

* filer/mount: add debug logging for hardlink lifecycle

Add V(0) logging at every point where a HardLinkId is created, stored,
read, or deleted to trace orphaned hardlink references. Logging covers:

- gRPC server: CreateEntry/UpdateEntry when request carries HardLinkId
- FilerStoreWrapper: InsertEntry/UpdateEntry when entry has HardLinkId
- handleUpdateToHardLinks: entry path, HardLinkId, counter, chunk count
- setHardLink: KvPut with blob size
- maybeReadHardLink: V(1) on read attempt and successful decode
- DeleteHardLink: counter decrement/deletion events
- Mount Link(): when NewHardLinkId is generated and link is created

This helps diagnose how a git pack .rev file ended up with a
HardLinkId during a clone (no hard links should be involved).

* test: add git clone/pull integration test for FUSE mount

Shell script that exercises git operations on a SeaweedFS mount:
1. Creates a bare repo on the mount
2. Clones locally, makes 3 commits, pushes to mount
3. Clones from mount bare repo into an on-mount working dir
4. Verifies clone integrity (files, content, commit hashes)
5. Pushes 2 more commits with renames and deletes
6. Checks out an older revision on the mount clone
7. Returns to branch and pulls with real changes
8. Verifies file content, renames, deletes after pull
9. Checks git log integrity and clean status

27 assertions covering file existence, content, commit hashes,
file counts, renames, deletes, and git status. Run against any
existing mount: bash test-git-on-mount.sh /path/to/mount

* test: add git clone/pull FUSE integration test to CI suite

Add TestGitOperations to the existing fuse_integration test framework.
The test exercises git's full file operation surface on the mount:

  1. Creates a bare repo on the mount (acts as remote)
  2. Clones locally, makes 3 commits (files, bulk data, renames), pushes
  3. Clones from mount bare repo into an on-mount working dir
  4. Verifies clone integrity (content, commit hash, file count)
  5. Pushes 2 more commits with new files, renames, and deletes
  6. Checks out an older revision on the mount clone
  7. Returns to branch and pulls with real fast-forward changes
  8. Verifies post-pull state: content, renames, deletes, file counts
  9. Checks git log integrity (5 commits) and clean status

Runs automatically in the existing fuse-integration.yml CI workflow.

* mount: fix permission check with uid/gid mapping

The permission checks in createRegularFile() and Access() compared
the caller's local uid/gid against the entry's filer-side uid/gid
without applying the uid/gid mapper. With -map.uid 501:0, a directory
created as uid 0 on the filer would not match the local caller uid
501, causing hasAccess() to fall through to "other" permission bits
and reject write access (0755 → other has r-x, no w).

Fix: map entry uid/gid from filer-space to local-space before the
hasAccess() call so both sides are in the same namespace.

This fixes rsync -a failing with "Permission denied" on mkstempat
when using uid/gid mapping.

* mount: fix Mkdir/Symlink returning filer-side uid/gid to kernel

Mkdir and Symlink used `defer wfs.mapPbIdFromFilerToLocal(entry)` to
restore local uid/gid, but `outputPbEntry` writes the kernel response
before the function returns — so the kernel received filer-side
uid/gid (e.g., 0:0). macFUSE then caches these and rejects subsequent
child operations (mkdir, create) because the caller uid (501) doesn't
match the directory owner (0), and "other" bits (0755 → r-x) lack
write permission.

Fix: replace the defer with an explicit call to mapPbIdFromFilerToLocal
before outputPbEntry, so the kernel gets local uid/gid.

Also add nil guards for UidGidMapper in Access and createRegularFile
to prevent panics in tests that don't configure a mapper.

This fixes rsync -a "Permission denied" on mkpathat for nested
directories when using uid/gid mapping.

* mount: fix Link outputting filer-side uid/gid to kernel, add nil guards

Link had the same defer-before-outputPbEntry bug as Mkdir and Symlink:
the kernel received filer-side uid/gid because the defer hadn't run
yet when outputPbEntry wrote the response.

Also add nil guards for UidGidMapper in Access and createRegularFile
so tests without a mapper don't panic.

Audit of all outputPbEntry/outputFilerEntry call sites:
- Mkdir: fixed in prior commit (explicit map before output)
- Symlink: fixed in prior commit (explicit map before output)
- Link: fixed here (explicit map before output)
- Create (existing file): entry from maybeLoadEntry (already mapped)
- Create (deferred): entry has local uid/gid (never mapped to filer)
- Create (non-deferred): createRegularFile defer runs before return
- Mknod: createRegularFile defer runs before return
- Lookup: entry from lookupEntry (already mapped)
- GetAttr: entry from maybeReadEntry/maybeLoadEntry (already mapped)
- readdir: entry from cache (mapIdFromFilerToLocal) or filer (mapped)
- saveEntry: no kernel output
- flushMetadataToFiler: no kernel output
- flushFileMetadata: no kernel output

* test: fix git test for same-filesystem FUSE clone

When both the bare repo and working clone live on the same FUSE mount,
git's local transport uses hardlinks and cross-repo stat calls that
fail on FUSE. Fix:

- Use --no-local on clone to disable local transport optimizations
- Use reset --hard instead of checkout to stay on branch
- Use fetch + reset --hard origin/<branch> instead of git pull to
  avoid local transport stat failures during fetch

* adjust logging

* test: use plain git clone/pull to exercise real FUSE behavior

Remove --no-local and fetch+reset workarounds. The test should use
the same git commands users run (clone, reset --hard, pull) so it
reveals real FUSE issues rather than hiding them.

* test: enable V(1) logging for filer/mount and collect logs on failure

- Run filer and mount with -v=1 so hardlink lifecycle logs (V(0):
  create/delete/insert, V(1): read attempts) are captured
- On test failure, automatically dump last 16KB of all process logs
  (master, volume, filer, mount) to test output
- Copy process logs to /tmp/seaweedfs-fuse-logs/ for CI artifact upload
- Update CI workflow to upload SeaweedFS process logs alongside test output

* mount: clone entry for filer flush to prevent uid/gid race

flushMetadataToFiler and flushFileMetadata used entry.GetEntry() which
returns the file handle's live proto entry pointer, then mutated it
in-place via mapPbIdFromLocalToFiler. During the gRPC call window, a
concurrent Lookup (which takes entryLock.RLock but NOT fhLockTable)
could observe filer-side uid/gid (e.g., 0:0) on the file handle entry
and return it to the kernel. The kernel caches these attributes, so
subsequent opens by the local user (uid 501) fail with EACCES.

Fix: proto.Clone the entry before mapping uid/gid for the filer request.
The file handle's live entry is never mutated, so concurrent Lookup
always sees local uid/gid.

This fixes the intermittent "Permission denied" on .git/FETCH_HEAD
after the first git pull on a mount with uid/gid mapping.

* mount: add debug logging for stale lock file investigation

Add V(0) logging to trace the HEAD.lock recreation issue:
- Create: log when O_EXCL fails (file already exists) with uid/gid/mode
- completeAsyncFlush: log resolved path, saved path, dirtyMetadata,
  isDeleted at entry to trace whether async flush fires after rename
- flushMetadataToFiler: log the dir/name/fullpath being flushed

This will show whether the async flush is recreating the lock file
after git renames HEAD.lock → HEAD.

* mount: prevent async flush from recreating renamed .lock files

When git renames HEAD.lock → HEAD, the async flush from the prior
close() can run AFTER the rename and re-insert HEAD.lock into the
meta cache via its CreateEntryRequest response event. The next git
pull then sees HEAD.lock and fails with "File exists".

Fix: add isRenamed flag on FileHandle, set by Rename before waiting
for the pending async flush. The async flush checks this flag and
skips the metadata flush for renamed files (same pattern as isDeleted
for unlinked files). The data pages still flush normally.

The Rename handler flushes deferred metadata synchronously (Case 1)
before setting isRenamed, ensuring the entry exists on the filer for
the rename to proceed. For already-released handles (Case 2), the
entry was created by a prior flush.

* mount: also mark renamed inodes via entry.Attributes.Inode fallback

When GetInode fails (Forget already removed the inode mapping), the
Rename handler couldn't find the pending async flush to set isRenamed.
The async flush then recreated the .lock file on the filer.

Fix: fall back to oldEntry.Attributes.Inode to find the pending async
flush when the inode-to-path mapping is gone. Also extract
MarkInodeRenamed into a method on FileHandleToInode for clarity.

* mount: skip async metadata flush when saved path no longer maps to inode

The isRenamed flag approach failed for refs/remotes/origin/HEAD.lock
because neither GetInode nor oldEntry.Attributes.Inode could find the
inode (Forget already evicted the mapping, and the entry's stored
inode was 0).

Add a direct check in completeAsyncFlush: before flushing metadata,
verify that the saved path still maps to this inode in the inode-to-path
table. If the path was renamed or removed (inode mismatch or not found),
skip the metadata flush to avoid recreating a stale entry.

This catches all rename cases regardless of whether the Rename handler
could set the isRenamed flag.

* mount: wait for pending async flush in Unlink before filer delete

Unlink was deleting the filer entry first, then marking the draining
async-flush handle as deleted. The async flush worker could race between
these two operations and recreate the just-unlinked entry on the filer.
This caused git's .lock files (e.g. refs/remotes/origin/HEAD.lock) to
persist after git pull, breaking subsequent git operations.

Move the isDeleted marking and add waitForPendingAsyncFlush() before
the filer delete so any in-flight flush completes first. Even if the
worker raced past the isDeleted check, the wait ensures it finishes
before the filer delete cleans up any recreated entry.

* mount: reduce async flush and metadata flush log verbosity

Raise completeAsyncFlush entry log, saved-path-mismatch skip log, and
flushMetadataToFiler entry log from V(0) to V(3)/V(4). These fire for
every file close with writebackCache and are too noisy for normal use.

* filer: reduce hardlink debug log verbosity from V(0) to V(4)

HardLinkId logs in filerstore_wrapper, filerstore_hardlink, and
filer_grpc_server fire on every hardlinked file operation (git pack
files use hardlinks extensively) and produce excessive noise.

* mount/filer: reduce noisy V(0) logs for link, rmdir, and empty folder check

- weedfs_link.go: hardlink creation logs V(0) → V(4)
- weedfs_dir_mkrm.go: non-empty folder rmdir error V(0) → V(1)
- empty_folder_cleaner.go: "not empty" check log V(0) → V(4)

* filer: handle missing hardlink KV as expected, not error

A "kv: not found" on hardlink read is normal when the link blob was
already cleaned up but a stale entry still references it. Log at V(1)
for not-found; keep Error level for actual KV failures.

* test: add waitForDir before git pull in FUSE git operations test

After git reset --hard, the FUSE mount's metadata cache may need a
moment to settle on slow CI. The git pull subprocess (unpack-objects)
could fail to stat the working directory. Poll for up to 5s.

* Update git_operations_test.go

* wait

* test: simplify FUSE test framework to use weed mini

Replace the 4-process setup (master + volume + filer + mount) with
2 processes: "weed mini" (all-in-one) + "weed mount". This simplifies
startup, reduces port allocation, and is faster on CI.

* test: fix mini flag -admin → -admin.ui
2026-03-25 20:06:34 -07:00
Chris Lu
0b3867dca3 filer: add structured error codes to CreateEntryResponse (#8767)
* filer: add FilerError enum and error_code field to CreateEntryResponse

Add a machine-readable error code alongside the existing string error
field. This follows the precedent set by PublishMessageResponse in the
MQ broker proto. The string field is kept for human readability and
backward compatibility.

Defined codes: OK, ENTRY_NAME_TOO_LONG, PARENT_IS_FILE,
EXISTING_IS_DIRECTORY, EXISTING_IS_FILE, ENTRY_ALREADY_EXISTS.

* filer: add sentinel errors and error code mapping in filer_pb

Define sentinel errors (ErrEntryNameTooLong, ErrParentIsFile, etc.) in
the filer_pb package so both the filer and consumers can reference them
without circular imports.

Add FilerErrorToSentinel() to map proto error codes to sentinels, and
update CreateEntryWithResponse() to check error_code first, falling back
to the string-based path for backward compatibility with old servers.

* filer: return wrapped sentinel errors and set proto error codes

Replace fmt.Errorf string errors in filer.CreateEntry, UpdateEntry, and
ensureParentDirectoryEntry with wrapped filer_pb sentinel errors (using
%w). This preserves errors.Is() traversal on the server side.

In the gRPC CreateEntry handler, map sentinel errors to the
corresponding FilerError proto codes using errors.Is(), setting both
resp.Error (string, for backward compat) and resp.ErrorCode (enum).

* S3: use errors.Is() with filer sentinels instead of string matching

Replace fragile string-based error matching in filerErrorToS3Error and
other S3 API consumers with errors.Is() checks against filer_pb sentinel
errors. This works because the updated CreateEntryWithResponse helper
reconstructs sentinel errors from the proto FilerError code.

Update iceberg stage_create and metadata_files to check resp.ErrorCode
instead of parsing resp.Error strings. Update SSE-S3 to use errors.Is()
for the already-exists check.

String matching is retained only for non-filer errors (gRPC transport
errors, checksum validation) that don't go through CreateEntryResponse.

* filer: remove backward-compat string fallbacks for error codes

Clients and servers are always deployed together, so there is no need
for backward-compatibility fallback paths that parse resp.Error strings
when resp.ErrorCode is unset. Simplify all consumers to rely solely on
the structured error code.

* iceberg: ensure unknown non-OK error codes are not silently ignored

When FilerErrorToSentinel returns nil for an unrecognized error code,
return an error including the code and message rather than falling
through to return nil.

* filer: fix redundant error message and restore error wrapping in helper

Use request path instead of resp.Error in the sentinel error format
string to avoid duplicating the sentinel message (e.g. "entry already
exists: entry already exists"). Restore %w wrapping with errors.New()
in the fallback paths so callers can use errors.Is()/errors.As().

* filer: promote file to directory on path conflict instead of erroring

S3 allows both "foo/bar" (object) and "foo/bar/xyzzy" (another object)
to coexist because S3 has a flat key space. When ensureParentDirectoryEntry
finds a parent path that is a file instead of a directory, promote it to
a directory by setting ModeDir while preserving the original content and
chunks. Use Store.UpdateEntry directly to bypass the Filer.UpdateEntry
type-change guard.

This fixes the S3 compatibility test failures where creating overlapping
keys (e.g. "foo/bar" then "foo/bar/xyzzy") returned ExistingObjectIsFile.
2026-03-24 17:08:22 -07:00
Chris Lu
c31e6b4684 Use filer-side copy for mounted whole-file copy_file_range (#8747)
* Optimize mounted whole-file copy_file_range

* Address mounted copy review feedback

* Harden mounted copy fast path

---------

Co-authored-by: Copilot <copilot@github.com>
2026-03-23 18:35:15 -07:00
Chris Lu
6bf654c25c fix: keep metadata subscriptions progressing (#8730) (#8746)
* fix: keep metadata subscriptions progressing (#8730)

* test: cancel slow metadata writers with parent context

* filer: ignore missing persisted log chunks
2026-03-23 15:26:54 -07:00
Chris Lu
15f4a97029 fix: improve raft leader election reliability and failover speed (#8692)
* fix: clear raft vote state file on non-resume startup

The seaweedfs/raft library v1.1.7 added a persistent `state` file for
currentTerm and votedFor. When RaftResumeState=false (the default), the
log, conf, and snapshot directories are cleared but this state file was
not. On repeated restarts, different masters accumulate divergent terms,
causing AppendEntries rejections and preventing leader election.

Fixes #8690

* fix: recover TopologyId from snapshot before clearing raft state

When RaftResumeState=false clears log/conf/snapshot, the TopologyId
(used for license validation) was lost. Now extract it from the latest
snapshot before cleanup and restore it on the topology.

Both seaweedfs/raft and hashicorp/raft paths are handled, with a shared
recoverTopologyIdFromState helper in raft_common.go.

* fix: stagger multi-master bootstrap delay by peer index

Previously all masters used a fixed 1500ms delay before the bootstrap
check. Now the delay is proportional to the peer's sorted index with
randomization (matching the hashicorp raft path), giving the designated
bootstrap node (peer 0) a head start while later peers wait for gRPC
servers to be ready.

Also adds diagnostic logging showing why DoJoinCommand was or wasn't
called, making leader election issues easier to diagnose from logs.

* fix: skip unreachable masters during leader reconnection

When a master leader goes down, non-leader masters still redirect
clients to the stale leader address. The masterClient would follow
these redirects, fail, and retry — wasting round-trips each cycle.

Now tryAllMasters tracks which masters failed within a cycle and skips
redirects pointing to them, reducing log spam and connection overhead
during leader failover.

* fix: take snapshot after TopologyId generation for recovery

After generating a new TopologyId on the leader, immediately take a raft
snapshot so the ID can be recovered from the snapshot on future restarts
with RaftResumeState=false. Without this, short-lived clusters would
lose the TopologyId on restart since no automatic snapshot had been
taken yet.

* test: add multi-master raft failover integration tests

Integration test framework and 5 test scenarios for 3-node master
clusters:

- TestLeaderConsistencyAcrossNodes: all nodes agree on leader and
  TopologyId
- TestLeaderDownAndRecoverQuickly: leader stops, new leader elected,
  old leader rejoins as follower
- TestLeaderDownSlowRecover: leader gone for extended period, cluster
  continues with 2/3 quorum
- TestTwoMastersDownAndRestart: quorum lost (2/3 down), recovered
  when both restart
- TestAllMastersDownAndRestart: full cluster restart, leader elected,
  all nodes agree on TopologyId

* fix: address PR review comments

- peerIndex: return -1 (not 0) when self not found, add warning log
- recoverTopologyIdFromSnapshot: defer dir.Close()
- tests: check GetTopologyId errors instead of discarding them

* fix: address review comments on failover tests

- Assert no leader after quorum loss (was only logging)
- Verify follower cs.Leader matches expected leader via
  ServerAddress.ToHttpAddress() comparison
- Check GetTopologyId error in TestTwoMastersDownAndRestart
2026-03-18 23:28:07 -07:00
Jayshan Raghunandan
1f1eac4f08 feat: improve aio support for admin/volume ingress and fix UI links (#8679)
* feat: improve allInOne mode support for admin/volume ingress and fix master UI links

- Add allInOne support to admin ingress template, matching the pattern
  used by filer and s3 ingress templates (or-based enablement with
  ternary service name selection)
- Add allInOne support to volume ingress template, which previously
  required volume.enabled even when the volume server runs within the
  allInOne pod
- Expose admin ports in allInOne deployment and service when
  allInOne.admin.enabled is set
- Add allInOne.admin config section to values.yaml (enabled by default,
  ports inherit from admin.*)
- Fix legacy master UI templates (master.html, masterNewRaft.html) to
  prefer PublicUrl over internal Url when linking to volume server UI.
  The new admin UI already handles this correctly.

* fix: revert admin allInOne changes and fix PublicUrl in admin dashboard

The admin binary (`weed admin`) is a separate process that cannot run
inside `weed server` (allInOne mode). Revert the admin-related allInOne
helm chart changes that caused 503 errors on admin ingress.

Fix bug in cluster_topology.go where VolumeServer.PublicURL was set to
node.Id (internal pod address) instead of the actual public URL. Add
public_url field to DataNodeInfo proto message so the topology gRPC
response carries the public URL set via -volume.publicUrl flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use HTTP /dir/status to populate PublicUrl in admin dashboard

The gRPC DataNodeInfo proto does not include PublicUrl, so the admin dashboard showed internal pod IPs instead of the configured public URL.
Fetch PublicUrl from the master's /dir/status HTTP endpoint and apply it
in both GetClusterTopology and GetClusterVolumeServers code paths.

Also reverts the unnecessary proto field additions from the previous
commit and cleans up a stray blank line in all-in-one-service.yml.

* fix: apply PublicUrl link fix to masterNewRaft.html

Match the same conditional logic already applied to master.html —
prefer PublicUrl when set and different from Url.

* fix: add HTTP timeout and status check to fetchPublicUrlMap

Use a 5s-timeout client instead of http.DefaultClient to prevent
blocking indefinitely when the master is unresponsive. Also check
the HTTP status code before attempting to parse the response body.

* fix: fall back to node address when PublicUrl is empty

Prevents blank links in the admin dashboard when PublicUrl is not
configured, such as in standalone or mixed-version clusters.

* fix: log io.ReadAll error in fetchPublicUrlMap

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-03-18 13:20:55 -07:00
Chris Lu
81369b8a83 improve: large file sync throughput for remote.cache and filer.sync (#8676)
* improve large file sync throughput for remote.cache and filer.sync

Three main throughput improvements:

1. Adaptive chunk sizing for remote.cache: targets ~32 chunks per file
   instead of always starting at 5MB. A 500MB file now uses ~16MB chunks
   (32 chunks) instead of 5MB chunks (100 chunks), reducing per-chunk
   overhead (volume assign, gRPC call, needle write) by 3x.

2. Configurable concurrency at every layer:
   - remote.cache chunk concurrency: -chunkConcurrency flag (default 8)
   - remote.cache S3 download concurrency: -downloadConcurrency flag
     (default raised from 1 to 5 per chunk)
   - filer.sync chunk concurrency: -chunkConcurrency flag (default 32)

3. S3 multipart download concurrency raised from 1 to 5: the S3 manager
   downloader was using Concurrency=1, serializing all part downloads
   within each chunk. This alone can 5x per-chunk download speed.

The concurrency values flow through the gRPC request chain:
  shell command → CacheRemoteObjectToLocalClusterRequest →
  FetchAndWriteNeedleRequest → S3 downloader

Zero values in the request mean "use server defaults", maintaining
full backward compatibility with existing callers.

Ref #8481

* fix: use full maxMB for chunk size cap and remove loop guard

Address review feedback:
- Use full maxMB instead of maxMB/2 for maxChunkSize to avoid
  unnecessarily limiting chunk size for very large files.
- Remove chunkSize < maxChunkSize guard from the safety loop so it
  can always grow past maxChunkSize when needed to stay under 1000
  chunks (e.g., extremely large files with small maxMB).

* address review feedback: help text, validation, naming, docs

- Fix help text for -chunkConcurrency and -downloadConcurrency flags
  to say "0 = server default" instead of advertising specific numeric
  defaults that could drift from the server implementation.
- Validate chunkConcurrency and downloadConcurrency are within int32
  range before narrowing, returning a user-facing error if out of range.
- Rename ReadRemoteErr to readRemoteErr to follow Go naming conventions.
- Add doc comment to SetChunkConcurrency noting it must be called
  during initialization before replication goroutines start.
- Replace doubling loop in chunk size safety check with direct
  ceil(remoteSize/1000) computation to guarantee the 1000-chunk cap.

* address Copilot review: clamp concurrency, fix chunk count, clarify proto docs

- Use ceiling division for chunk count check to avoid overcounting
  when file size is an exact multiple of chunk size.
- Clamp chunkConcurrency (max 1024) and downloadConcurrency (max 1024
  at filer, max 64 at volume server) to prevent excessive goroutines.
- Always use ReadFileWithConcurrency when the client supports it,
  falling back to the implementation's default when value is 0.
- Clarify proto comments that download_concurrency only applies when
  the remote storage client supports it (currently S3).
- Include specific server defaults in help text (e.g., "0 = server
  default 8") so users see the actual values in -h output.

* fix data race on executionErr and use %w for error wrapping

- Protect concurrent writes to executionErr in remote.cache worker
  goroutines with a sync.Mutex to eliminate the data race.
- Use %w instead of %v in volume_grpc_remote.go error formatting
  to preserve the error chain for errors.Is/errors.As callers.
2026-03-17 16:49:56 -07:00
Chris Lu
f4073107cb fix: clean up orphaned needles on remote.cache partial download failure (#8675)
When remote.cache downloads a file in parallel chunks and a gRPC
connection drops mid-transfer, chunks already written to volume servers
were not cleaned up. Since the filer metadata was never updated, these
needles became orphaned — invisible to volume.vacuum and never
referenced by the filer. On subsequent cache cycles the file was still
treated as uncached, creating more orphans each attempt.

Call DeleteUncommittedChunks on the download-error path, matching the
cleanup already present for the metadata-update-failure path.

Fixes #8481
2026-03-17 13:47:54 -07:00
Chris Lu
acea36a181 filer: add conditional update preconditions (#8647)
* filer: add conditional update preconditions

* iceberg: tighten metadata CAS preconditions
2026-03-16 12:33:32 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
c4d642b8aa fix(ec): gather shards from all disk locations before rebuild (#8633)
* fix(ec): gather shards from all disk locations before rebuild (#8631)

Fix "too few shards given" error during ec.rebuild on multi-disk volume
servers. The root cause has two parts:

1. VolumeEcShardsRebuild only looked at a single disk location for shard
   files. On multi-disk servers, the existing local shards could be on one
   disk while copied shards were placed on another, causing the rebuild to
   see fewer shards than actually available.

2. VolumeEcShardsCopy had a DiskId condition (req.DiskId == 0 &&
   len(vs.store.Locations) > 0) that was always true, making the
   FindFreeLocation fallback dead code. This meant copies always went to
   Locations[0] regardless of where existing shards were.

Changes:
- VolumeEcShardsRebuild now finds the location with the most shards,
  then gathers shard files from other locations via hard links (or
  symlinks for cross-device) before rebuilding. Gathered files are
  cleaned up after rebuild.
- VolumeEcShardsCopy now only uses Locations[DiskId] when DiskId > 0
  (explicitly set). Otherwise, it prefers the location that already has
  the EC volume, falling back to HDD then any free location.
- generateMissingEcFiles now logs shard counts and provides a clear
  error message when not enough shards are found, instead of passing
  through to the opaque reedsolomon "too few shards given" error.

* fix(ec): update test to match skip behavior for unrepairable volumes

The test expected an error for volumes with insufficient shards, but
commit 5acb4578a changed unrepairable volumes to be skipped with a log
message instead of returning an error. Update the test to verify the
skip behavior and log output.

* fix(ec): address PR review comments

- Add comment clarifying DiskId=0 means "not specified" (protobuf default),
  callers must use DiskId >= 1 to target a specific disk.
- Log warnings on cleanup failures for gathered shard links.

* fix(ec): read shard files from other disks directly instead of linking

Replace the hard link / symlink gathering approach with passing
additional search directories into RebuildEcFiles. The rebuild
function now opens shard files directly from whichever disk they
live on, avoiding filesystem link operations and cleanup.

RebuildEcFiles and RebuildEcFilesWithContext gain a variadic
additionalDirs parameter (backward compatible with existing callers).

* fix(ec): clarify DiskId selection semantics in VolumeEcShardsCopy comment

* fix(ec): avoid empty files on failed rebuild; don't skip ecx-only locations

- generateMissingEcFiles: two-pass approach — first discover present/missing
  shards and check reconstructability, only then create output files. This
  avoids leaving behind empty truncated shard files when there are too few
  shards to rebuild.

- VolumeEcShardsRebuild: compute hasEcx before skipping zero-shard locations.
  A location with an .ecx file but no shard files (all shards on other disks)
  is now a valid rebuild candidate instead of being silently skipped.

* fix(ec): select ecx-only location as rebuildLocation when none chosen yet

When rebuildLocation is nil and a location has hasEcx=true but
existingShardCount=0 (all shards on other disks), the condition
0 > 0 was false so it was never promoted to rebuildLocation.
Add rebuildLocation == nil to the predicate so the first location
with an .ecx file is always selected as a candidate.
2026-03-14 20:59:47 -07:00