Commit Graph

13504 Commits

Author SHA1 Message Date
Chris Lu
00a2e22478 fix(mount): remove fid pool to stop master over-allocating volumes (#9111)
* fix(mount): remove fid pool to stop master over-allocating volumes

The writeback-cache fid pool pre-allocated file IDs with
ExpectedDataSize = ChunkSizeLimit (typically 8+ MB). The master's
PickForWrite charges count * expectedDataSize against the volume's
effectiveSize, so a full pool refill could charge hundreds of MB
against a single volume before any bytes were actually written.
That tripped RecordAssign's hard-limit path and eagerly removed
volumes from writable, causing the master to grow new volumes
even when the real data being written was tiny.

Drop the pool entirely. Every chunk upload goes through
UploadWithRetry -> AssignVolume with no ExpectedDataSize hint,
letting the master fall back to the 1 MB default estimate. The
mount->filer grpc connection is already cached in pb.WithGrpcClient
(non-streaming mode), so per-chunk AssignVolume is a unary RPC
over an existing HTTP/2 stream, not a full dial. Path-based
filer.conf storage rules now apply to mount chunk assigns again,
which the pool had to skip.

Also remove the now-unused operation.UploadWithAssignFunc and its
AssignFunc type.

* fix(upload): populate ExpectedDataSize from actual chunk bytes

UploadWithRetry already buffers the full chunk into `data` before
calling AssignVolume, so the real size is known. Previously the
assign request went out with ExpectedDataSize=0, making the master
fall back to the 1 MB DefaultNeedleSizeEstimate per fid — same
over-reservation symptom the pool had, just smaller per call.

Stamp ExpectedDataSize = len(data) before the assign RPC when the
caller hasn't already set it. This covers mount chunk uploads,
filer_copy, filersink, mq/logstore, broker_write, gateway_upload,
and nfs — all the UploadWithRetry paths.

* fix(assign): pass real ExpectedDataSize at every assign call site

After removing the mount fid pool, per-chunk AssignVolume calls went
out with ExpectedDataSize=0, making the master fall back to its 1 MB
DefaultNeedleSizeEstimate. That's still an over-estimate for small
writes. Thread the real payload size through every remaining assign
site so RecordAssign charges effectiveSize accurately and stops
prematurely marking volumes full.

- filer: assignNewFileInfo now takes expectedDataSize and stamps it
  on both primary and alternate VolumeAssignRequests. Callers pass:
  - SSE data-to-chunk: len(data)
  - copy manifest save: len(data)
  - streamCopyChunk: srcChunk.Size
  - TUS sub-chunk: bytes read
  - saveAsChunk (autochunk/manifestize): 0 (small, size unknown
    until the reader is drained; master uses 1 MB default)
- filer gRPC remote fetch-and-write: ExpectedDataSize = chunkSize
  after the adaptive chunkSize is computed.
- ChunkedUploadOption.AssignFunc gains an expectedDataSize parameter;
  upload_chunked.go passes the buffered dataSize at the call site.
  S3 PUT assignFunc stamps it on the AssignVolumeRequest.
- S3 copy: assignNewVolume / prepareChunkCopy take expectedDataSize;
  all seven call sites pass the source chunk's Size.
- operation.SubmitFiles / FilePart.Upload: derive per-fid size from
  FileSize (average for batched requests, real per-chunk size for
  sequential chunk assigns).
- benchmark: pass fileSize.
- filer append-to-file: pass len(data).

* fix(assign): thread size through SaveDataAsChunkFunctionType

The saveAsChunk path (autochunk, filer_copy, webdav, mount) ran
AssignVolume before the reader was drained, so it had to pass
ExpectedDataSize=0 and fall back to the master's 1 MB default.

Add an expectedDataSize parameter to SaveDataAsChunkFunctionType.
- mergeIntoManifest already has the serialized manifest bytes, so
  it passes uint64(len(data)) directly.
- Mount's saveDataAsChunk ignores the parameter because it uses
  UploadWithRetry, which already stamps len(data) on the assign
  after reading the payload.
- webdav and filer_copy saveDataAsChunk follow the same UploadWithRetry
  path and also ignore the hint.
- Filer's saveAsChunk (used for manifestize) plumbs the value to
  assignNewFileInfo so manifest-chunk assigns get a real size.

Callers of saveFunc-as-value (weedfs_file_sync, dirty_pages_chunked)
pass the chunk size they're about to upload.
2026-04-16 15:51:13 -07:00
Chris Lu
7916e61c08 fix(mount): avoid self-notify deadlock in Link and CopyFileRange handlers (#9110)
The Link and CopyFileRange FUSE request handlers were calling
fuseServer.InodeNotify (and EntryNotify for copy) synchronously while
the kernel was still waiting for the request's reply on the same
/dev/fuse fd. Notifications share that fd, so the syscall.Write can
block indefinitely when the kernel hasn't drained its queue yet,
hanging the entire mount. A goroutine dump from a stuck mount showed
the Link handler blocked in syscall.Write inside InodeNotify while the
server's read loop kept waiting for new requests.

Drop the synchronous notifies. The local meta cache is still updated
inline, so subsequent filesystem ops see the fresh state; the kernel's
attr/dentry caches re-fetch once their TTL expires.
2026-04-16 14:09:32 -07:00
Chris Lu
ecc0390795 fix(master): eagerly remove volume from writable when assign hits limit (#9108)
* fix(master): eagerly remove volume from writable when RecordAssign hits limit

Previously, a volume was only removed from the writable list by the
heartbeat-driven CollectDeadNodeAndFullVolumes pass, which runs every
pulse (5s) after a 5s heartbeat. Under sustained concurrent writes,
fio-style workloads observed in the field grew volumes 8-20x past the
configured 100MB limit (median 530MB, peak 1.98GB) during that
5-15s detection window.

RecordAssign already tracks effective size (reported + pending) on each
/dir/assign. It now also removes the volume from writable the moment
effectiveSize reaches volumeSizeLimit, and mirrors the activeVolumeCount
decrement that Topology.SetVolumeCapacityFull would have done on the
next heartbeat. The heartbeat path remains unchanged and idempotent
(vl.SetVolumeCapacityFull returns false if already removed, so no
double-decrement).

Recovery still works: if a heartbeat later reports size < limit and
the volume is not oversized, EnsureCorrectWritables adds it back.

- weed/topology/volume_layout.go: RecordAssign returns reachedCapacity
  bool; adds AdjustActiveVolumeCountForFull helper.
- weed/topology/topology.go: PickForWrite invokes the decrement on
  eager full transitions.
- TestPickForWrite: pass a 1024-byte hint instead of 0 so the default
  1MB pendingDelta does not immediately bust the test's 32KB limit.
- New TestRecordAssignReachingCapacityRemovesFromWritable covers the
  eager removal, active count accounting, and no-double-accounting.

* fix(master): recover eagerly-removed volume once decay clears pending

After RecordAssign eagerly removes a volume from writables because
effectiveSize reached the limit, decay can later bring effectiveSize
back under the limit (e.g., when a burst of assigns didn't all result
in uploads). Without recovery the volume would stay non-writable until
vacuum or a ReadOnly flip.

UpdateVolumeSize now re-adds the volume to writables once all of the
following hold:

  * RecordAssign is what removed it (tracked via fullSince timestamp)
  * at least capacityRecoveryDelay has elapsed since the removal (30s)
    — this prevents bouncing during a steady stream of assigns near
    the limit
  * effectiveSize has decayed below the crowded threshold (90% of limit)
  * reportedSize is under the limit (actual disk is not over)
  * standard EnsureCorrectWritables preconditions: enough copies, all
    copies writable, not oversized

The caller (SyncDataNodeRegistration) re-increments activeVolumeCount
symmetrically with the decrement done on eager removal.

* review: release VolumeLayout lock before UpAdjustDiskUsageDelta

adjustActiveVolumeCount held vl.accessLock across the tree-climbing
UpAdjustDiskUsageDelta walk. That walk takes per-level DiskUsages
locks and could be re-entered from other call paths that hold a
node-level lock and then acquire vl.accessLock. Copy the node list
under the VolumeLayout lock and release it before the tree walk to
eliminate the lock-ordering hazard.
2026-04-16 12:50:30 -07:00
Chris Lu
40ffc73aa8 ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes (#9106)
* ci(pjdfstest): cache docker layers via GHA to avoid apt mirror flakes

Replace the local buildx cache + manual fallback with docker/setup-buildx-action
and docker/build-push-action using type=gha cache. The e2e and pjdfstest Dockerfile
layers now persist across runs in GitHub's own cache backend, so apt-get update
only hits Ubuntu mirrors when the Dockerfiles change. Also add Acquire::Retries
and Timeout so first-run cache-miss builds survive transient mirror sync errors.

* ci(pjdfstest): use local registry to share e2e image across buildx builds

The docker-container buildx driver cannot see images loaded into the host
Docker daemon, so the second build's FROM chrislusf/seaweedfs:e2e failed with
"not found" on registry-1.docker.io.

Run a local registry:2 on the runner, push both images to localhost:5000,
remap the FROM via build-contexts so the Dockerfile stays unchanged, then
tag the pulled images locally for docker compose to consume.
2026-04-16 12:50:19 -07:00
Chris Lu
ff4f96c71f fix(filer): drop stale master gRPC cache on stream death (#9102) (#9107)
* fix(filer): drop stale master gRPC cache on stream death (#9102)

When the master server restarts behind a stable L4 endpoint (e.g. a
Kubernetes ClusterIP Service), the filer's streaming KeepConnected
channel detects the disconnect and reconnects, but the shared
request-path ClientConn cached in pb.grpcClients can remain in READY
state while actually being dead. New AssignVolume/LookupVolume calls
reuse that cached channel and return `rpc error: code = Canceled desc
= context canceled` for every request, until the filer pod is
restarted.

- Expose pb.InvalidateGrpcConnection(address) to drop a cached
  ClientConn when a higher-level signal says it is stale.
- In MasterClient.tryConnectToMaster, invalidate the cached
  request-path channel whenever the KeepConnected stream returns, so
  unrelated callers dial fresh on their next RPC.
- Extend operation.Assign's retry predicate to cover Canceled and
  DeadlineExceeded while the caller context is still live: the first
  failure invalidates the stale ClientConn via
  shouldInvalidateConnection, and the retry dials a new channel.

* fix(grpc): invalidate cached peer conn on streaming death in other paths

Extends the master-client fix to the other streaming-caller + cached
non-streaming-peer pairs that share the same stale-channel failure mode
when the peer restarts behind a stable L4 endpoint (k8s Service VIP,
external load balancer):

- pb.FollowMetadata (s3, mount, webdav, mq broker, filer remote gateway,
  etc. → filer): invalidate the filer's cached ClientConn when the
  SubscribeMetadata stream returns an error.
- filer.MetaAggregator.loopSubscribeToOneFiler (filer → peer filer):
  invalidate the peer's cached ClientConn after doSubscribeToOneFiler
  fails, so the next iteration's readFilerStoreSignature / updateOffset
  calls dial fresh.
- mq sub_client.onEachPartition and doKeepConnectedToSubCoordinator
  (subscriber → broker): invalidate the broker's cached ClientConn when
  the SubscribeMessage / SubscriberToSubCoordinator stream errors.
- mq broker.BrokerConnectToBalancer (broker → broker-balancer):
  invalidate the balancer's cached ClientConn after the
  PublisherToPubBalancer stream errors.

* address review feedback on InvalidateGrpcConnection

- pb.InvalidateGrpcConnection: drop the cache entry under grpcClientsLock
  but call ClientConn.Close() after releasing the lock, so Close's
  internal synchronisation/IO doesn't serialise unrelated callers on the
  global map lock.
- wdclient.tryConnectToMaster: only invalidate the cached request-path
  channel when the streaming call returned an error. On a healthy leader
  redirect (gprcErr == nil) the cached channel is still usable and
  invalidating it just causes a needless re-dial from concurrent callers.

* refactor(grpc): centralize peer-conn invalidation in streaming path

Previously every streaming caller duplicated the same invalidate-cached-
non-streaming-peer-conn wrapper around their WithGrpcClient(true, ...)
call. Move that logic into WithGrpcClient itself: when the streaming
fn returns an error, invalidate any cached ClientConn for the same
address. This removes six near-identical call-site wrappers and gives
every current and future streaming caller the fix by default.

Also aligns the non-streaming branch with the new Invalidate helper's
lock discipline: delete the cache entry under grpcClientsLock, then
Close the ClientConn after releasing the lock.
2026-04-16 12:10:25 -07:00
Chris Lu
9896eade51 feat(mount): set FOPEN_KEEP_CACHE on re-open of unchanged files (#9097)
* feat(mount): set FOPEN_KEEP_CACHE when file mtime is unchanged

On re-open of an unmodified file, signal the kernel to preserve its
existing page cache. This eliminates redundant volume server reads for
workloads that repeatedly open-read-close the same files (build systems,
config readers, etc.).

* fix(mount): use guarded type assertion for openMtimeCache load

Use the two-value form of type assertion when loading from sync.Map
to prevent potential panics if a non-int64 value is ever stored.

* fix(mount): skip redundant mtime store and invalidate on truncation

- Avoid redundant sync.Map Store when cached mtime already matches
  the current mtime, reducing contention on the hot open path.
- Invalidate openMtimeCache in SetAttr when file size changes
  (truncation), preventing stale kernel page cache after ftruncate.

* fix(mount): use nanosecond mtime precision and bounded cache for FOPEN_KEEP_CACHE

- Compare both Mtime (seconds) and MtimeNs (nanoseconds) to detect
  sub-second modifications common in automated workloads.
- Replace unbounded sync.Map with a bounded map + mutex (8192 entries,
  random eviction when full), following the existing atimeMap pattern.
- Extract applyKeepCacheFlag and invalidateOpenMtimeCache methods for
  clarity and testability.
- Add tests for nanosecond precision and cache eviction.

* fix(mount): invalidate mtime cache in truncateEntry for O_TRUNC consistency

Add invalidateOpenMtimeCache call to truncateEntry so the Create path
with O_TRUNC follows the same explicit invalidation pattern as SetAttr
and Write.
2026-04-16 11:37:52 -07:00
Chris Lu
ceae01d05b test(kafka): retry ConsumeWithGroup on failed initial join (#9105)
The consumer group resumption flow in TestOffsetManagement occasionally fails
with `read tcp ... i/o timeout` on the second consumer's first FetchMessage.
Re-joining an existing group races with the previous member's LeaveGroup and
session cleanup; the new reader can observe transient coordinator state that
the kafka-go client surfaces as a connection read timeout.

Wrap ConsumeWithGroup in a 3-attempt retry that rebuilds the reader only when
no message was received yet, so a transient join failure doesn't fail the
whole test. Partial progress (at least one message consumed) still returns
immediately, preserving the original error semantics for post-join failures.
2026-04-16 11:31:26 -07:00
Chris Lu
216b52c13a perf(mount): add graduated write backpressure (#9099)
* perf(mount): add graduated write backpressure before buffer cap

Introduce soft (80%) and hard (95%) throttling thresholds in
WriteBufferAccountant. When write buffer usage approaches the cap,
Reserve() inserts brief sleeps to slow writers gradually rather than
blocking them completely at the cap. This smooths out write latency
under sustained load.

* fix(mount): address review feedback for graduated backpressure

- Use projected usage (used + n) for threshold checks so a single
  reservation crossing a threshold is throttled immediately.
- Widen no-throttle test timing tolerance to softThrottleDelay to
  avoid CI flakes from scheduling jitter.
- Replace fragile timing upper-bound in recovery test with
  counter-based invariant (hardThrottleCount stays at 1).

* docs(mount): clarify single-shot throttle design in WriteBufferAccountant

Add a comment explaining why graduated throttling runs once per Reserve
call rather than inside the blocking loop: once at the cap, the evictor
+ cond.Wait mechanism frees actual capacity, which time-based sleeps
cannot do.
2026-04-16 11:30:23 -07:00
Chris Lu
886d50a6a5 feat(mount): singleflight dedup for concurrent chunk reads (#9100)
* feat(mount): add singleflight deduplication for concurrent chunk reads

When multiple FUSE readers request the same uncached chunk concurrently,
only one network fetch is performed. Other readers wait and share the
downloaded data, reducing redundant volume server traffic under parallel
read workloads.

* fix(util): make singleflight panic-safe with defer cleanup

If the provided function panics, the WaitGroup and map entry are now
cleaned up via defer, preventing other waiters from hanging forever.

* fix(filer): remove singleflight from reader_cache to fix buffer ownership

The singleflight wrapper around chunk fetches returned the same []byte
buffer to concurrent callers. Since each SingleChunkCacher owns and
frees its data buffer in destroy(), sharing the same slice would cause
a use-after-free or double-free with the mem allocator.

The downloaders map already deduplicates in-flight downloads for the
same fileId, so the singleflight was redundant at this layer. The
SingleFlightGroup utility is retained for use elsewhere.
2026-04-16 10:18:05 -07:00
Chris Lu
e1fa4ec756 perf(cache): drop OS page cache after disk cache reads (#9098)
* perf(cache): drop OS page cache after disk cache reads

After reading from the on-disk chunk cache, advise the kernel via
FADV_DONTNEED to release the corresponding page cache pages. This
prevents double-caching the same data in both user-space and kernel
page caches, freeing RAM for other uses on systems with large disk
caches.

* fix(cache): guard dropReadCache against zero length and invalid fd

A zero-length fadvise is interpreted as "to end of file" on Linux,
which would inadvertently drop the page cache for the entire remainder
of the cache volume. Also check fd >= 0 to avoid unnecessary syscalls
when the backend file is closed.

* perf(cache): only apply FADV_DONTNEED for reads >= 1 MiB

For small needle reads the syscall overhead outweighs the memory
savings, and the kernel page cache is more beneficial for warm data.
Restrict fadvise to reads of at least 1 MiB where the freed page
cache is meaningful.
2026-04-16 09:38:42 -07:00
Chris Lu
213b6c3107 fix(mount): count manifest sizes in merge condition to prevent accumulation (#9101)
* fix(mount): count manifest sizes in merge condition to prevent manifest accumulation

shouldMergeChunks only counted non-manifest chunk sizes toward the bloat
threshold, so overlapping manifests accumulated undetected across flush
cycles.

During sustained random writes (e.g. fio), each metadata flush compacts
non-manifest chunks and may group them into a new manifest via
MaybeManifestize, while carrying forward all existing manifests. Since
the merge condition only checked non-manifest totals against 2x file
size, the growing pile of redundant manifests never triggered a merge.

In a real workload: 4 GB file, 25 manifests each covering ~4 GB
(107 GB manifest data on volume servers), but shouldMergeChunks saw only
4.2 GB of non-manifest chunks vs the 8.6 GB threshold -- no merge.

Fix: include manifest coverage sizes in totalChunkSize. This correctly
detects the 111 GB total vs 8.6 GB threshold and triggers the merge,
which re-reads the file as clean chunks and lets the filer's MinusChunks
(which resolves manifests) garbage-collect all redundant sub-chunks.

* test(mount,filer): add manifest chunk correctness tests

Filer-level tests (filechunk_manifest_test.go):
- TestManifestRoundTripPreservesChunks: create -> serialize -> deserialize
  preserves fileId, offset, size, and timestamp for all sub-chunks
- TestCompactResolvedOverlappingManifests: two manifests covering the same
  range, older sub-chunks correctly identified as garbage after compaction
- TestDoMinusChunksWithResolvedManifests: old manifest sub-chunks detected
  as garbage when compared against new clean chunks (mirrors filer cleanup)
- TestManifestizeSmallBatchWithRemainder: batch=3 with 7 chunks produces
  2 manifests + 1 remainder, all data recoverable after resolve
- TestCompactMultipleOverlappingManifestGenerations: 5 generations of
  full-file manifests, compaction keeps only the newest generation
- TestManifestBloatDetection: with N overlapping manifests, merge triggers
  at N >= 2 (stored > 2x file size)

Mount-level test (weedfs_file_sync_test.go):
- TestFlushCycleManifestAccumulation: simulates the flushMetadataToFiler
  pipeline over multiple cycles with a small manifest batch (5 chunks).
  Verifies merge triggers at cycle 2 when accumulated manifests exceed
  the 2x threshold. Uses SeparateManifestChunks + CompactFileChunks +
  shouldMergeChunks to mirror the real code path.
2026-04-16 03:33:08 -07:00
Chris Lu
f9df187928 fix(mount): reduce chunk fragmentation from random writes (#9096)
* fix(mount): reduce chunk fragmentation from random writes

Random writes via FUSE mount create many small partially-overlapping
chunks that bloat storage and slow reads.  Two complementary fixes:

1. Fill gaps at seal time: when a partial writable chunk is sealed for
   upload, read existing file data into the unwritten regions so
   SaveContent uploads one complete chunk instead of many fragments.
   No overhead for sequential writes (IsComplete short-circuits).

2. Merge on fsync: after CompactFileChunks, if total non-manifest
   chunk data exceeds 2x the logical file size, re-read and re-upload
   the entire file as properly-sized chunks. The filer's cleanupChunks
   garbage-collects all superseded chunks.

* revert FillGaps: reading from volume servers during write path is too expensive

* test(mount): add tests for chunk merge condition

Extract shouldMergeChunks as a testable function and add tests covering:
- empty file, single chunk, non-overlapping chunks (no merge)
- exactly 2x boundary (no merge) vs just over 2x (merge)
- manifest chunks extend file size but don't count toward stored total
- input slices are not mutated (safe append)
- CompactFileChunks + condition: fully superseded, staggered overlaps,
  75% overlap pattern, many 4K writes at 1K step
- randomized bloat detection (50 iterations)
- visible content preserved after compaction (100 iterations)

* fix(mount): cap merge buffer at file size for small files

Avoids allocating a full ChunkSizeLimit buffer when the file being
merged is smaller. Addresses PR review feedback.
2026-04-16 02:02:24 -07:00
Chris Lu
44ab2ffee3 fix(plugin): remove Min Volume Age field from vacuum plugin worker config (#9095)
* fix(plugin): remove Min Volume Age field from vacuum plugin worker config

The min_volume_age_seconds setting is not needed as a user-configurable
field in the plugin worker form. The detection logic continues to use
the hardcoded default from NewDefaultConfig().

* fix(plugin): disable volume age filtering in plugin worker detection

Set MinVolumeAgeSeconds to 0 in deriveVacuumConfig so the plugin worker
does not filter out volumes by age during detection.

* fix(plugin): remove all volume age filtering from plugin worker detection

Remove age-related fields from detection trace messages, activity
reports, and per-volume diagnostics. The plugin worker now only
filters by garbage threshold.
2026-04-16 01:35:12 -07:00
Chris Lu
d7865909ba feat(mount): proactive flush of idle writable chunks (#9094)
* feat(mount): proactive flush of idle writable chunks

Add a background goroutine that periodically scans writable chunks
across all open file handles and seals those that are idle and
unlikely to receive further writes, submitting them for async upload.

A writable chunk is proactively flushed when it has been idle for
500ms AND meets one of: nearly full (>=90%), behind the sequential
write frontier by 2+ chunks, or stale for 5+ seconds. Flushing only
happens when the upload pipeline has spare capacity (< half of
concurrent writer slots in use).

This prevents partial chunks from accumulating until fsync/close,
which is particularly beneficial for bursty or small-file workloads
where chunks may never reach IsComplete().

Also fixes a latent bug in ActivityScore where MarkRead/MarkWrite
used value receivers, silently discarding all mutations.

* refactor(mount): reuse WriterPattern instead of duplicating sequential detection

Remove the isSequential atomic from UploadPipeline. The proactive
flusher now reads IsSequentialMode() from the existing WriterPattern
on PageWriter and passes it as a parameter to ProactiveFlush. This
avoids duplicating the sequential/random detection that WriterPattern
already maintains.

* fix(mount): address PR review feedback

- Make ActivityScore thread-safe using atomics (CAS loop for score
  updates, atomic load/swap for timestamp). Previously MarkRead was
  called under RLock while MarkWrite held a write lock, creating a
  data race on the shared fields.

- Fix ProactiveFlush half-capacity guard: use multiplication
  (uploaderCount*2 >= max) instead of floor division (max/2) which
  misbehaves for odd or small concurrentWriterMax values.

* fix(mount): review fixes for proactive flush

- Fix TOCTOU race in lastWriteChunkIndex update: use CAS loop so
  concurrent writers cannot regress the frontier.
- Remove unused UploaderCount() getter.
- Reuse the caller-provided tsNs instead of calling time.Now() again
  in WriteDataAt for lastWriteTsNs, eliminating a redundant syscall
  per write.
2026-04-16 00:44:24 -07:00
Chris Lu
46b801aedb fix(admin): list all masters and dedupe EC file counts in dashboard (#9093)
* fix(admin): list all masters and dedupe EC file counts in dashboard

Dashboard -> Master Nodes only ever showed the currently connected master
because getMasterNodesStatus hard-coded a single entry. Replace it with a
RaftListClusterServers call that returns every master in the raft group and
tags the real leader, falling back to the current master only if the raft
call fails.

Buckets -> Object Store Buckets could render 0 objects for a bucket backed
by an EC volume. Every shard holder reports the same whole-volume
file_count (read from the replicated .ecx), so the first-seen value wins;
if that first node had not yet finished loading .ecx it reported 0 and
pinned the aggregate at 0. Take the max across reporting nodes instead.

The dashboard header total_files also dropped after volumes were converted
to erasure coding because getTopologyViaGRPC never folded EC file_count
into topology.TotalFiles. Aggregate it with the same max/sum dedupe.

* fix(admin): address PR review comments

- bound RaftListClusterServers with a 3s timeout so the dashboard endpoint
  cannot hang on a stalled master
- pre-validate raft addresses with net.SplitHostPort before calling
  pb.GrpcAddressToServerAddress, which otherwise glog.Fatalf's on a
  malformed entry and would crash the admin process
- when raft is unreachable, mark the fallback master as not-leader rather
  than claiming leadership the code cannot verify
- warn when summed EC delete_count exceeds file_count while folding into
  topology.TotalFiles, matching collectCollectionStats

* fix(admin): distinguish empty raft response from RPC failure

When RaftListClusterServers returns successfully with no servers, raft is
not initialized (standalone/non-raft cluster), so the single fallback
master is the leader. Only treat the fallback as a non-leader when the
RPC actually failed.

* fix(admin): remove misleading Objects column from S3 buckets page

The bucket "Objects" column displayed needle counts from volume
collection stats, not actual S3 object counts. This is confusing
because a single S3 object can span multiple needles (multipart
uploads, versions) and the count is inaccurate for EC volumes.

Remove the ObjectCount field from S3Bucket, the Objects table column,
the sort-by-objects handler, the detail-view row, and both CSV export
references.

* fix(admin): correct cell indexes in fallback bucket CSV export

After the Objects column was removed, the fallback CSV exporter in
admin.js still used stale cell indexes: cells[1] mapped to Owner
(not Created), cells[2] to Created (not Size), cells[3] to Logical
Size (not Quota). Align all indexes with the current table column
order and include Owner, Logical Size, and Physical Size.
2026-04-15 22:28:54 -07:00
Chris Lu
dfecd664f9 fix(master): do not re-enter warmup when a fresh cluster grows its first volume (#9092)
* fix(master): do not re-enter warmup when a fresh cluster grows its first volume

Follow-up investigation on #8777. After fixing the writable-chunk cap
deadlock, a second issue surfaced on the same fio reproducer: the
weed mount would log thousands of

  upload data X: filerGrpcAddress assign volume: assign volume failure
    count:1 path:"/test2/X": assign volume: rpc error: code = Canceled
    desc = grpc: the client connection is closing

cascading from the filer's handler. The mount's own cached gRPC
connection to the filer was never invalidated — the "client connection
is closing" text is the FILER's cached connection to the MASTER going
away, and the message is forwarded verbatim in the filer's
AssignVolumeResponse.Error.

Root cause: Topology.IsWarmingUp checks the *live* GetMaxVolumeId. On
a fresh cluster the master is initialized with SetLastLeaderChangeTime
(master_server.go:239 "Seed the warmup timestamp so IsWarmingUp() is
active even if the leader change event hasn't fired yet"), but the
MaxVolumeId==0 guard is supposed to short-circuit IsWarmingUp for
bootstraps so there is no wait. That guard breaks the moment the
first volume is grown inside the warmup window: MaxVolumeId flips
from 0 to 1, the lastLeaderChangeTime is still within 3*pulse (15 s
default), and IsWarmingUp retroactively returns true for the next
several seconds. Every AssignVolume in that window returns
codes.Unavailable, which trips the filer's shouldInvalidateConnection
guard and tears down its cached master connection, which in turn
surfaces as "client connection is closing" to every concurrent
in-flight call from the mount's file-close flush storm. fio reports
EIO on close and the user sees a thousand scary error lines in the
mount log for an otherwise correct run.

Fix: snapshot `hadVolumesAtLeaderChange` inside SetLastLeaderChangeTime
and read that snapshot in IsWarmingUp instead of the live MaxVolumeId.
A fresh cluster snapshots "no volumes at leader change" → IsWarmingUp
stays false through the entire warmup window regardless of how fast
the first grow lands. A real leader transition on a populated cluster
still snapshots "has volumes" → IsWarmingUp behaves exactly as before
until the 3*pulse window closes. The lock ordering in
SetLastLeaderChangeTime reads MaxVolumeId before taking the
lastLeaderChangeTimeLock so the two calls cannot interleave weirdly;
IsWarmingUp reads both fields under a single RLock acquisition.

Verified with the same containerized reproducer used for the deadlock
fix: 4 jobs × 250 nrfiles × 40 MiB × 4k randwrite direct.
  - baseline (master): fio rc=1 (EIO on close), 2809 mount error
    lines matching "filerGrpcAddress assign volume ... client
    connection is closing", 788 filer lines matching "warming up",
    331 "Removing cached gRPC connection to ...19333 due to error:
    master is warming up".
  - patched: fio rc=0 in 1.9 s at 79.2 MiB/s, 0 mount errors,
    0 filer "warming up" lines, 0 cached-master-conn invalidations.
go test ./weed/topology/... passes.

* fix(topology): make NodeImpl.maxVolumeId atomic

CodeRabbit flagged on #9092 that my new IsWarmingUp snapshot
(hadVolumesAtLeaderChange) reads through GetMaxVolumeId(), which until
now returned an unprotected int field on NodeImpl. UpAdjustMaxVolumeId
is called from the volume server heartbeat path in parallel with
GetMaxVolumeId reads on the assign path, and neither side had any
synchronization — a long-standing data race the race detector would
flag if the warmup test suite stressed both sides.

Switch maxVolumeId to atomic.Uint32 (needle.VolumeId is uint32) and
implement UpAdjustMaxVolumeId as a CAS loop so the check-then-set
stays linearizable: two heartbeats racing to promote the field will
land the higher value deterministically and propagate to the parent
exactly once. GetMaxVolumeId is a single atomic load. Callers of both
helpers are unchanged; the struct field comment documents why the
atomic is necessary.

go test -race ./weed/topology/... passes.

* refactor(topology): use WarmupDuration helper in IsWarmingUp

Gemini review on #9092 flagged that IsWarmingUp re-derives the
warmup duration from pulse and WarmupPulseMultiplier instead of
using the existing WarmupDuration() helper, which RemainingWarmupDuration
already uses. Fold the duration calculation through the helper and
short-circuit on lastChange.IsZero() before the time.Since call.
No behavior change.
2026-04-15 14:06:23 -07:00
Chris Lu
bdebd63f7c fix(mount): evict writable chunks when writeBufferSizeMB cap is reached (#9091)
* fix(mount): evict writable chunks when writeBufferSizeMB cap is reached

Reported on #8777: fio 4k randwrite with --nrfiles=1000 on a weed mount
hangs indefinitely after ~1 second of activity, volume server QPS
drops to zero, and the test never makes progress beyond a few hundred
writes. A scaled-down local repro (4 jobs x 250 nrfiles x 40 MiB with
-chunkSizeLimitMB=32 -writeBufferSizeMB=10240) reproduced the hang
exactly: fio stuck in fio_state=SETTING_UP, mount idle, all writer
goroutines parked in page_writer.(*WriteBufferAccountant).Reserve
with n=0x2000000 (32 MiB).

Root cause: UploadPipeline.SaveDataAt reserves a full chunkSize slot
against the global WriteBufferAccountant the first time it allocates
a writable chunk for a given file. The reservation is released by
SealedChunk.FreeReference only after the chunk's async upload
completes, and chunks only move from writable to sealed when they
fill or when the file is flushed/closed. For 4k random writes with
small per-file data and iodepth holding files open, writable chunks
never fill and never close, so their slots are pinned forever. Once
openFiles * chunkSize exceeds writeBufferSizeMB, every new writer
blocks in Reserve's cond.Wait with no possible waker, a hard deadlock.
The user-visible symptoms (volume QPS=0, waited 30min, still running)
are exactly that state. In the reported config, 8 jobs * 1000 nrfiles
* 32 MiB = 256 GB of reservations against a 10 GB cap.

Fix: add a SetEvictor hook on WriteBufferAccountant. When Reserve
would otherwise block on the cond, it single-flights an evictor
callback that walks wfs.fhMap, picks the first open file handle with
a writable chunk, and force-seals its fullest writable chunk via a
new UploadPipeline.EvictOneWritableChunk. Force-sealing submits the
chunk for async upload on the existing uploader pool; the upload's
SealedChunk.FreeReference releases the global slot, broadcasts on
the accountant cond, and unblocks the waiter. The fullest-first
heuristic matches the existing over-limit path in SaveDataAt and
keeps us from thrashing on repeatedly re-sealing half-empty chunks.

The original #9066 motivation (cap global write-pipeline growth so
swap can't be filled while uploads stall) is preserved: Reserve
still blocks when the cap is actually exhausted by in-flight
uploads, and a failed evictor (no writable chunks to seal) falls
through to cond.Wait exactly as before.

Verified end-to-end in a privileged debian container running a
full master + volume + filer + weed mount stack. Against the
reporter's config shrunk to the same memory footprint (4 jobs x
250 nrfiles x 40 MiB, -chunkSizeLimitMB=32 -writeBufferSizeMB=10240):
  - baseline binary (master): fio stalls after ~320 writes, killed
    by timeout at 400s with io=1280 KiB and bw=2383 B/s.
  - patched binary: fio completes in 1.9 s at 82.8 MiB/s, all
    40,000 writes issued, 156 MiB written, rc=0.
page_writer unit tests still pass.

Touches: WriteBufferAccountant (SetEvictor + single-flight in
Reserve), UploadPipeline.EvictOneWritableChunk, DirtyPages
interface + the two existing implementers (ChunkedDirtyPages,
PageWriter), and a new weedfs_write_buffer_evict.go holding the
WFS.evictOneWritableChunk callback that walks fhMap.

* fix(mount): re-check write-budget cap after evict and make eviction panic-safe

Addresses two review concerns on #9091:

1. CodeRabbit flagged that the original Reserve loop only considered the
   break condition when the evictor reported success:

     if evicted {
         if a.used+n <= a.cap || a.used == 0 {
             break
         }
     }
     a.cond.Wait()

   A concurrent Release firing during the evictor's execution window (we
   drop a.mu around the callback so uploader goroutines can drain) would
   land its Broadcast on an empty waiter set and then be missed, because
   we unconditionally call cond.Wait() on the same iteration. In practice
   an in-flight async upload eventually fires another broadcast so this
   would delay rather than permanently hang — but the logic is wrong
   either way and should re-check the cap after every evictor round.

2. Gemini flagged that setting `a.evicting = true` and dropping `a.mu`
   without a deferred unwind leaves the accountant in a broken state if
   the evictor panics: the flag stays set forever and no Reserve caller
   can ever run another eviction, even though the mutex gets unlocked by
   the normal stack unwind.

Pull the evictor call into a tiny helper runEvictorLocked whose defer
re-acquires a.mu and clears a.evicting regardless of panic status, then
broadcasts so other blocked Reservers re-evaluate. Reserve checks the
cap condition immediately after the helper returns; only falls through
to cond.Wait() when the budget is still exhausted. No behavior change on
the happy path; the fix is in the error and race corners.

* doc(mount): clarify eviction-scan cost on the blocked-Reserve path

Gemini review on #9091 flagged that the fhMap scan inside
evictOneWritableChunk is O(open files). Document why the cost is
bounded to the Reserve-blocked path and paid at most once per
chunkSize drained, rather than on the write hot path, and record the
reason we do not preempt with an LRU.
2026-04-15 14:06:06 -07:00
Chris Lu
2fd60cfbc3 fix(balance): guard against destination overshoot and oscillation (#9090)
* fix(balance): guard against destination overshoot and oscillation

Plugin-worker volume_balance detection re-selects maxServer/minServer
each iteration based on utilization ratio. With heterogeneous
MaxVolumeCount values, a single greedy move can flip which server is
most-utilized, causing A->B, B->A oscillation within one detection
cycle and pushing destinations past the cluster ideal.

Mirror the shell balancer's per-move guard
(weed/shell/command_volume_balance.go:440): before scheduling a move,
verify that the destination's post-move utilization would not strictly
exceed the source's post-move utilization. If it would, no single move
can improve balance, so stop.

Add regression tests that cover:
- TestDetection_HeterogeneousMax_NoOvershootNoOscillation: 2 servers
  with different caps just above threshold; detection must not
  oscillate or make the imbalance worse.
- TestDetection_RespectsClusterIdealUtilization: 3-server heterogeneous
  layout; destinations must not overshoot cluster ideal.

* fix(balance): use effective capacity when resolving destination disk

resolveBalanceDestination read VolumeCount directly from the topology
snapshot, which is not updated when AddPendingTask registers a move
within the current detection cycle. This meant multiple moves planned
in a single cycle all saw the same static count and could target the
same disk past its effective capacity.

Switch to ActiveTopology.GetNodeDisks + GetEffectiveAvailableCapacity
so that destination planning accounts for all pending and assigned
tasks affecting the disk — consistent with how the detection loop
already tracks effectiveCounts at the server level.

Add a unit test that seeds two pending balance tasks against a
destination disk with 2 free slots and asserts resolveBalanceDestination
rejects a third planned move.

* fix(ec_balance): capacity-weighted guard in Phase 4 global rebalance

detectGlobalImbalance picked min/max nodes by raw shard count and
compared them against a simple (unweighted) rack-wide average. With
heterogeneous MaxVolumeCount across nodes in the same rack, this lets
the greedy algorithm move shards from a large, barely-used node to a
small, nearly-full node just because the small node has fewer shards
in absolute terms — strictly worsening imbalance by utilization and
potentially overfilling the small node.

Snapshot each node's total shard capacity (current shards plus free
slots) at loop start and add a per-move convergence guard: reject any
move where the destination's post-move utilization would strictly
exceed the source's post-move utilization. Mirrors the fix in
weed/worker/tasks/balance/detection.go.

Regression test TestDetectGlobalImbalance_HeterogeneousCapacity covers
a rack with node1 (cap 100, 10 shards → 10% util) and node2 (cap 5,
3 shards → 60% util). Before the fix, Phase 4 moves 2 shards from
node1 to node2, filling node2 to 100% util. After the fix, the guard
blocks both moves.

* fix(ec_balance): utilization-based max/min in Phase 4 rebalance

Phase 4's global rebalancer picked source and destination nodes by raw
shard count, and compared against a simple raw-count average. With
heterogeneous MaxVolumeCount across nodes in a rack, this got the
direction wrong: a large-capacity node holding many shards in absolute
terms but only a small fraction of its capacity would be picked as the
"overloaded" source, while a small-capacity node nearly at its slot
limit (but holding fewer absolute shards) would be picked as the
"underloaded" destination. The previous fix added a strict-improvement
guard that prevented the bad move but left balance untouched — the
rack stayed in an uneven state.

Switch to utilization-based selection and a utilization-based pre-check:
- Pick max/min by (count / capacity), where capacity is the node's
  current allowed shards plus remaining free slots (snapshotted once
  per rack and held constant for the duration of the loop).
- Replace the raw-count imbalance gate (exceedsImbalanceThreshold) with
  a new exceedsUtilImbalanceThreshold helper that compares fractional
  fullness. The raw-count gate is still used by Phase 2 and Phase 3,
  where the per-rack / per-volume semantics differ.
- Drop the raw-count guards (maxCount <= avgShards || minCount+1 >
  avgShards and maxCount-minCount <= 1) now that the per-move
  strict-improvement check handles termination correctly for both
  homogeneous and heterogeneous capacity.

Also fix a latent bug in the inner shard-selection loop: it was not
updating shardBits between iterations, so every iteration picked the
same lowest-set bit and emitted duplicate move requests for the same
physical shard. Update maxNode and minNode's shardBits immediately
after appending a move, mirroring what applyMovesToTopology does
between phases.

Update TestDetectGlobalImbalance_HeterogeneousCapacity to assert:
- Moves flow from the higher-util node2 to the lower-util node1
  (direction check), and
- Each (volumeID, shardID) pair appears at most once in the move list
  (duplicate-shard guard).

* fix(ec_balance): keep source freeSlots in sync after planned shard moves

All three phase loops that plan EC shard moves (detectCrossRackImbalance,
detectWithinRackImbalance, detectGlobalImbalance) decrement the
destination node's freeSlots but leave the source node's freeSlots
stale. Over the course of a detection run that processes many volumes
or iterates within a rack, the source's reported freeSlots drifts
below its actual value.

In Phase 4 specifically, the per-move strict-improvement guard prevents
the source from becoming a destination candidate, so the stale value
never affects decisions. In Phases 2 and 3 it can: a node that sheds
shards for one volume's rebalance is eligible as a destination for
another volume in the same run, and the destination selection uses
node.freeSlots <= 0 as a hard skip (findDestNodeInUnderloadedRack /
findLeastLoadedNodeInRack). A tightly-provisioned node could be
skipped as a destination even after it has freed slots.

Increment maxNode.freeSlots / node.freeSlots symmetrically at each
scheduled move so freeSlots remains an accurate running view of
available slot capacity throughout a detection run.
2026-04-15 12:47:59 -07:00
Chris Lu
979c54f693 fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals

Raft leader is advertised as host:httpPort.grpcPort, but clients dial
host:httpPort. Raw string comparison against VolumeLocation.Leader /
HeartbeatResponse.Leader therefore never matches, causing the
masterclient and the volume server heartbeat loop to continuously
"redirect" to the already-connected master, tearing down the stream
and reconnecting.

Use ServerAddress.Equals, which normalizes the grpc-port suffix.

* fix(filer,mq): compare ServerAddress via Equals in two more sites

filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local
partition assignment check both compared a wire-supplied address string
against the local self ServerAddress with raw string equality. Both are
vulnerable to the same plain-vs-host:port.grpcPort mismatch as the
masterclient/volume heartbeat sites: filer would bootstrap from itself,
and the broker would fail to claim a partition it was actually assigned.
Route both through ServerAddress.Equals.

* fix(master,shell): more ServerAddress comparisons via Equals

- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would
  skip the child-lock check on the real leader when the two carry
  different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the
  Leader() == Name() guard for ensureTopologyId could disagree with
  topology.IsLeader() (which already uses Equals), so leader-only
  initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared
  user-supplied host:port against NewServerAddressFromDataNode(...) with
  ==, letting an existing replica slip through when topology carries
  the embedded gRPC port.

All routed through pb.ServerAddress.Equals.

* fix(mq,cluster): more ServerAddress comparisons via Equals

- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the
  partition ownership check gated listing on raw
  LeaderBroker == BrokerAddress().String(), so listings silently omitted
  partitions hosted locally when the assignment carried the other
  host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback
  guard both used raw string equality against configured filer
  addresses (which may be plain host:port while LockHostMovedTo comes
  back suffixed), causing spurious host-change churn and blocking the
  seed-filer fallback.

* fix(mq): more ServerAddress comparisons via Equals

- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct
  activeBrokers.Get() lookup missed brokers when a persisted assignment
  carried a different address encoding than the registered broker key,
  triggering a bogus reassignment on every read/write cycle. Added a
  findActiveBroker helper that falls back to an Equals-based scan and
  canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between
  LockOwner() and BrokerAddress().String(), so a lock owner could fail
  to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only
  on exact LeaderBroker match, so an encoding flip in lookup results
  tore down and recreated a stream to the same broker.
2026-04-15 12:29:31 -07:00
Lars Lehtonen
d292d3640d chore(weed/mq/kafka/schema): remove unused functions (#9088) 2026-04-15 10:23:12 -07:00
dependabot[bot]
2c460e0fc9 build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest (#9082)
build(deps): bump org.apache.kafka:kafka-clients

Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2.

---
updated-dependencies:
- dependency-name: org.apache.kafka:kafka-clients
  dependency-version: 3.9.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-14 21:54:21 -07:00
Chris Lu
d0a09ea178 fix(s3): honor ChecksumAlgorithm on presigned URL uploads (#9076)
* fix(s3): honor ChecksumAlgorithm on presigned URL uploads

AWS SDK presigners hoist x-amz-sdk-checksum-algorithm (and related
checksum headers) into the signed URL's query string, so servers must
read either location. detectRequestedChecksumAlgorithm only looked at
request headers, so presigned PUTs with ChecksumAlgorithm set validated
and stored no additional checksum, and HEAD/GET never returned the
x-amz-checksum-* header.

Read these parameters from headers first, then fall back to a
case-insensitive query-string lookup. Apply the same fallback when
comparing an object-level checksum value against the computed one.

Fixes #9075

* test(s3): presigned URL checksum integration tests (#9075)

Adds test/s3/checksum with end-to-end coverage for flexible-checksum
behavior on presigned URL uploads. Tests generate a presigned PUT URL
with ChecksumAlgorithm set, upload the body with a plain http.Client
(bypassing AWS SDK middleware so the server must honor the query-string
hoisted x-amz-sdk-checksum-algorithm), then HEAD/GET with
ChecksumMode=ENABLED and assert the stored x-amz-checksum-* header.

Covers SHA256, SHA1, and a negative control with no checksum requested.
Wires the new directory into s3-go-tests.yml as its own CI job.

* perf(s3): parse presigned query once in detectRequestedChecksumAlgorithm

Previously, each header fallback called getHeaderOrQuery, which re-parsed
r.URL.Query() and allocated a new map on every invocation — up to eight
times per PutObject request. Parse the raw query at most once per request
(only when non-empty) and pass the pre-parsed url.Values into a new
lookupHeaderOrQuery helper.

Also drops a redundant strings.ToLower allocation in the case-insensitive
query key scan (strings.EqualFold already handles ASCII case folding).

Addresses review feedback from gemini-code-assist on PR #9076.

* test(s3): honor credential env vars and add presigned upload timeout

- init() now reads S3_ACCESS_KEY/S3_SECRET_KEY (and AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY / AWS_REGION fallbacks) so that
  `make test-with-server ACCESS_KEY=... SECRET_KEY=...` no longer
  authenticates with hardcoded defaults while the server has been
  started with different credentials.
- uploadViaPresignedURL uses a dedicated http.Client with a 30s timeout
  instead of http.DefaultClient, so a stalled server fails fast in CI
  instead of blocking until the suite's global timeout fires.

Addresses review feedback from coderabbitai on PR #9076.

* test(s3): pass S3_PORT and credentials through to checksum tests

- 'make test' now exports S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY
  derived from the Makefile variables so the Go test process talks to
  the same endpoint/credentials that start-server was launched with.
- start-server cleans up the background SeaweedFS process and PID file
  when the readiness poll times out, preventing stale port conflicts on
  subsequent runs.

Addresses review feedback from coderabbitai on PR #9076.

* ci(s3): raise checksum tests step timeout

make test-with-server builds weed_binary, waits up to 90s for readiness,
then runs go test -timeout=10m. The previous 12-minute step timeout only
had ~2 minutes of headroom over the Go timeout, risking the Actions
runner killing the step before tests reported a real failure.

Bumps the job timeout from 15 to 20 minutes and the step timeout from
12 to 16 minutes, matching other S3 integration jobs.

Addresses review feedback from coderabbitai on PR #9076.

* perf(s3): thread pre-parsed query through putToFiler hot path

Parse the request's query string once at the top of putToFiler and
reuse the resulting url.Values for both the checksum-algorithm detection
and the expected-checksum verification. Previously, the verification
path called getHeaderOrQuery which re-parsed r.URL.Query() again on
every PutObject, defeating the previous commit's single-parse goal.

- Add parseRequestQuery + detectRequestedChecksumAlgorithmQ (the
  pre-parsed-query variant). detectRequestedChecksumAlgorithm is now a
  thin wrapper used by callers that do a single lookup.
- putToFiler parses once and threads the result through both call sites.
- Remove getHeaderOrQuery and update the unit test to use
  lookupHeaderOrQuery directly.

Addresses follow-up review from gemini-code-assist on PR #9076.

* test(s3): check io.ReadAll error in uploadViaPresignedURL helper

* test(s3): drop SHA1 presigned test case

The AWS SDK v2 presigner signs a Content-MD5 header at presign time
for SHA1 PutObject requests even when no body is attached (the MD5 of
the empty payload gets baked into the signed headers). Uploading the
real body via a plain http.Client then trips SeaweedFS's MD5 validation
and returns BadDigest — an SDK/presigner quirk, not a SeaweedFS bug.

The SHA256 positive case already exercises the server-side
query-hoisted algorithm path that issue #9075 is about, and the unit
tests in weed/s3api cover each algorithm's header mapping. Drop the
SHA1 integration case rather than chase SDK-specific workarounds.

* test(s3): provide real Content-MD5 to presigned checksum test

AWS SDK v2's flexible-checksum middleware signs a Content-MD5 header at
presign time. There is no body to hash at that point, so it seeds the
header with MD5 of the empty payload. When the real body is then PUT
with a plain http.Client, SeaweedFS's server-side Content-MD5
verification correctly rejects the upload with BadDigest.

Pre-compute the MD5 of the test body and thread it into
PutObjectInput.ContentMD5 so the signed Content-MD5 matches the body
that will actually be uploaded. The test still exercises the
server-side path that reads X-Amz-Sdk-Checksum-Algorithm from the
query string (the fix that PR #9076 is validating).

* test(s3): send the signed Content-MD5 header on presigned upload

uploadViaPresignedURL now accepts an extraHeaders map so callers can
thread through headers that the presigner signed but the raw http
request would otherwise omit. The SHA256 test passes the Content-MD5
it computed, matching what the presigner baked into the signature.

Fixes SignatureDoesNotMatch seen in CI after the previous commit set
ContentMD5 on the presign input without sending the corresponding
header on the actual upload.

* test(s3): build presigned URL with the raw v4 signer

The AWS SDK v2 s3.PresignClient runs the flexible-checksum middleware
for any PutObject input that carries ChecksumAlgorithm. That middleware
injects a Content-MD5 header at presign time, and with no body present
it seeds MD5-of-empty. Any subsequent upload of a non-empty body
through a plain http.Client then trips SeaweedFS's Content-MD5
verification and returns BadDigest — not the code path that issue
#9075 is about.

Replace the PresignClient usage in the integration test with a direct
call to v4.Signer.PresignHTTP, building a canonical URL whose query
string already contains x-amz-sdk-checksum-algorithm=SHA256. This is
exactly the shape of URL a browser/curl client would receive from any
presigner that hoists the algorithm header, and it exercises the
server-side fix from PR #9076 without dragging in SDK-specific
middleware quirks.

* test(s3): set X-Amz-Expires on presigned URL before signing

v4.Signer.PresignHTTP does not add X-Amz-Expires on its own — the
caller has to seed it into the request's query string so the signer
includes it in the canonical query and the server accepts the
presigned URL. Without it, SeaweedFS correctly returns
AuthorizationQueryParametersError.

Also adds a .gitignore for the make-managed test volume data, log
file, and PID file so local `make test-with-server` runs do not leave
artifacts tracked by git.

Verified by running the integration tests locally:
  make test-with-server → both presigned checksum tests PASS.
2026-04-14 21:52:49 -07:00
Chris Lu
08d9193fe1 [nfs] Add NFS (#9067)
* add filer inode foundation for nfs

* nfs command skeleton

* add filer inode index foundation for nfs

* make nfs inode index hardlink aware

* add nfs filehandle and inode lookup plumbing

* add read-only nfs frontend foundation

* add nfs namespace mutation support

* add chunk-backed nfs write path

* add nfs protocol integration tests

* add stale handle nfs coverage

* complete nfs hardlink and failover coverage

* add nfs export access controls

* add nfs metadata cache invalidation

* fix nfs chunk read lookup routing

* fix nfs review findings and rename regression

* address pr 9067 review comments

- filer_inode: fail fast if the snowflake sequencer cannot start, and let
  operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
  to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
  a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
  after CommitTransaction so a rolled-back rename does not strand live
  metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
  filer.path, so the experimental server does not expose the entire
  filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
  layout rather than RFC 1813 LINK3args

* mount: pre-allocate inode in Mkdir and Symlink

Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.

createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.

* sequence: mask snowflake node id on int→uint32 conversion

CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.

* add test/nfs integration suite

Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.

Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.

Framework notes:

- Picks ephemeral ports with net.Listen("127.0.0.1:0") and passes
  -port.grpc explicitly so the default port+10000 convention cannot
  overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
  starting the NFS server — the NFS server's ensureIndexedEntry check
  requires the export root to exist with a real entry, which filer.Root
  does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
  not try to re-dial via portmapper (which concatenates ":111" onto the
  address).

* ci: add NFS integration test workflow

Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.

* nfs: use append for buffer growth in Write and Truncate

The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.

* filer: honor caller cancellation in collectInodeIndexEntries

Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.

* nfs: skip filer read on open when O_TRUNC is set

openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.

* nfs: allow Seek on O_APPEND files and document buffered write cap

Two related cleanups on filesystem.go:

- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
  Seek error ("append-only file descriptors may only seek to EOF")
  prevented read-and-write workloads that legitimately reposition the
  read cursor. Write already snaps the offset to EOF before persisting
  (see seaweedFile Write), so Seek can unconditionally accept any
  offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
  per-file ceiling, the memory footprint it implies, and that the real
  fix for larger whole-file rewrites is streaming / multi-chunk support.

Both changes flagged by gemini-code-assist on PR 9067.

* nfs: guard offset before casting to int in Write

CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.

* nfs: compute Write endOffset in int64 to satisfy CodeQL

The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.

* ci: drop emojis from nfs-tests workflow summary

Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.

* nfs: annotate remaining DEV_PLAN TODOs with status

Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:

- Reuse chunk cache and mutation stream helpers without FUSE deps:
  checked off — the NFS server imports weed/filer.ReaderCache and
  weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
  as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
  annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
  go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
  "Current Blockers" note.

* test/nfs: share port+readiness helpers with test/testutil

Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.

Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.

* nfs: stream writes to volume servers instead of buffering the whole file

Before this change the NFS write path held the full contents of every
writable open in memory:

  - OpenFile(write) called loadWritableContent which read the existing
    file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
  - each Write() extended content in-place
  - Close() uploaded the whole buffer as a single chunk via
    persistContent + AssignVolume

The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:

  - openFile(write) no longer loads the existing content at all; it
    only issues an UpdateEntry when O_TRUNC is set *and* the file is
    non-empty (so a fresh create+trunc is still zero-RPC)
  - Write() streams the caller's bytes straight to a volume server via
    one AssignVolume + one chunk upload, then atomically appends the
    resulting chunk to the filer entry through mutateEntry. Any
    previously inlined entry.Content is migrated to a chunk in the same
    update so the chunk list becomes the authoritative representation.
  - Truncate() becomes a direct mutateEntry (drop chunks past the new
    size, clip inline content, update FileSize) instead of resizing an
    in-memory buffer.
  - Close() is a no-op because everything was flushed inline.

The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.

Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).

* filer: extract shared gateway upload helper for NFS and WebDAV

Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.

- NFS's saveDataAsChunk is now a thin adapter that assembles the
  GatewayChunkUploadRequest from server options and calls the helper.
  The chunkUploader interface keeps working for test injection because
  the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
  local operation.NewUploader call plus the AssignVolume/UploadOption
  scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
  features that do not fit the shared helper (a pre-allocated file-id
  pool used to skip AssignVolume entirely, and a chunkCache
  write-through at offset 0 so future reads hit the mount's local
  cache), both of which are mount-specific.

Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.

* nfs: remove DESIGN.md and DEV_PLAN.md

The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
2026-04-14 20:48:24 -07:00
Chris Lu
2818251dd5 fix(iceberg): clean stale data before creating a table (#9074) (#9077)
* fix(iceberg): clean stale data before creating a table

CREATE TABLE AS from Trino fails against the SeaweedFS Iceberg REST
catalog with "Cannot create a table on a non-empty location". The
catalog assigns every new table the deterministic <ns>/<table> path,
and Trino's pre-write check rejects the CTAS whenever leftover objects
live there — typically files from a prior DROP that did not purge the
data, or an earlier aborted CTAS.

Make the catalog the authority for table existence: before writing any
metadata, look up the table in the S3 Tables catalog. If it is already
registered, return the existing definition (idempotent create). If it
is not registered, any objects still sitting at the target location
are stale, so purge them before proceeding. Live tables are never
touched — the cleanup path is guarded by the catalog lookup.

Fixes #9074

* test(trino): regression for create/drop/recreate table (#9074)

Exercises the exact sequence from the reported bug: CREATE TABLE without
an explicit location, INSERT, DROP, then CREATE again with the same name,
followed by a CTAS on top. Previously the recreate failed with
"Cannot create a table on a non-empty location" because stale data files
from the dropped table lingered at the deterministic <schema>/<table>
path. The test also asserts the recreated table does not see the dropped
data and that a drop/recreate CTAS cycle works.

* fix(iceberg): purge dropped table location on DROP TABLE

Recreate-at-same-location was failing with "Cannot create a table on a
non-empty location" because DROP TABLE only removed the catalog entry
and left data files behind. Trino's pre-write emptiness check then
rejected the subsequent CREATE.

Look up the table's storage location before deleting the catalog entry,
and after a successful DeleteTable, purge the filer subtree at that
location. The lookup uses the catalog — the authoritative owner of the
name→location mapping — so cleanup is gated on a live table having
existed; failed lookups skip cleanup and leave storage alone.

Also pin the regression test to an explicit, fixed location so the
recreate genuinely targets the same path the drop was supposed to free.

* refactor(iceberg): use errors.As in isNoSuchTableError

* fix(iceberg): drop create-preflight cleanup; harden cleanup path

- Remove the destructive cleanupStaleTableLocation call from the
  CreateTable preflight. Storage cleanup now lives exclusively on the
  DROP path, so CreateTable has no side effects on storage beyond the
  new table's own metadata write.
- Validate tablePath in cleanupStaleTableLocation by rejecting empty,
  ".", "..", or backslash-bearing segments before joining with the
  bucket prefix, so a crafted location cannot escape the bucket
  subtree. path.Clean would silently collapse traversal, so segments
  are checked raw.

* test(trino): defer table drops in recreate test for failure-safe cleanup
2026-04-14 20:34:53 -07:00
Chris Lu
2d9441726d fix(helm): skip s3 ServiceMonitor when only filer.s3 is enabled (#9081)
* fix(helm): skip s3 ServiceMonitor when only filer.s3 is enabled (#9080)

The seaweedfs-s3 Service only exposes a "metrics" port when the standalone
s3 gateway is enabled. With filer.s3.enabled=true and s3.enabled=false the
Service only has swfs-s3:8333, so the generated ServiceMonitor matched zero
targets and fired persistent no-targets alerts. The embedded filer S3
gateway's metrics are already scraped via the filer ServiceMonitor.

* comment: drop issue ref
2026-04-14 19:04:51 -07:00
Chris Lu
c2f5db3a02 perf(filer.sync): don't serialize descendants behind dir attribute updates (#9079)
* perf(filer.sync): don't serialize descendants behind dir attribute updates

The MetadataProcessor treated every in-flight directory job as a subtree
barrier: any active dir job at /foo forced all file events under /foo to
wait, and because the admit loop runs on the single stream.Recv()
goroutine, a stalled descendant also stalled the whole gRPC stream. For
large directories this turned every attribute-only dir event (mtime /
xattr / chmod bumps) into a full-subtree pinch point.

Classify dir jobs as barrier (create / delete / rename) vs non-barrier
(filer_pb.IsUpdate on a directory — same parent and same name, i.e. an
in-place attribute update). Only barrier dirs block descendants and get
blocked by ancestor barrier dirs. Non-barrier dir updates still bump the
ancestor descendantCount, so an incoming barrier dir on an ancestor
still waits for them — preserving the "delete /a waits for in-flight
/a/b update" safety.

Tests cover the loosened cases and the preserved barriers:
non-barrier update doesn't block a file descendant, barrier create
still does, barrier delete still waits for in-flight descendants, and
a barrier ancestor still waits for a non-barrier descendant update.

* fix(filer.sync): serialize same-path barrier dir jobs against concurrent ops

Review (Gemini) flagged that pathConflicts had latent same-path gaps
that predated this PR but deserve fixing alongside the dir-conflict
loosening: two barrier dir jobs at the same path could run concurrently
(e.g. create /a and delete /a), and a file job at the same path as an
in-flight barrier dir wasn't blocked either.

Tighten pathConflicts so that:
- an active barrier dir at p blocks every incoming job at p (file,
  barrier dir, or non-barrier attribute update) — same-path promotions,
  renames, and delete/create collisions must serialize;
- an active file at p blocks incoming files and barrier dirs at p;
- non-barrier dir updates at the same path still overlap with each
  other (attribute bumps are last-writer-wins, intentional).

TestDirVsDirConflict and TestFileUnderActiveDirConflict flip their
"same path does not conflict" assertions to match. New
TestSamePathBarrierSerialization covers all five same-path cases
explicitly.

* fix(filer.sync): serialize incoming barrier dir against same-path non-barrier update

Bug introduced by the previous same-path tightening commit and caught
in review (CodeRabbit, critical): a kindNonBarrierDir at /dir1 was not
indexed at its own path, so a later kindBarrierDir at /dir1 saw neither
activeBarrierDirPaths["/dir1"] nor descendantCount["/dir1"] (the latter
only counts strict descendants) and was admitted concurrently with the
in-flight attribute update. That violated the "barrier at p serializes
all work at p" rule.

Track non-barrier dir jobs in a new activeNonBarrierDirPaths map and
check it only from the incoming-barrier-dir branch of pathConflicts.
The map is deliberately invisible to the ancestor check, so non-barrier
updates still don't serialize file descendants — the loosening this PR
is about stays intact.

Regression test added in TestSamePathBarrierSerialization covers both
the admission conflict and the index cleanup on job completion.
2026-04-14 18:34:05 -07:00
Chris Lu
40c1797f8e fix(s3): allow anonymous ListBuckets with prefix-scoped List action (#9073)
* fix(s3): allow anonymous ListBuckets with prefix-scoped List action

An anonymous identity holding a prefix-scoped action such as
"List:prefix-*" was denied at the auth middleware before ListBucketsHandler
could apply the per-bucket visibility check. The middleware called
CanDo with an empty bucket, which never matches a scoped action, so
every anonymous ListBuckets request returned 403 even though matching
buckets should have been visible.

Defer ListBuckets authorization to the handler for the anonymous
identity when it actually carries a List action, mirroring the
existing behavior for authenticated users. Anonymous identities with
no List action continue to be rejected at the global layer, preserving
the secure-by-default posture.

Fixes #9072

* refactor(s3): make hasListAction a method on Identity

Addresses PR review — consistent with existing CanDo/isAdmin methods and
also treats Admin identities as implicitly having List permission.
2026-04-14 10:52:00 -07:00
Chris Lu
eaf561e86c perf(s3): add optional shared in-memory chunk cache for GET (#9069)
Adds the -s3.cacheCapacityMB flag (default 0, disabled) that attaches
an in-memory chunk_cache.ChunkCacheInMemory to the server-wide
ReaderCache introduced in the previous commit. When enabled,
completed chunks are deposited into the shared cache as they are
downloaded, so concurrent and repeat GETs of the same object hit
memory instead of re-fetching chunks from volume servers.

When 0 (the default) the shared ReaderCache still runs — it just
attaches a nil chunk cache, so behaviour matches the previous commit
exactly. No behaviour change for clusters that don't opt in.

Disk-backed TieredChunkCache was evaluated and rejected: its
synchronous SetChunk writes regressed cold reads ~12x on loopback
because the chunk fetchers block on local disk I/O that is *slower*
than the TCP volume-server fetch it is supposed to accelerate.
Memory-only avoids that.

Flag registered in all four S3 flag sites (s3.go, server.go,
filer.go, mini.go) per the comment on command.S3Options. The chunk
size used to convert CacheSizeMB → entry count is encapsulated in
the s3ChunkCacheChunkSizeMB constant so it's easy to grep and
revisit if the filer default chunk size changes.

Measured on weed mini + 1 GiB random object over loopback, single
curl on a presigned URL:

    cacheCapacityMB=0 (off):  cold ~2900, warm ~2900 MB/s
    cacheCapacityMB=4096:     cold ~2790, warm ~5050 MB/s (+70%)
2026-04-14 09:24:35 -07:00
Chris Lu
7a7f220224 feat(mount): cap write buffer with -writeBufferSizeMB (#9066)
* feat(mount): cap write buffer with -writeBufferSizeMB

Without a bound on the per-mount write pipeline, sustained upload
failures (e.g. volume server returning "Volume Size Exceeded" while
the master hasn't yet rotated assignments) let sealed chunks pile up
across open file handles until the swap directory — by default
os.TempDir() — fills the disk. Reported on 4.19 filling /tmp to 1.8 TB
during a large rclone sync.

Add a global WriteBufferAccountant shared across every UploadPipeline
in a mount. Creating a new page chunk (memory or swap) first reserves
ChunkSize bytes; when the cap is reached the writer blocks until an
uploader finishes and releases, turning swap overflow into natural
FUSE-level backpressure instead of unbounded disk growth.

The new -writeBufferSizeMB flag (also accepted via fuse.conf) defaults
to 0 = unlimited, preserving current behavior. Reserve drops
chunksLock while blocking so uploader goroutines — which take
chunksLock on completion before calling Release — cannot deadlock,
and an oversized reservation on an empty accountant succeeds to avoid
single-handle starvation.

* fix(mount): plug write-budget leaks in pipeline Shutdown

Review on #9066 caught two accounting bugs on the Destroy() path:

1. Writable-chunk leak (high). SaveDataAt() reserves ChunkSize before
   inserting into writableChunks, but Shutdown() only iterated
   sealedChunks. Truncate / metadata-invalidation flows call Destroy()
   (via ResetDirtyPages) without flushing first, so any dirty but
   unsealed chunks would permanently shrink the global write budget.
   Shutdown now frees and releases writable chunks too.

2. Double release with racing uploader (medium). Shutdown called
   accountant.Release directly after FreeReference, while the async
   uploader goroutine did the same on normal completion — under a
   Destroy-before-flush race this could underflow the accountant and
   let later writes exceed the configured cap. Move accounting into
   SealedChunk.FreeReference itself: the refcount-zero transition is
   exactly-once by construction, so any number of FreeReference calls
   release the slot precisely once.

Add regression tests for the writable-leak and the FreeReference
idempotency guarantee.

* test(mount): remove sleep-based race in accountant blocking test

Address review nits on #9066:
- Replace time.Sleep(50ms) proxy for "goroutine entered Reserve" with
  a started channel the goroutine closes immediately before calling
  Reserve. Reserve cannot make progress until Release is called, so
  landed is guaranteed false after the handshake — no arbitrary wait.
- Short-circuit WriteBufferAccountant.Used() in unlimited mode for
  consistency with Reserve/Release, avoiding a mutex round-trip.

* test(mount): add end-to-end write-buffer cap integration test

Exercises the full write-budget plumbing with a small cap (4 chunks of
64 KiB = 256 KiB) shared across three UploadPipelines fed by six
concurrent writers. A gated saveFn models the "volume server rejecting
uploads" condition from the original report: no sealed chunk can drain
until the test opens the gate. A background sampler records the peak
value of accountant.Used() throughout the run.

The test asserts:
  - writers fill the budget and then block on Reserve (Used() stays at
    the cap while stalled)
  - Used() never exceeds the configured cap even under concurrent
    pressure from multiple pipelines
  - after the gate opens, writers drain to zero
  - peak observed Used() matches the cap (262144 bytes in this run)

While wiring this up, the race detector surfaced a pre-existing data
race on UploadPipeline.uploaderCount: the two glog.V(4) lines around
the atomic Add sites read the field non-atomically. Capture the new
value from AddInt32 and log that instead — one-liner each, no
behavioral change.

* test(fuse): end-to-end integration test for -writeBufferSizeMB

Exercise the new write-buffer cap against a real weed mount so CI
(fuse-integration.yml) covers the FUSE→upload-pipeline→filer path, not
just the in-package unit tests. Uses a 4 MiB cap with 2 MiB chunks so
every subtest's total write demand is multiples of the budget and
Reserve/Release must drive forward progress for writes to complete.

Subtests:
- ConcurrentLargeWrites: six parallel 6 MiB files (36 MiB total, ~18
  chunk allocations) through the same mount, verifies every byte
  round-trips.
- SingleFileExceedingCap: one 20 MiB file (10 chunks) through a single
  handle, catching any self-deadlock when the pipeline's own earlier
  chunks already fill the global budget.
- DoesNotDeadlockAfterPressure: final small write with a 30s timeout,
  catching budget-slot leaks that would otherwise hang subsequent
  writes on a still-full accountant.

Ran locally on Darwin with macfuse against a real weed mini + mount:
  === RUN   TestWriteBufferCap
  --- PASS: TestWriteBufferCap (1.82s)

* test(fuse): loosen write-buffer cap e2e test + fail-fast on hang

On Linux CI the previous configuration (-writeBufferSizeMB=4,
-concurrentWriters=4 against a 20 MiB single-handle write)
deterministically hung the "Run FUSE Integration Tests" step to the
45-minute workflow timeout, while on macOS / macfuse the same test
completes in ~2 seconds (see run 24386197483). The Linux hang shows
up after TestWriteBufferCap/ConcurrentLargeWrites completes cleanly,
then TestWriteBufferCap/SingleFileExceedingCap starts and never
emits its PASS line.

Change:
- Loosen the cap to 16 MiB (8 × 2 MiB chunk slots) and drop the
  custom -concurrentWriters override. The subtests still drive demand
  well above the cap (32 MiB concurrent, 12 MiB single-handle), so
  Reserve/Release is still on every chunk-allocation path; the cap
  just gives the pipeline enough headroom that interactions with the
  per-file writableChunkLimit and the go-fuse MaxWrite batching don't
  wedge a single-handle writer on a slow runner.
- Wrap every os.WriteFile in a writeWithTimeout helper that dumps every
  live goroutine on timeout. If this ever re-regresses, CI surfaces
  the actual stuck goroutines instead of a 45-minute walltime.
- Also guard the concurrent-writer goroutines with the same timeout +
  stack dump.

The in-package unit test TestWriteBufferCap_SharedAcrossPipelines
remains the deterministic, controlled verification of the blocking
Reserve/Release path — this e2e test is now a smoke test for
correctness and absence of deadlocks through a real FUSE mount, which
is all it should be.

* fix: address PR #9066 review — idempotent FreeReference, subtest watchdog, larger single-handle test

FreeReference on SealedChunk now early-returns when referenceCounter is
already <= 0. The existing == 0 body guard already made side effects
idempotent, but the counter itself would still decrement into the
negatives on a double-call — ugly and a latent landmine for any future
caller that does math on the counter. Make double-call a strict no-op.

test(fuse): per-subtest watchdog + larger single-handle test

- Add runSubtestWithWatchdog and wrap every TestWriteBufferCap subtest
  with a 3-minute deadline. Individual writes were already
  timeout-wrapped but the readback loops and surrounding bookkeeping
  were not, leaving a gap where a subtest body could still hang. On
  watchdog fire, every live goroutine is dumped so CI surfaces the
  wedge instead of a 45-minute walltime.

- Bump testLargeFileUnderCap from 12 MiB → 20 MiB (10 chunks) to
  exceed the 16 MiB cap (8 slots) again and actually exercise
  Reserve/Release backpressure on a single file handle. The earlier
  e2e hang was under much tighter params (-writeBufferSizeMB=4,
  -concurrentWriters=4, writable limit 4); with the current loosened
  config the pressure is gentle and the goroutine-dump-on-timeout
  safety net is in place if it ever regresses.

Declined: adding an observable peak-Used() assertion to the e2e test.
The mount runs as a subprocess so its in-process WriteBufferAccountant
state isn't reachable from the test without adding a metrics/RPC
surface. The deterministic peak-vs-cap verification already lives in
the in-package unit test TestWriteBufferCap_SharedAcrossPipelines.
Recorded this rationale inline in TestWriteBufferCap's doc comment.

* test(fuse): capture mount pprof goroutine dump on write-timeout

The previous run (24388549058) hung on LargeFileUnderCap and the
test-side dumpAllGoroutines only showed the test process — the test's
syscall.Write is blocked in the kernel waiting for FUSE to respond,
which tells us nothing about where the MOUNT is stuck. The mount runs
as a subprocess so its in-process stacks aren't reachable from the
test.

Enable the mount's pprof endpoint via -debug=true -debug.port=<free>,
allocate the port from the test, and on write-timeout fetch
/debug/pprof/goroutine?debug=2 from the mount process and log it. This
gives CI the only view that can actually diagnose a write-buffer
backpressure deadlock (writer goroutines blocked on Reserve, uploader
goroutines stalled on something, etc).

Kept fileSize at 20 MiB so the Linux CI run will still hit the hang
(if it's genuinely there) and produce an actionable mount-side dump;
the alternative — silently shrinking the test below the cap — would
lose the regression signal entirely.

* review: constructor-inject accountant + subtest watchdog body on main

Two PR-#9066 review fixes:

1. NewUploadPipeline now takes the WriteBufferAccountant as a
   constructor parameter; SetWriteBufferAccountant is removed. In
   practice the previous setter was only called once during
   newMemoryChunkPages, before any goroutine could touch the
   pipeline, so there was no actual race — but constructor injection
   makes the "accountant is fixed at construction time" invariant
   explicit and eliminates the possibility of a future caller
   mutating it mid-flight. All three call sites (real + two tests)
   updated; the legacy TestUploadPipeline passes a nil accountant,
   preserving backward-compatible unlimited-mode behavior.

2. runSubtestWithWatchdog now runs body on the subtest main goroutine
   and starts a watcher goroutine that only calls goroutine-safe t
   methods (t.Log, t.Logf, t.Errorf). The previous version ran body
   on a spawned goroutine, which meant any require.* or writeWithTimeout
   t.Fatalf inside body was being called from a non-test goroutine —
   explicitly disallowed by Go's testing docs. The watcher no longer
   interrupts body (it can't), so body must return on its own —
   which it does via writeWithTimeout's internal 90s timeout firing
   t.Fatalf on (now) the main goroutine. The watchdog still provides
   the critical diagnostic: on timeout it dumps both test-side and
   mount-side (via pprof) goroutine stacks and marks the test failed
   via t.Errorf.

* fix(mount): IsComplete must detect coverage across adjacent intervals

Linux FUSE caps per-op writes at FUSE_MAX_PAGES_PER_REQ (typically
1 MiB on x86_64) regardless of go-fuse's requested MaxWrite, so a
2 MiB chunk filled by a sequential writer arrives as two adjacent
1 MiB write ops. addInterval in ChunkWrittenIntervalList does not
merge adjacent intervals, so the resulting list has two elements
{[0,1M], [1M,2M]} — fully covered, but list.size()==2.

IsComplete previously returned `list.size() == 1 &&
list.head.next.isComplete(chunkSize)`, which required a single
interval covering [0, chunkSize). Under that rule, chunks filled by
adjacent writes never reach IsComplete==true, so maybeMoveToSealed
never fires, and the chunks sit in writableChunks until
FlushAll/close. SaveContent handles the adjacency correctly via its
inline merge loop, so uploads work once they're triggered — but
IsComplete is the gate that triggers them.

This was a latent bug: without the write-buffer cap, the overflow
path kicks in at writableChunkLimit (default 128) and force-seals
chunks, hiding the leak. #9066's -writeBufferSizeMB adds a tighter
global cap, and with 8 slots / 20 MiB test, the budget trips long
before overflow. The writer blocks in Reserve, waiting for a slot
that never frees because no uploader ever ran — observed in the CI
run 24390596623 mount pprof dump: goroutine 1 stuck in
WriteBufferAccountant.Reserve → cond.Wait, zero uploader goroutines
anywhere in the 89-goroutine dump.

Walk the (sorted) interval list tracking the furthest covered
offset; return true if coverage reaches chunkSize with no gaps. This
correctly handles adjacent intervals, overlapping intervals, and
out-of-order inserts. Added TestIsComplete_AdjacentIntervals
covering single-write, two adjacent halves (both orderings), eight
adjacent eighths, gaps, missing edges, and overlaps.

* test(fuse): route mount glog to stderr + dump mount on any write error

Run 24392087737 (with the IsComplete fix) no longer hangs on Linux —
huge progress. Now TestWriteBufferCap/LargeFileUnderCap fails with
'close(...write_buffer_cap_large.bin): input/output error', meaning
a chunk upload failed and pages.lastErr propagated via FlushData to
close(). But the mount log in the CI artifact is empty because weed
mount's glog defaults to /tmp/weed.* files, which the CI upload step
never sees, so we can't tell WHICH upload failed or WHY.

Add -logtostderr=true -v=2 to MountOptions so glog output goes to
the mount process's stderr, which the framework's startProcess
redirects into f.logDir/mount.log, which the framework's DumpLogs
then prints to the test output on failure. The -v=2 floor enables
saveDataAsChunk upload errors (currently logged at V(0)) plus the
medium-level write_pipeline/upload traces without drowning the log
in V(4) noise.

Also dump MOUNT goroutines on any writeWithTimeout error (not just
timeout). The IsComplete fix means we now get explicit errors
instead of silent hangs, and the goroutine dump at the error moment
shows in-flight upload state (pending sealed chunks, retry loops,
etc) that a post-failure log alone can't capture.
2026-04-14 07:47:35 -07:00
Chris Lu
228ed25a01 perf(s3): route GET through ChunkReadAt + per-request ReaderCache (#9068)
perf(s3): route GET through ChunkReadAt + shared ReaderCache

The S3 GET path previously used filer.PrepareStreamContentWithPrefetch,
which hands chunk bytes from the volume-server fetch goroutine to the
consumer through an io.Pipe. io.Pipe is a synchronous rendezvous, so
the prefetch=4 window only overlapped HTTP connection setup — the
actual data bytes still flowed one pipe at a time.

Switch to the same path WebDAV uses (server/webdav_server.go): build
a filer.ChunkReadAt backed by a server-wide filer.ReaderCache.
ReaderCache prefetches whole chunks into []byte buffers, so the
prefetch window translates into real in-flight bytes and the consumer
copies them out as memcpys.

The ReaderCache is server-wide (not per-request) for two reasons:

1. ChunkReadAt.Close() destroys the ReaderCache's downloader map.
   With a per-request cache, the defer on the handler would wait for
   background chunk downloads that run on context.Background() — so
   a client disconnect would block handler cleanup on downloads that
   the client no longer wants, tying up goroutines and memory.

2. Concurrent requests for the same object can share in-flight
   downloads through the shared downloader map.

No persistent ChunkCache is added in this commit — the ReaderCache is
constructed with a nil *chunk_cache.TieredChunkCache (all its methods
are nil-receiver safe). A follow-up PR wires in an in-memory chunk
cache for cross-request warm hits.

JWT for volume-server requests is generated internally by
util_http.RetriedFetchChunkData from jwtSigningReadKey, so the new
path remains compatible with JWT-protected clusters — this is the
same mechanism the WebDAV and mount read paths have been using.

Measured on weed mini + 1 GiB random object over loopback, cold
cache, single-stream curl on a presigned URL:

    before (io.Pipe):        2100-2200 MB/s
    after  (ChunkReadAt):    2900-3800 MB/s
2026-04-14 07:46:05 -07:00
Chris Lu
4bcbe9ded3 ci(helm): publish chart on tag push
Trigger the helm release workflow automatically on tag pushes so each
software release also publishes the chart to gh-pages and the OCI
registry at ghcr.io/seaweedfs. workflow_dispatch is kept as a manual
fallback.

Refs #6296
2026-04-14 02:20:06 -07:00
Chris Lu
9859f5fafc build(docker): upgrade all Alpine packages in final image (#9070)
build(docker): apply full apk upgrade in final image to pick up security patches

Trivy flagged CVE-2026-28390 (libcrypto3/libssl3) on the published image
because the final stage only upgraded zlib. Broaden to `apk upgrade
--no-cache` so all Alpine security fixes land at build time.
2026-04-14 02:08:15 -07:00
dependabot[bot]
ad2aa3135c build(deps): bump rand from 0.9.2 to 0.9.4 in /seaweedfs-rdma-sidecar/rdma-engine (#9065)
build(deps): bump rand in /seaweedfs-rdma-sidecar/rdma-engine

Bumps [rand](https://github.com/rust-random/rand) from 0.9.2 to 0.9.4.
- [Release notes](https://github.com/rust-random/rand/releases)
- [Changelog](https://github.com/rust-random/rand/blob/0.9.4/CHANGELOG.md)
- [Commits](https://github.com/rust-random/rand/compare/rand_core-0.9.2...0.9.4)

---
updated-dependencies:
- dependency-name: rand
  dependency-version: 0.9.4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 22:51:00 -07:00
Chris Lu
34b4ecc631 fix(mount): serialize hard-link mutations on HardLinkId (#9064)
* fix(mount): serialize hard-link mutations on HardLinkId

syncHardLinkSiblings stamps every sibling of a hard-link to
authoritativeEntry.HardLinkCounter, and the caller computes that value
as entry.HardLinkCounter - 1 (Unlink) or entry.HardLinkCounter + 1
(Link) from a cached entry read before the filer mutation. With
concurrent Unlinks on different links of the same file, both callers
observe the same pre-decrement counter, the filer's atomic blob
decrement lands correctly, but both then stamp their siblings to
counter-1 — leaving the mount metacache one higher than the authoritative
blob.

Serialize Link and Unlink on string(HardLinkId) via a new
hardLinkLockTable on WFS, and re-load the entry under the lock so the
second caller sees the updated sibling counter its predecessor just
wrote before computing its own delta. First-link races (empty
HardLinkId on the source) are a separate pre-existing issue and are
not addressed here.

Full pjdfstest suite still passes (235 files, 8803 tests).

* fix(mount): abort on stale pre-lock entry after HardLinkId lock

Review follow-up: if maybeLoadEntry fails after acquiring the
hardLinkLockTable lock, the prior revision silently fell back to the
pre-lock snapshot, reintroducing the stale-base update the lock is
meant to prevent.

- Unlink: treat fuse.ENOENT as success (the file was already removed by
  the thread that held the lock before us) and propagate any other
  error.
- Link: abort with the returned status so we never derive the next
  HardLinkCounter from a stale source entry.

* fix(mount): re-resolve Link source alias under HardLinkId lock

Review follow-up: Link resolved oldEntryPath from in.Oldnodeid before
waiting on the HardLinkId lock. A concurrent Unlink that held the same
lock could remove the specific alias we picked pre-lock while leaving
other sibling hard links for the same inode intact. The post-lock
maybeLoadEntry then returned ENOENT even though the source inode was
still reachable.

Call GetPath(in.Oldnodeid) again under the lock to pick whichever
alias is still active, refresh oldParentPath, and only return ENOENT
if no sibling survived.
2026-04-13 22:39:50 -07:00
Chris Lu
300e906330 admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes

The admin bucket size fix (#9058) left object counts at zero for
EC-encoded data because VolumeEcShardInformationMessage carried no file
count. Billing/monitoring dashboards therefore still under-report
objects once a bucket is EC-encoded.

Thread file_count and delete_count end-to-end:

- Add file_count/delete_count to VolumeEcShardInformationMessage (proto
  fields 8 and 9) and regenerate master_pb.
- Compute them lazily on volume servers by walking the .ecx index once
  per EcVolume, cache on the struct, and keep the cache in sync inside
  DeleteNeedleFromEcx (distinguishing live vs already-tombstoned
  entries so idempotent deletes do not drift the counts).
- Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage
  and carry them through the master-side EcVolumeInfo / topology sync.
- Aggregate in admin collectCollectionStats, deduping per volume id:
  every node holding shards of an EC volume reports the same counts, so
  summing across nodes would otherwise multiply the object count by the
  number of shard holders.

Regression tests cover the initial .ecx walk, live/tombstoned delete
bookkeeping (including idempotent and missing-key cases), and the admin
dedup path for an EC volume reported by multiple nodes.

* ec: include .ecj journal in EcVolume delete count

The initial delete count only reflected .ecx tombstones, missing any
needle that was journaled in .ecj but not yet folded into .ecx — e.g.
on partial recovery. Expand initCountsLocked to take the union of
.ecx tombstones and .ecj journal entries, deduped by needle id, so:

  - an id that is both tombstoned in .ecx and listed in .ecj counts once
  - a duplicate .ecj entry counts once
  - an .ecj id with a live .ecx entry is counted as deleted (not live)
  - an .ecj id with no matching .ecx entry is still counted

Covered by TestEcVolumeFileAndDeleteCountEcjUnion.

* ec: report delete count authoritatively and tombstone once per delete

Address two issues with the previous EcVolume file/delete count work:

1. The delete count was computed lazily on first heartbeat and mixed
   in a .ecj-union fallback to "recover" partial state. That diverged
   from how regular volumes report counts (always live from the needle
   map) and had drift cases when .ecj got reconciled. Replace with an
   eager walk of .ecx at NewEcVolume time, maintained incrementally on
   every DeleteNeedleFromEcx call. Semantics now match needle_map_metric:
   FileCount is the total number of needles ever recorded in .ecx
   (live + tombstoned), DeleteCount is the tombstones — so live =
   FileCount - DeleteCount. Drop the .ecj-union logic entirely.

2. A single EC needle delete fanned out to every node holding a replica
   of the primary data shard and called DeleteNeedleFromEcx on each,
   which inflated the per-volume delete total by the replica factor.
   Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in
   order and stop at the first success (one tombstone per delete), and
   only fall back to other shards when the primary shard has no home
   (ErrEcShardMissing sentinel), not on transient RPC errors.

Admin aggregation now folds EC counts correctly: FileCount is deduped
per volume id (every shard holder has an identical .ecx) and DeleteCount
is summed across nodes (each delete tombstones exactly one node). Live
object count = deduped FileCount - summed DeleteCount.

Tests updated to match the new semantics:
  - EC volume counts seed FileCount as total .ecx entries (live +
    tombstoned), DeleteCount as tombstones.
  - DeleteNeedleFromEcx keeps FileCount constant and increments
    DeleteCount only on live->tombstone transitions.
  - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2)
    to prove they're summed, while FileCount=100 is applied once.

* ec: test fixture uses real vid; admin warns on skewed ec counts

- writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the
  actual vid passed in, instead of hardcoding "_1". The existing tests
  all use vid=1 so behaviour is unchanged, but the helper no longer
  silently diverges from its documented parameter.
- collectCollectionStats logs a glog warning when an EC volume's summed
  delete count exceeds its deduped file count, surfacing the anomaly
  (stale heartbeat, counter drift, etc.) instead of silently dropping
  the volume from the object count.

* ec: derive file/delete counts from .ecx/.ecj file sizes

seedCountsFromEcx walked the full .ecx index at volume load, which is
wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj
has fixed-size deletion records (NeedleIdSize), so both counts are pure
file-size arithmetic.

  fileCount   = ecxFileSize / NeedleMapEntrySize
  deleteCount = ecjFileSize / NeedleIdSize

Rip out the cached counters, countsLock, seedCountsFromEcx, and the
recordDelete helper. Track ecjFileSize directly on the EcVolume struct,
seed it from Stat() at load, and bump it on every successful .ecj append
inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write
entirely when the needle is already tombstoned so the derived delete
count stays idempotent on repeat deletes. Heartbeats now compute counts
in O(1).

Tests updated: the initial fixture pre-populates .ecj with two ids to
verify the file-size derivation end-to-end, and the delete test keeps
its idempotent-re-delete / missing-needle invariants (unchanged
externally, now enforced by the early return rather than a cache guard).

* ec: sync Rust volume server with Go file/delete count semantics

Mirror the Go-side EC file/delete count work in the Rust volume server
so mixed Go/Rust clusters report consistent bucket object counts in
the admin dashboard.

- Add file_count (8) and delete_count (9) to the Rust copy of
  VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto).
- EcVolume gains ecj_file_size, seeded from the journal's metadata on
  open and bumped inside journal_delete on every successful append.
- file_and_delete_count() returns counts derived in O(1) from
  ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and
  ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount.
- to_volume_ec_shard_information_messages populates the new proto
  fields instead of defaulting them to zero.
- mark_needle_deleted_in_ecx now returns a DeleteOutcome enum
  (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip
  both the .ecj append and the size bump when the needle is missing
  or already tombstoned, keeping the derived delete_count idempotent
  on repeat or no-op deletes.
- Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's
  RebuildEcxFile is only called from specific decode/rebuild gRPC
  handlers, not on volume open, and replaying on load was hiding the
  deletion journal from the new file-size-derived delete counter.
  rebuild_ecx_from_journal is kept as dead_code for future decode
  paths that may want the same replay semantics.

Also clean up the Go FileAndDeleteCount to drop unnecessary runtime
guards against zero constants — NeedleMapEntrySize and NeedleIdSize
are compile-time non-zero.

test_ec_volume_journal updated to pre-populate the .ecx with the
needles it deletes, and extended to verify that repeat and
missing-id deletes do not drift the derived counts.

* ec: document enterprise-reserved proto field range on ec shard info

Both OSS master.proto copies now note that fields 10-19 are reserved
for future upstream additions while 20+ are owned by the enterprise
fork. Enterprise already pins data_shards/parity_shards at 20/21, so
keeping OSS additions inside 8-19 avoids wire-level collisions for
mixed deployments.

* ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir

ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but
new() opens the actual files from ecx_actual_dir (which may fall back
to the data dir when the idx dir does not contain the index). After a
fallback, read_deleted_needles() and rebuild_ecx_from_journal() would
read/rebuild the wrong (nonexistent) path while heartbeats reported
counts from the file actually in use — silently dropping deletes.

Point idx_base_name() at ecx_actual_dir, which is initialized to
dir_idx and only diverges after a successful fallback, so every call
site agrees with the file new() has open. The pre-fallback call in
new() (line 142) still returns the dir_idx path because
ecx_actual_dir == dir_idx at that point.

Update the destroy() sweep to build the dir_idx cleanup paths
explicitly instead of leaning on the helpers, so post-fallback stale
files in the idx dir are still removed.

* ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure

Two EC delete-count correctness fixes applied symmetrically to Go and
Rust volume servers.

1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after
   recreating the empty journal, matching the on-disk truth.
   Previously the cached size still reflected the pre-rebuild journal
   and file_and_delete_count() would keep reporting stale delete
   counts. The Go side has no equivalent bug because RebuildEcxFile
   runs in an offline helper that does not touch an EcVolume struct.

2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx
   entry before writing the .ecj record. If the .ecj append then
   failed, the needle was permanently marked deleted but the
   heartbeat-reported delete_count never advanced (it is derived from
   .ecj file size), and a retry would see AlreadyDeleted and early-
   return, leaving the drift permanent.

   Both languages now capture the entry's file offset and original
   size bytes during the mark step, attempt the .ecj append, and on
   failure roll the .ecx tombstone back by writing the original size
   bytes at the known offset. A rollback that itself errors is
   logged (glog / tracing) but cannot re-sync the files — this is
   the same failure mode a double disk error would produce, and is
   unavoidable without a full on-disk transaction log.

Go: wrap MarkNeedleDeleted in a closure that captures the file
offset into an outer variable, then pass the offset + oldSize to the
new rollbackEcxTombstone helper on .ecj seek/write errors.

Rust: DeleteOutcome::Tombstoned now carries the size_offset and a
[u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete
destructures on Tombstoned and calls restore_ecx_size on .ecj append
failure.

* test(ec): widen admin /health wait to 180s for cold CI

TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and
admin in sequence, then waited only 60s for admin's HTTP server to come
up. On cold GitHub runners the tail of the earlier subprocess startups
eats most of that budget and the wait occasionally times out (last hit
on run 24374773031). The local fast path is still ~20s total, so the
bump only extends the timeout ceiling, not the happy path.

* test(ec): fork volume servers in parallel in TestEcEndToEnd

startWeed is non-blocking (just cmd.Start()), so the per-process fork +
mkdir + log-file-open overhead for 14 volume servers was serialized for
no reason. On cold CI disks that overhead stacks up and eats into the
subsequent admin /health wait, which is how run 24374773031 flaked.

Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds
with a mutex so concurrent appends are safe. startWeed still calls
t.Fatalf on failure, which is fine from a goroutine for a fatal test
abort; the fail-fast isn't something we rely on for precise ordering.

* ec: fsync ecx before ecj, truncate on failure, harden rebuild

Four correctness fixes covering both volume servers.

1. Durability ordering (Go + Rust). After marking the .ecx tombstone
   we now fsync .ecx before touching .ecj, so a crash between the two
   files cannot leave the journal with an entry for a needle whose
   tombstone is still sitting in page cache. Once the fsync returns,
   the tombstone is the source of truth: reads see "deleted",
   delete_count may under-count by one (benign, idempotent retries)
   but never over-reports. If the fsync itself fails we restore the
   original size bytes and surface the error. The .ecj append is then
   followed by its own Sync so the reported delete_count matches the
   on-disk journal once the write returns.

2. .ecj truncation on append failure. write_all may have extended the
   journal on disk before sync_all / Sync errors out, leaving the
   cached ecj_file_size out of sync with the physical length and
   drifting delete_count permanently after restart. Both languages
   now capture the pre-append size, truncate the file back via
   set_len / Truncate on any write or sync failure, and only then
   restore the .ecx tombstone. Truncation errors are logged — same-fd
   length resets cannot realistically fail — but cannot themselves
   re-sync the files.

3. Atomic rebuild_ecx_from_journal (Rust, dead code today but wired
   up on any future decode path). Previously a failed
   mark_needle_deleted_in_ecx call was swallowed with `let _ = ...`
   and the journal was still removed, silently losing tombstones.
   We now bubble up any non-NotFound error, fsync .ecx after the
   whole replay succeeds, and only then drop and recreate .ecj.
   NotFound is still ignored (expected race between delete and encode).

4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to
   return Ok(NotFound) when self.ecx_file was None, hiding a closed or
   corrupt volume behind what looks like an idempotent no-op. It now
   returns an io::Error carrying the volume id so callers (e.g.
   journal_delete) fail loudly instead.

Existing Go and Rust EC test suites stay green.

* ec: make .ecx immutable at runtime; track deletes in memory + .ecj

Refactors both volume servers so the sealed sorted .ecx index is never
mutated during normal operation. Runtime deletes are committed to the
.ecj deletion journal and tracked in an in-memory deleted-needle set;
read-path lookups consult that set to mask out deleted ids on top of
the immutable .ecx record. Mirrors the intended design on both Go and
Rust sides.

EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from
.ecj in NewEcVolume / EcVolume::new. DeleteNeedleFromEcx /
journal_delete:

  1. Looks the needle up read-only in .ecx.
  2. Missing needle -> no-op.
  3. Pre-existing .ecx tombstone (from a prior decode/rebuild) ->
     mirror into the in-memory set, no .ecj append.
  4. Otherwise append the id to .ecj, fsync, and only then publish
     the id into the set. A partial write is truncated back to the
     pre-append length so the on-disk journal and the in-memory set
     cannot drift.

FindNeedleFromEcx / find_needle_from_ecx now return
TombstoneFileSize when the id is in the in-memory set, even though
the bytes on disk still show the original size.

FileAndDeleteCount:
  fileCount   = .ecx size / NeedleMapEntrySize (unchanged)
  deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize)

The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers
still fold .ecj into .ecx — that is the one place tombstones land in
the physical index, and it runs offline on closed files. Rust's
rebuild helper now also clears the in-memory set when it succeeds.

Dead code removed on the Rust side: `DeleteOutcome`,
`mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the
runtime `rollbackEcxTombstone` path. Neither helper was needed once
.ecx stopped being a runtime mutation target.

TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten
as TestEcVolumeDeleteDurableToJournal, which exercises the full
durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks
via the in-memory set -> raw .ecx bytes are *unchanged* -> Close +
RebuildEcxFile folds the journal into .ecx -> raw bytes now show
the tombstone, as CopyFile in the decode path expects.
2026-04-13 21:10:36 -07:00
Chris Lu
64af80c78d fix(mount): stop double-applying umask in Mkdir (#9063)
Mkdir was masking in.Mode with wfs.option.Umask on top of the kernel's
VFS umask pass, so a caller with umask=0 who requested mkdir(0777) got
0755 (0777 & ~022). Create and Symlink don't apply this second pass —
Mkdir was the odd one out. The resulting dirs had fewer write bits than
the caller asked for, which broke cross-user rename permission checks
(kernel may_delete rejects with EACCES when the parent lacks o+w even
though the caller explicitly requested it) and blocked pjdfstest
tests/rename/21.t and its cascading checks.

Drop the extra umask so Mkdir trusts in.Mode exactly like Create. The
CLI -umask flag still covers the internal cache dirs that the mount
creates for itself via os.MkdirAll; only the user-facing Mkdir path
changes.

Unblocks tests/rename/21.t — full pjdfstest suite is now 236 files /
8819 tests, all PASS, and known_failures.txt is empty.
2026-04-13 19:34:30 -07:00
Chris Lu
c8433a19f0 fix(mount): propagate hard-link nlink changes to sibling cache entries (#9062)
* fix(mount): propagate hard-link nlink changes to sibling cache entries

weed mount serves stat from its local metacache, and the kernel also
caches inode attrs from FUSE replies. When a hard link was unlinked or
a new link added, the filer updated the shared HardLink blob correctly,
but the sibling link entries in the mount's metacache still carried the
stale HardLinkCounter and the kernel attr cache on the shared inode was
not invalidated. Subsequent lstat on any sibling link returned the old
nlink — pjdfstest link/00.t caught this after `unlink n0` and on
`link n1 n2` stating n0.

Walk every path bound to the hard-linked inode via a new
InodeToPath.GetAllPaths, rewrite each cached sibling's HardLinkCounter
and ctime to the authoritative new value, and call
fuseServer.InodeNotify to invalidate the kernel attr cache for the
shared inode. Applied from both Link (bump) and Unlink (decrement).

Unblocks tests/link/00.t and tests/unlink/00.t in pjdfstest; full suite
(235 files, 8803 tests) passes end-to-end with no regressions.

* fix(mount): harden hard-link sibling sync against nil Attributes and id mismatch

Review follow-ups:
- Unlink: guard entry.Attributes for nil before reading Inode, with a
  fallback to inodeToPath.GetInode resolved before RemovePath. Fold the
  duplicated RemovePath into a single call.
- syncHardLinkSiblings: skip siblings whose HardLinkId does not match
  the authoritative entry. The shared-inode invariant normally
  guarantees a match, but a transient mismatch (e.g. a rename replaced
  one of the paths) would otherwise stamp an unrelated entry with the
  wrong counter.

Full pjdfstest suite still passes (235 files, 8803 tests).
2026-04-13 18:40:57 -07:00
Chris Lu
f00cbe4a6d test(vacuum): fix flaky TestVacuumIntegration across multiple volumes (#9061)
* test(vacuum): fix flaky TestVacuumIntegration across multiple volumes

The test assumed all uploaded files landed in a single volume and
tracked only the last file's volume id. With -volumeSizeLimitMB 10
and 16x500KB files, the master can spread uploads across volumes,
so the tracked id could point to a volume with no deletes and thus
0% garbage — causing verify_garbage_before_vacuum to fail even
though vacuum ran correctly on the other volume.

Track the set of volumes where deletes actually occurred and
verify garbage/cleanup against all of them. Also add a short
retry loop on the pre-vacuum check to absorb heartbeat jitter.

* test(vacuum): require all dirty volumes ready; retry cleanup check

Address review feedback: the pre-vacuum check now waits until every
volume in dirtyVolumes reports garbage > threshold (not just the
first), and the post-vacuum cleanup check retries per-volume with a
deadline instead of relying on a fixed sleep, since vacuum + heartbeat
reporting is asynchronous.

* test(vacuum): deterministic dirty volumes order, aggregate cleanup failures

- Sort dirtyVolumes after building from the set so logs and iteration
  are stable across runs.
- In verify_cleanup_after_vacuum, track per-volume failure reasons in a
  map and report all still-failing volumes on timeout instead of only
  the last one that happened to be written to lastErr.
2026-04-13 17:56:32 -07:00
Lars Lehtonen
e0c361ec77 fix(weed/worker/tasks): log dropped errors (#9057) 2026-04-13 16:03:02 -07:00
Chris Lu
8f2a3d92bb docker: upgrade libcrypto3/libssl3 to clear Trivy HIGH (CVE-2026-28390) (#9059)
* docker: upgrade libcrypto3/libssl3 to clear Trivy HIGH

Trivy gate on ghcr.io/seaweedfs/seaweedfs:latest-amd64 flagged
CVE-2026-28390 in libcrypto3 3.5.5-r0 (fixed in 3.5.6-r0) on the
alpine 3.23.3 base. Add libcrypto3/libssl3 to the existing apk upgrade
so rebuilt images pick up the patched openssl without waiting for a
new alpine base tag.

* docker: apk add libcrypto3/libssl3 so they install at patched version

Per review, apk upgrade <pkg> is a no-op when the package isn't already
installed. libcrypto3/libssl3 come in transitively via curl, so list
them in apk add to guarantee installation at the latest (patched)
version from the alpine repo.
2026-04-13 15:34:11 -07:00
Chris Lu
ef77df6141 admin: include EC volumes in bucket size reporting (#9058)
* admin: include EC volumes in bucket size reporting

The Object Store buckets page computed per-collection size by iterating
only regular volumes, so once a bucket's data was EC-encoded it silently
disappeared from the reported size — breaking usage-based billing.

Walk EcShardInfos alongside VolumeInfos in collectCollectionStats: add
raw shard bytes to PhysicalSize, and the parity-stripped value
(shardBytes * DataShardsCount / TotalShardsCount) to LogicalSize,
matching the normalization used by `weed shell` cluster.status.

* admin: derive EC logical size from shard bitmap, not constants

Use ShardsInfoFromVolumeEcShardInformationMessage + MinusParityShards
to sum actual data-shard bytes instead of scaling raw bytes by the
DataShardsCount/TotalShardsCount ratio. Keeps the data/parity split
encapsulated in the erasure_coding package and is exact when shard
sizes differ (e.g. last shard).

* admin: regression test for EC shard size aggregation

Cover the uneven-tail-shard case (data shard 9 < 1000 bytes) and the
empty-collection-name path to pin PhysicalSize/LogicalSize behavior
for collectCollectionStats against future changes.
2026-04-13 15:15:21 -07:00
Chris Lu
50f25bb5cd 4.20 4.20 2026-04-13 13:25:13 -07:00
Chris Lu
512912cbb8 Update plugin_templ.go 2026-04-13 13:10:03 -07:00
dependabot[bot]
8d6c5cbb58 build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest/tools (#9056)
build(deps): bump org.apache.kafka:kafka-clients

Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2.

---
updated-dependencies:
- dependency-name: org.apache.kafka:kafka-clients
  dependency-version: 3.9.2
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 13:05:09 -07:00
dependabot[bot]
f3151900e4 build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.98.0 to 1.99.0 (#9053)
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3

Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.98.0 to 1.99.0.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.98.0...service/s3/v1.99.0)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/service/s3
  dependency-version: 1.99.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 13:03:35 -07:00
Chris Lu
7aaa431bb4 s3api: prune bucket-scoped IAM actions on DeleteBucket (#9054)
* s3api: prune bucket-scoped IAM actions on DeleteBucket

DeleteBucket removed the bucket directory and collection but left
behind any identity actions configured via s3.configure that were
scoped to that bucket (e.g. Read:bucket, Write:bucket/prefix),
leaving stale auth metadata that users expected to be cleaned up
along with the bucket.

After a successful delete, strip actions whose resource is exactly
the bucket or a prefix under it, save via the credential manager,
and let the existing filer metadata subscription fan the reload out
to every S3 server. Wildcarded resources and global actions are
preserved since they may cover other buckets; static identities
are left untouched.

Fixes #5310

* s3api: address review feedback on bucket IAM prune

- Apply per-identity updates via credentialManager.UpdateUser instead
  of a full LoadConfiguration/SaveConfiguration round-trip, so the
  prune no longer clobbers concurrent IAM edits made by s3.configure
  or the IAM API during a DeleteBucket.
- Use a 30s bounded background context for the post-delete cleanup so
  it survives client disconnect — the bucket is already gone by then
  and this is best-effort bookkeeping.
- Skip static identities via IsStaticIdentity, since the credential
  store never persists them and UpdateUser would return NotFound.
2026-04-13 12:13:38 -07:00
Arthur Woimbée
8049fcc516 correctly namespace all define calls (#9044)
* correctly namespace all `define` calls

* fix unrelated issue: wrong dict call to gen sftp passwd
2026-04-13 11:49:12 -07:00
dependabot[bot]
06cbd2acdf build(deps): bump golang.org/x/net from 0.52.0 to 0.53.0 (#9052)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.52.0 to 0.53.0.
- [Commits](https://github.com/golang/net/compare/v0.52.0...v0.53.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-version: 0.53.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 11:14:43 -07:00
Mark McCormick
2ee6907c19 Update Helm Chart docs with instructions for deploying RocksDB variant (#9006)
* Update documentation for helm chart, with instructions on how to deploy the RocksDB image tag variant.

Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev>

Nit: Update example to make it clearer that the seaweedfs version needs to be replaced.

Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev>

* docs(helm): clarify RocksDB variant instructions

- Note that filer persistence (enablePVC) is required so RocksDB
  metadata survives restarts.
- Explain why master/volume also use the rocksdb-tagged image.
- Tighten wording around WEED_LEVELDB2_ENABLED override.

---------

Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-04-13 10:56:14 -07:00
dependabot[bot]
cc5b246973 build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.13 to 1.32.14 (#9051)
build(deps): bump github.com/aws/aws-sdk-go-v2/config

Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.13 to 1.32.14.
- [Release notes](https://github.com/aws/aws-sdk-go-v2/releases)
- [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.13...config/v1.32.14)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go-v2/config
  dependency-version: 1.32.14
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-04-13 10:47:55 -07:00