mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-21 17:21:34 +00:00
2fd60cfbc3cbfc0fdd1af59166dfdf16c2a9136d
13487 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2fd60cfbc3 |
fix(balance): guard against destination overshoot and oscillation (#9090)
* fix(balance): guard against destination overshoot and oscillation
Plugin-worker volume_balance detection re-selects maxServer/minServer each iteration based on utilization ratio. With heterogeneous MaxVolumeCount values, a single greedy move can flip which server is most-utilized, causing A->B, B->A oscillation within one detection cycle and pushing destinations past the cluster ideal.
Mirror the shell balancer's per-move guard (weed/shell/command_volume_balance.go:440): before scheduling a move, verify that the destination's post-move utilization would not strictly exceed the source's post-move utilization. If it would, no single move can improve balance, so stop.
Add regression tests that cover:
- TestDetection_HeterogeneousMax_NoOvershootNoOscillation: 2 servers with different caps just above threshold; detection must not oscillate or make the imbalance worse.
- TestDetection_RespectsClusterIdealUtilization: 3-server heterogeneous layout; destinations must not overshoot cluster ideal.
* fix(balance): use effective capacity when resolving destination disk
resolveBalanceDestination read VolumeCount directly from the topology snapshot, which is not updated when AddPendingTask registers a move within the current detection cycle. This meant multiple moves planned in a single cycle all saw the same static count and could target the same disk past its effective capacity.
Switch to ActiveTopology.GetNodeDisks + GetEffectiveAvailableCapacity so that destination planning accounts for all pending and assigned tasks affecting the disk, consistent with how the detection loop already tracks effectiveCounts at the server level.
Add a unit test that seeds two pending balance tasks against a destination disk with 2 free slots and asserts resolveBalanceDestination rejects a third planned move.
* fix(ec_balance): capacity-weighted guard in Phase 4 global rebalance
detectGlobalImbalance picked min/max nodes by raw shard count and compared them against a simple (unweighted) rack-wide average. With heterogeneous MaxVolumeCount across nodes in the same rack, this lets the greedy algorithm move shards from a large, barely-used node to a small, nearly-full node just because the small node has fewer shards in absolute terms, strictly worsening imbalance by utilization and potentially overfilling the small node.
Snapshot each node's total shard capacity (current shards plus free slots) at loop start and add a per-move convergence guard: reject any move where the destination's post-move utilization would strictly exceed the source's post-move utilization. Mirrors the fix in weed/worker/tasks/balance/detection.go.
Regression test TestDetectGlobalImbalance_HeterogeneousCapacity covers a rack with node1 (cap 100, 10 shards → 10% util) and node2 (cap 5, 3 shards → 60% util). Before the fix, Phase 4 moves 2 shards from node1 to node2, filling node2 to 100% util. After the fix, the guard blocks both moves.
* fix(ec_balance): utilization-based max/min in Phase 4 rebalance
Phase 4's global rebalancer picked source and destination nodes by raw shard count, and compared against a simple raw-count average. With heterogeneous MaxVolumeCount across nodes in a rack, this got the direction wrong: a large-capacity node holding many shards in absolute terms but only a small fraction of its capacity would be picked as the "overloaded" source, while a small-capacity node nearly at its slot limit (but holding fewer absolute shards) would be picked as the "underloaded" destination. The previous fix added a strict-improvement guard that prevented the bad move but left balance untouched: the rack stayed in an uneven state.
Switch to utilization-based selection and a utilization-based pre-check:
- Pick max/min by (count / capacity), where capacity is the node's current allowed shards plus remaining free slots (snapshotted once per rack and held constant for the duration of the loop).
- Replace the raw-count imbalance gate (exceedsImbalanceThreshold) with a new exceedsUtilImbalanceThreshold helper that compares fractional fullness. The raw-count gate is still used by Phase 2 and Phase 3, where the per-rack / per-volume semantics differ.
- Drop the raw-count guards (maxCount <= avgShards || minCount+1 > avgShards and maxCount-minCount <= 1) now that the per-move strict-improvement check handles termination correctly for both homogeneous and heterogeneous capacity.
Also fix a latent bug in the inner shard-selection loop: it was not updating shardBits between iterations, so every iteration picked the same lowest-set bit and emitted duplicate move requests for the same physical shard. Update maxNode and minNode's shardBits immediately after appending a move, mirroring what applyMovesToTopology does between phases.
Update TestDetectGlobalImbalance_HeterogeneousCapacity to assert:
- Moves flow from the higher-util node2 to the lower-util node1 (direction check), and
- Each (volumeID, shardID) pair appears at most once in the move list (duplicate-shard guard).
* fix(ec_balance): keep source freeSlots in sync after planned shard moves
All three phase loops that plan EC shard moves (detectCrossRackImbalance, detectWithinRackImbalance, detectGlobalImbalance) decrement the destination node's freeSlots but leave the source node's freeSlots stale. Over the course of a detection run that processes many volumes or iterates within a rack, the source's reported freeSlots drifts below its actual value.
In Phase 4 specifically, the per-move strict-improvement guard prevents the source from becoming a destination candidate, so the stale value never affects decisions. In Phases 2 and 3 it can: a node that sheds shards for one volume's rebalance is eligible as a destination for another volume in the same run, and the destination selection uses node.freeSlots <= 0 as a hard skip (findDestNodeInUnderloadedRack / findLeastLoadedNodeInRack). A tightly-provisioned node could be skipped as a destination even after it has freed slots.
Increment maxNode.freeSlots / node.freeSlots symmetrically at each scheduled move so freeSlots remains an accurate running view of available slot capacity throughout a detection run. |
||
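The per-move convergence guard described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from weed/worker/tasks/balance/detection.go: the `server` type, its fields, and the `moveImprovesBalance` name are all hypothetical, and the numbers reuse the node1/node2 figures from the regression test description.

```go
package main

import "fmt"

// server is a hypothetical stand-in for a node as seen by the balancer.
type server struct {
	name     string
	count    int // volumes (or shards) currently counted against the node
	capacity int // maximum the node may hold (e.g. MaxVolumeCount)
}

// utilization returns count/capacity and treats a zero-capacity node as full.
func utilization(count, capacity int) float64 {
	if capacity <= 0 {
		return 1.0
	}
	return float64(count) / float64(capacity)
}

// moveImprovesBalance reports whether moving one item from src to dst keeps
// dst's post-move utilization at or below src's post-move utilization.
// A move that fails this check cannot improve balance, so the caller stops.
func moveImprovesBalance(src, dst server) bool {
	if src.count <= 0 || dst.count >= dst.capacity {
		return false // nothing to move, or destination has no free slot
	}
	srcAfter := utilization(src.count-1, src.capacity)
	dstAfter := utilization(dst.count+1, dst.capacity)
	return dstAfter <= srcAfter
}

func main() {
	node1 := server{name: "node1", count: 10, capacity: 100} // 10% utilized
	node2 := server{name: "node2", count: 3, capacity: 5}    // 60% utilized

	// Raw counts would suggest node1 -> node2; the guard rejects that move.
	fmt.Println("node1 -> node2 allowed:", moveImprovesBalance(node1, node2))
	// The utilization-aware direction is accepted.
	fmt.Println("node2 -> node1 allowed:", moveImprovesBalance(node2, node1))
}
```

Comparing post-move values rather than current values is what prevents the A->B, B->A flip: a move is only planned when the two nodes do not swap roles as a result of it.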
|
|
979c54f693 |
fix(wdclient,volume): compare master leader with ServerAddress.Equals (#9089)
* fix(wdclient,volume): compare master leader with ServerAddress.Equals
Raft leader is advertised as host:httpPort.grpcPort, but clients dial host:httpPort. Raw string comparison against VolumeLocation.Leader / HeartbeatResponse.Leader therefore never matches, causing the masterclient and the volume server heartbeat loop to continuously "redirect" to the already-connected master, tearing down the stream and reconnecting. Use ServerAddress.Equals, which normalizes the grpc-port suffix.
* fix(filer,mq): compare ServerAddress via Equals in two more sites
filer bootstrap skip (MaybeBootstrapFromOnePeer) and the broker's local partition assignment check both compared a wire-supplied address string against the local self ServerAddress with raw string equality. Both are vulnerable to the same plain-vs-host:port.grpcPort mismatch as the masterclient/volume heartbeat sites: filer would bootstrap from itself, and the broker would fail to claim a partition it was actually assigned. Route both through ServerAddress.Equals.
* fix(master,shell): more ServerAddress comparisons via Equals
- raft_server_handlers.go HealthzHandler: s.serverAddr == leader would skip the child-lock check on the real leader when the two carry different plain/grpc-suffix forms, returning 200 OK instead of 423.
- master_server.go SetRaftServer leader-change callback: the Leader() == Name() guard for ensureTopologyId could disagree with topology.IsLeader() (which already uses Equals), so leader-only initialization could be skipped after an election.
- command_volume_merge.go isReplicaServer: the -target guard compared user-supplied host:port against NewServerAddressFromDataNode(...) with ==, letting an existing replica slip through when topology carries the embedded gRPC port.
All routed through pb.ServerAddress.Equals.
* fix(mq,cluster): more ServerAddress comparisons via Equals
- broker_grpc_lookup.go GetTopicPublishers/GetTopicSubscribers: the partition ownership check gated listing on raw LeaderBroker == BrokerAddress().String(), so listings silently omitted partitions hosted locally when the assignment carried the other host:port / host:port.grpcPort form.
- lock_client.go: LockHostMovedTo comparison and the seedFiler fallback guard both used raw string equality against configured filer addresses (which may be plain host:port while LockHostMovedTo comes back suffixed), causing spurious host-change churn and blocking the seed-filer fallback.
* fix(mq): more ServerAddress comparisons via Equals
- pub_balancer/allocate.go EnsureAssignmentsToActiveBrokers: direct activeBrokers.Get() lookup missed brokers when a persisted assignment carried a different address encoding than the registered broker key, triggering a bogus reassignment on every read/write cycle. Added a findActiveBroker helper that falls back to an Equals-based scan and canonicalizes the assignment in place so later writes are stable.
- broker_grpc_lookup.go isLockOwner: used raw string equality between LockOwner() and BrokerAddress().String(), so a lock owner could fail to recognize itself and proxy local lookup/config/admin RPCs away.
- pub_client/scheduler.go onEachAssignments: reused publisher jobs only on exact LeaderBroker match, so an encoding flip in lookup results tore down and recreated a stream to the same broker. |
||
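To make the host:httpPort versus host:httpPort.grpcPort mismatch concrete, here is a small sketch of suffix-insensitive comparison. It is not the actual pb.ServerAddress.Equals implementation; `stripGrpcSuffix` and `sameServer` are hypothetical helpers that only assume the two address forms named in the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// stripGrpcSuffix is a hypothetical normalizer: it turns "host:9333.19333"
// into "host:9333" and leaves a plain "host:9333" untouched.
func stripGrpcSuffix(addr string) string {
	colon := strings.LastIndex(addr, ":")
	if colon < 0 {
		return addr // no port part at all
	}
	// Only look for the dot after the last colon, so dots inside an IPv4
	// host such as 10.0.0.1 are not mistaken for the gRPC separator.
	if dot := strings.Index(addr[colon:], "."); dot >= 0 {
		return addr[:colon+dot]
	}
	return addr
}

// sameServer compares two addresses after dropping any gRPC-port suffix.
func sameServer(a, b string) bool {
	return stripGrpcSuffix(a) == stripGrpcSuffix(b)
}

func main() {
	advertised := "10.0.0.1:9333.19333" // form advertised by the Raft leader
	dialed := "10.0.0.1:9333"           // form a client dials

	fmt.Println("raw equality:       ", advertised == dialed)
	fmt.Println("normalized equality:", sameServer(advertised, dialed))
}
```

With raw string equality the client would conclude that the leader is a different server and reconnect, which is the redirect loop the commit describes.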
|
|
d292d3640d | chore(weed/mq/kafka/schema): remove unused functions (#9088) | ||
|
|
2c460e0fc9 |
build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest (#9082)
build(deps): bump org.apache.kafka:kafka-clients
Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2.
---
updated-dependencies:
- dependency-name: org.apache.kafka:kafka-clients
  dependency-version: 3.9.2
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
d0a09ea178 |
fix(s3): honor ChecksumAlgorithm on presigned URL uploads (#9076)
* fix(s3): honor ChecksumAlgorithm on presigned URL uploads
AWS SDK presigners hoist x-amz-sdk-checksum-algorithm (and related checksum headers) into the signed URL's query string, so servers must read either location. detectRequestedChecksumAlgorithm only looked at request headers, so presigned PUTs with ChecksumAlgorithm set validated and stored no additional checksum, and HEAD/GET never returned the x-amz-checksum-* header.
Read these parameters from headers first, then fall back to a case-insensitive query-string lookup. Apply the same fallback when comparing an object-level checksum value against the computed one.
Fixes #9075
* test(s3): presigned URL checksum integration tests (#9075)
Adds test/s3/checksum with end-to-end coverage for flexible-checksum behavior on presigned URL uploads. Tests generate a presigned PUT URL with ChecksumAlgorithm set, upload the body with a plain http.Client (bypassing AWS SDK middleware so the server must honor the query-string hoisted x-amz-sdk-checksum-algorithm), then HEAD/GET with ChecksumMode=ENABLED and assert the stored x-amz-checksum-* header. Covers SHA256, SHA1, and a negative control with no checksum requested.
Wires the new directory into s3-go-tests.yml as its own CI job.
* perf(s3): parse presigned query once in detectRequestedChecksumAlgorithm
Previously, each header fallback called getHeaderOrQuery, which re-parsed r.URL.Query() and allocated a new map on every invocation, up to eight times per PutObject request. Parse the raw query at most once per request (only when non-empty) and pass the pre-parsed url.Values into a new lookupHeaderOrQuery helper.
Also drops a redundant strings.ToLower allocation in the case-insensitive query key scan (strings.EqualFold already handles ASCII case folding).
Addresses review feedback from gemini-code-assist on PR #9076.
* test(s3): honor credential env vars and add presigned upload timeout
- init() now reads S3_ACCESS_KEY/S3_SECRET_KEY (and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION fallbacks) so that `make test-with-server ACCESS_KEY=... SECRET_KEY=...` no longer authenticates with hardcoded defaults while the server has been started with different credentials.
- uploadViaPresignedURL uses a dedicated http.Client with a 30s timeout instead of http.DefaultClient, so a stalled server fails fast in CI instead of blocking until the suite's global timeout fires.
Addresses review feedback from coderabbitai on PR #9076.
* test(s3): pass S3_PORT and credentials through to checksum tests
- 'make test' now exports S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY derived from the Makefile variables so the Go test process talks to the same endpoint/credentials that start-server was launched with.
- start-server cleans up the background SeaweedFS process and PID file when the readiness poll times out, preventing stale port conflicts on subsequent runs.
Addresses review feedback from coderabbitai on PR #9076.
* ci(s3): raise checksum tests step timeout
make test-with-server builds weed_binary, waits up to 90s for readiness, then runs go test -timeout=10m. The previous 12-minute step timeout only had ~2 minutes of headroom over the Go timeout, risking the Actions runner killing the step before tests reported a real failure. Bumps the job timeout from 15 to 20 minutes and the step timeout from 12 to 16 minutes, matching other S3 integration jobs.
Addresses review feedback from coderabbitai on PR #9076.
* perf(s3): thread pre-parsed query through putToFiler hot path
Parse the request's query string once at the top of putToFiler and reuse the resulting url.Values for both the checksum-algorithm detection and the expected-checksum verification. Previously, the verification path called getHeaderOrQuery which re-parsed r.URL.Query() again on every PutObject, defeating the previous commit's single-parse goal.
- Add parseRequestQuery + detectRequestedChecksumAlgorithmQ (the pre-parsed-query variant). detectRequestedChecksumAlgorithm is now a thin wrapper used by callers that do a single lookup.
- putToFiler parses once and threads the result through both call sites.
- Remove getHeaderOrQuery and update the unit test to use lookupHeaderOrQuery directly.
Addresses follow-up review from gemini-code-assist on PR #9076.
* test(s3): check io.ReadAll error in uploadViaPresignedURL helper
* test(s3): drop SHA1 presigned test case
The AWS SDK v2 presigner signs a Content-MD5 header at presign time for SHA1 PutObject requests even when no body is attached (the MD5 of the empty payload gets baked into the signed headers). Uploading the real body via a plain http.Client then trips SeaweedFS's MD5 validation and returns BadDigest, an SDK/presigner quirk, not a SeaweedFS bug. The SHA256 positive case already exercises the server-side query-hoisted algorithm path that issue #9075 is about, and the unit tests in weed/s3api cover each algorithm's header mapping. Drop the SHA1 integration case rather than chase SDK-specific workarounds.
* test(s3): provide real Content-MD5 to presigned checksum test
AWS SDK v2's flexible-checksum middleware signs a Content-MD5 header at presign time. There is no body to hash at that point, so it seeds the header with MD5 of the empty payload. When the real body is then PUT with a plain http.Client, SeaweedFS's server-side Content-MD5 verification correctly rejects the upload with BadDigest.
Pre-compute the MD5 of the test body and thread it into PutObjectInput.ContentMD5 so the signed Content-MD5 matches the body that will actually be uploaded. The test still exercises the server-side path that reads X-Amz-Sdk-Checksum-Algorithm from the query string (the fix that PR #9076 is validating).
* test(s3): send the signed Content-MD5 header on presigned upload
uploadViaPresignedURL now accepts an extraHeaders map so callers can thread through headers that the presigner signed but the raw http request would otherwise omit. The SHA256 test passes the Content-MD5 it computed, matching what the presigner baked into the signature. Fixes SignatureDoesNotMatch seen in CI after the previous commit set ContentMD5 on the presign input without sending the corresponding header on the actual upload.
* test(s3): build presigned URL with the raw v4 signer
The AWS SDK v2 s3.PresignClient runs the flexible-checksum middleware for any PutObject input that carries ChecksumAlgorithm. That middleware injects a Content-MD5 header at presign time, and with no body present it seeds MD5-of-empty. Any subsequent upload of a non-empty body through a plain http.Client then trips SeaweedFS's Content-MD5 verification and returns BadDigest, not the code path that issue #9075 is about.
Replace the PresignClient usage in the integration test with a direct call to v4.Signer.PresignHTTP, building a canonical URL whose query string already contains x-amz-sdk-checksum-algorithm=SHA256. This is exactly the shape of URL a browser/curl client would receive from any presigner that hoists the algorithm header, and it exercises the server-side fix from PR #9076 without dragging in SDK-specific middleware quirks.
* test(s3): set X-Amz-Expires on presigned URL before signing
v4.Signer.PresignHTTP does not add X-Amz-Expires on its own: the caller has to seed it into the request's query string so the signer includes it in the canonical query and the server accepts the presigned URL. Without it, SeaweedFS correctly returns AuthorizationQueryParametersError.
Also adds a .gitignore for the make-managed test volume data, log file, and PID file so local `make test-with-server` runs do not leave artifacts tracked by git.
Verified by running the integration tests locally: make test-with-server → both presigned checksum tests PASS. |
||
|
|
08d9193fe1 |
[nfs] Add NFS (#9067)
* add filer inode foundation for nfs
* nfs command skeleton
* add filer inode index foundation for nfs
* make nfs inode index hardlink aware
* add nfs filehandle and inode lookup plumbing
* add read-only nfs frontend foundation
* add nfs namespace mutation support
* add chunk-backed nfs write path
* add nfs protocol integration tests
* add stale handle nfs coverage
* complete nfs hardlink and failover coverage
* add nfs export access controls
* add nfs metadata cache invalidation
* fix nfs chunk read lookup routing
* fix nfs review findings and rename regression
* address pr 9067 review comments
- filer_inode: fail fast if the snowflake sequencer cannot start, and let
operators override the 10-bit node id via SEAWEEDFS_FILER_SNOWFLAKE_ID
to avoid multi-filer collisions
- filer_inode: drop the redundant retry loop in nextInode
- filerstore_wrapper: treat inode-index writes/removals as best-effort so
a primary store success no longer surfaces as an operation failure
- filer_grpc_server_rename: defer overwritten-target chunk deletion until
after CommitTransaction so a rolled-back rename does not strand live
metadata pointing at freshly deleted chunks
- command/nfs: default ip.bind to loopback and require an explicit
filer.path, so the experimental server does not expose the entire
filer namespace on first run
- nfs integration_test: document why LinkArgs matches go-nfs's on-the-wire
layout rather than RFC 1813 LINK3args
* mount: pre-allocate inode in Mkdir and Symlink
Mkdir and Symlink used to send filer_pb.CreateEntryRequest with
Attributes.Inode = 0. After PR 9067, the filer's CreateEntry now assigns
its own inode in that case, so the filer-side entry ends up with a
different inode than the one the mount allocates via inodeToPath.Lookup
and returns to the kernel. Once applyLocalMetadataEvent stores the
filer's entry in the meta cache, subsequent GetAttr calls read the
cached entry and hit the setAttrByPbEntry override at line 197 of
weedfs_attr.go, returning the filer-assigned inode instead of the
mount's local one. pjdfstest tests/rename/00.t (subtests 81/87/91)
caught this — it lstat'd a freshly-created directory/symlink, renamed
it, lstat'd again, and saw a different inode the second time.
createRegularFile already pre-allocates via inodeToPath.AllocateInode
and stamps it into the create request. Do the same thing in Mkdir and
Symlink so both sides agree on the object identity from the very first
request, and so GetAttr's cache path returns the same value as Mkdir /
Symlink's initial response.
* sequence: mask snowflake node id on int→uint32 conversion
CodeQL flagged the unchecked uint32(snowflakeId) cast in
NewSnowflakeSequencer as a potential truncation bug when snowflakeId is
sourced from user input (e.g. via SEAWEEDFS_FILER_SNOWFLAKE_ID). Mask
to the 10 bits the snowflake library actually uses so any caller-
supplied int is safely clamped into range.
* add test/nfs integration suite
Boots a real SeaweedFS cluster (master + volume + filer) plus the
experimental `weed nfs` frontend as subprocesses and drives it through
the NFSv3 wire protocol via go-nfs-client, mirroring the layout of
test/sftp. The tests run without a kernel NFS mount, privileged ports,
or any platform-specific tooling.
Coverage includes read/write round-trip, mkdir/rmdir, nested
directories, rename content preservation, overwrite + explicit
truncate, 3 MiB binary file, all-byte binary and empty files, symlink
round-trip, ReadDirPlus listing, missing-path remove, FSInfo sanity,
sequential appends, and readdir-after-remove.
Framework notes:
- Picks ephemeral ports with net.Listen("tcp", "127.0.0.1:0") and passes
-port.grpc explicitly so the default port+10000 convention cannot
overflow uint16 on macOS.
- Pre-creates the /nfs_export directory via the filer HTTP API before
starting the NFS server — the NFS server's ensureIndexedEntry check
requires the export root to exist with a real entry, which filer.Root
does not satisfy when the export path is "/".
- Reuses the same rpc.Client for mount and target so go-nfs-client does
not try to re-dial via portmapper (which concatenates ":111" onto the
address).
* ci: add NFS integration test workflow
Mirror test/sftp's workflow for the new test/nfs suite so PRs that touch
the NFS server, the inode filer plumbing it depends on, or the test
harness itself run the 14 NFSv3-over-RPC integration tests on Ubuntu
22.04 via `make test`.
* nfs: use append for buffer growth in Write and Truncate
The previous make+copy pattern reallocated the full buffer on every
extending write or truncate, giving O(N^2) behaviour for sequential
write loops. Switching to `append(f.content, make([]byte, delta)...)`
lets Go's amortized growth strategy absorb the repeated extensions.
Called out by gemini-code-assist on PR 9067.
* filer: honor caller cancellation in collectInodeIndexEntries
Dropping the WithoutCancel wrapper lets DeleteFolderChildren bail out of
the inode-index scan if the client disconnects mid-walk. The cleanup is
already treated as best-effort by the caller (it logs on error and
continues), so a cancelled walk just means the partial index rebuild is
skipped — the same failure mode as any other index write error.
Flagged as a DoS concern by gemini-code-assist on PR 9067.
* nfs: skip filer read on open when O_TRUNC is set
openFile used to unconditionally loadWritableContent for every writable
open and then discard the buffer if O_TRUNC was set. For large files
that is a pointless 64 MiB round-trip. Reorder the branches so we only
fetch existing content when the caller intends to keep it, and mark the
file dirty right away so the subsequent Close still issues the
truncating write. Called out by gemini-code-assist on PR 9067.
* nfs: allow Seek on O_APPEND files and document buffered write cap
Two related cleanups on filesystem.go:
- POSIX only restricts Write on an O_APPEND fd, not lseek. The existing
Seek error ("append-only file descriptors may only seek to EOF")
prevented read-and-write workloads that legitimately reposition the
read cursor. Write already snaps the offset to EOF before persisting
(see seaweedFile Write), so Seek can unconditionally accept any
offset. Update the unit test that was asserting the old behaviour.
- Add a doc comment on maxBufferedWriteSize explaining that it is a
per-file ceiling, the memory footprint it implies, and that the real
fix for larger whole-file rewrites is streaming / multi-chunk support.
Both changes flagged by gemini-code-assist on PR 9067.
* nfs: guard offset before casting to int in Write
CodeQL flagged `int(f.offset) + len(p)` inside the Write growth path as
a potential overflow on architectures where `int` is 32-bit. The
existing check only bounded the post-cast value, which is too late.
Clamp f.offset against maxBufferedWriteSize before the cast and also
reject negative/overflowed endOffset results. Both branches fall
through to billy.ErrNotSupported, the same behaviour the caller gets
today for any out-of-range buffered write.
* nfs: compute Write endOffset in int64 to satisfy CodeQL
The previous guard bounded f.offset but left len(p) unchecked, so
CodeQL still flagged `int(f.offset) + len(p)` as a possible int-width
overflow path. Bound len(p) against maxBufferedWriteSize first, do the
addition in int64, and only cast down after the total has been clamped
against the buffer ceiling. Behaviour is unchanged: any out-of-range
write still returns billy.ErrNotSupported.
* ci: drop emojis from nfs-tests workflow summary
Plain-text step summary per user preference — no decorative glyphs in
the NFS CI output or checklist.
* nfs: annotate remaining DEV_PLAN TODOs with status
Three of the unchecked items are genuine follow-up PRs rather than
missing work in this one, and one was actually already done:
- Reuse chunk cache and mutation stream helpers without FUSE deps:
checked off — the NFS server imports weed/filer.ReaderCache and
weed/util/chunk_cache directly with no weed/mount or go-fuse imports.
- Extract shared read/write helpers from mount/WebDAV/SFTP: annotated
as deferred to a separate refactor PR (touches four packages).
- Expand direct data-path writes beyond the 64 MiB buffered fallback:
annotated as deferred — requires a streaming WRITE path.
- Shared lock state + lock tests: annotated as blocked upstream on
go-nfs's missing NLM/NFSv4 lock state RPCs, matching the existing
"Current Blockers" note.
* test/nfs: share port+readiness helpers with test/testutil
Drop the per-suite mustPickFreePort and waitForService re-implementations
in favor of testutil.MustAllocatePorts (atomic batch allocation; no
close-then-hope race) and testutil.WaitForPort / SeaweedMiniStartupTimeout.
Pull testutil in via a local replace directive so this standalone
seaweedfs-nfs-tests module can import the in-repo package without a
separate release.
Subprocess startup is still master + volume + filer + nfs — no switch to
weed mini yet, since mini does not know about the nfs frontend.
* nfs: stream writes to volume servers instead of buffering the whole file
Before this change the NFS write path held the full contents of every
writable open in memory:
- OpenFile(write) called loadWritableContent which read the existing
file into seaweedFile.content up to maxBufferedWriteSize (64 MiB)
- each Write() extended content in-place
- Close() uploaded the whole buffer as a single chunk via
persistContent + AssignVolume
The 64 MiB ceiling made large NFS writes return NFS3ERR_NOTSUPP, and
even below the cap every Write paid a whole-file-in-memory cost. This
PR rewrites the write path to match how `weed filer` and the S3 gateway
persist data:
- openFile(write) no longer loads the existing content at all; it
only issues an UpdateEntry when O_TRUNC is set *and* the file is
non-empty (so a fresh create+trunc is still zero-RPC)
- Write() streams the caller's bytes straight to a volume server via
one AssignVolume + one chunk upload, then atomically appends the
resulting chunk to the filer entry through mutateEntry. Any
previously inlined entry.Content is migrated to a chunk in the same
update so the chunk list becomes the authoritative representation.
- Truncate() becomes a direct mutateEntry (drop chunks past the new
size, clip inline content, update FileSize) instead of resizing an
in-memory buffer.
- Close() is a no-op because everything was flushed inline.
The small-file fast path that the filer HTTP handler uses is preserved:
if the post-write size still fits in maxInlineWriteSize (4 MiB) and
the file has no existing chunks, we rewrite entry.Content directly and
skip the volume-server round-trip. This keeps single-shot tiny writes
(echo, small edits) cheap while completely removing the 64 MiB cap on
larger files. Read() now always reads through the chunk reader instead
of a local byte slice, so reads inside the same session see the freshly
appended data.
Drops the unused seaweedFile.content / dirty fields, the
maxBufferedWriteSize constant, and the loadWritableContent helper.
Updates TestSeaweedFileSystemSupportsNamespaceMutations expectations
to match the new "no extra O_TRUNC UpdateEntry on an empty file"
behavior (still 3 updates: Write + Chmod + Truncate).
* filer: extract shared gateway upload helper for NFS and WebDAV
Three filer-backed gateways (NFS, WebDAV, and mount) each had a local
saveDataAsChunk that wrapped operation.NewUploader().UploadWithRetry
with near-identical bodies: build AssignVolumeRequest, build
UploadOption, build genFileUrlFn with optional filerProxy rewriting,
call UploadWithRetry, validate the result, and call ToPbFileChunk.
Pull that body into filer.SaveGatewayDataAsChunk with a
GatewayChunkUploadRequest struct so both NFS and WebDAV can delegate
to one implementation.
- NFS's saveDataAsChunk is now a thin adapter that assembles the
GatewayChunkUploadRequest from server options and calls the helper.
The chunkUploader interface keeps working for test injection because
the new GatewayChunkUploader interface is structurally identical.
- WebDAV's saveDataAsChunk is similarly a thin adapter — it drops the
local operation.NewUploader call plus the AssignVolume/UploadOption
scaffolding.
- mount is intentionally left alone. mount's saveDataAsChunk has two
features that do not fit the shared helper (a pre-allocated file-id
pool used to skip AssignVolume entirely, and a chunkCache
write-through at offset 0 so future reads hit the mount's local
cache), both of which are mount-specific.
Marks the Phase 2 "extract shared read/write helpers from mount,
WebDAV, and SFTP" DEV_PLAN item as done. The filer-level chunk read
path (NonOverlappingVisibleIntervals + ViewFromVisibleIntervals +
NewChunkReaderAtFromClient) was already shared.
* nfs: remove DESIGN.md and DEV_PLAN.md
The planning documents have served their purpose — all phase 1 and
phase 2 items are landed, phase 3 streaming writes are landed, phase 2
shared helpers are extracted, and the two remaining phase 4 items
(shared lock state + lock tests) are blocked upstream on
github.com/willscott/go-nfs which exposes no NLM or NFSv4 lock state
RPCs. The running decision log no longer reflects current code and
would just drift. The NFS wiki page
(https://github.com/seaweedfs/seaweedfs/wiki/NFS-Server) now carries
the overview, configuration surface, architecture notes, and known
limitations; the source is the source of truth for the rest.
|
||
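The overflow guard from the "nfs: compute Write endOffset in int64" step above can be illustrated with the sketch below. It models only that intermediate step (a later step in the same PR removes the buffered write path altogether). The `maxBuffered` constant, `errOutOfRange`, and `endOffset` are hypothetical names; the commit says the real code returns billy.ErrNotSupported instead.

```go
package main

import (
	"errors"
	"fmt"
)

// maxBuffered is a hypothetical stand-in for the 64 MiB buffered-write cap.
const maxBuffered int64 = 64 << 20

// errOutOfRange stands in for the error the real code returns.
var errOutOfRange = errors.New("write outside buffered range")

// endOffset computes offset+n in int64 and only narrows to int after the
// total has been checked against the cap, so the addition cannot overflow
// a 32-bit int.
func endOffset(offset int64, n int) (int, error) {
	if offset < 0 || offset > maxBuffered {
		return 0, errOutOfRange // bound the offset before any arithmetic
	}
	if int64(n) > maxBuffered {
		return 0, errOutOfRange // bound the payload length as well
	}
	end := offset + int64(n) // both operands <= 64 MiB, so no int64 overflow
	if end > maxBuffered {
		return 0, errOutOfRange
	}
	return int(end), nil // safe: end fits in 32 bits
}

func main() {
	fmt.Println(endOffset(1024, 4096))
	fmt.Println(endOffset(maxBuffered-1, 2))
	fmt.Println(endOffset(-1, 10))
}
```

Bounding both operands before adding is the difference from the earlier guard, which checked only the offset and left `len(p)` unchecked.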
|
|
2818251dd5 |
fix(iceberg): clean stale data before creating a table (#9074) (#9077)
* fix(iceberg): clean stale data before creating a table
CREATE TABLE AS from Trino fails against the SeaweedFS Iceberg REST catalog with "Cannot create a table on a non-empty location". The catalog assigns every new table the deterministic <ns>/<table> path, and Trino's pre-write check rejects the CTAS whenever leftover objects live there, typically files from a prior DROP that did not purge the data, or an earlier aborted CTAS.
Make the catalog the authority for table existence: before writing any metadata, look up the table in the S3 Tables catalog. If it is already registered, return the existing definition (idempotent create). If it is not registered, any objects still sitting at the target location are stale, so purge them before proceeding. Live tables are never touched: the cleanup path is guarded by the catalog lookup.
Fixes #9074
* test(trino): regression for create/drop/recreate table (#9074)
Exercises the exact sequence from the reported bug: CREATE TABLE without an explicit location, INSERT, DROP, then CREATE again with the same name, followed by a CTAS on top. Previously the recreate failed with "Cannot create a table on a non-empty location" because stale data files from the dropped table lingered at the deterministic <schema>/<table> path. The test also asserts the recreated table does not see the dropped data and that a drop/recreate CTAS cycle works.
* fix(iceberg): purge dropped table location on DROP TABLE
Recreate-at-same-location was failing with "Cannot create a table on a non-empty location" because DROP TABLE only removed the catalog entry and left data files behind. Trino's pre-write emptiness check then rejected the subsequent CREATE.
Look up the table's storage location before deleting the catalog entry, and after a successful DeleteTable, purge the filer subtree at that location. The lookup uses the catalog (the authoritative owner of the name→location mapping), so cleanup is gated on a live table having existed; failed lookups skip cleanup and leave storage alone.
Also pin the regression test to an explicit, fixed location so the recreate genuinely targets the same path the drop was supposed to free.
* refactor(iceberg): use errors.As in isNoSuchTableError
* fix(iceberg): drop create-preflight cleanup; harden cleanup path
- Remove the destructive cleanupStaleTableLocation call from the CreateTable preflight. Storage cleanup now lives exclusively on the DROP path, so CreateTable has no side effects on storage beyond the new table's own metadata write.
- Validate tablePath in cleanupStaleTableLocation by rejecting empty, ".", "..", or backslash-bearing segments before joining with the bucket prefix, so a crafted location cannot escape the bucket subtree. path.Clean would silently collapse traversal, so segments are checked raw.
* test(trino): defer table drops in recreate test for failure-safe cleanup |
||
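The raw-segment check described in the iceberg commit above can be sketched as follows. This is a minimal illustration, not the SeaweedFS code: `validateTablePath` is a hypothetical name, and the sketch assumes the table path is a relative, slash-separated string with no leading or trailing slash.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// validateTablePath is a hypothetical stand-in for the check described above.
// It inspects the raw segments, because path.Clean would collapse "a/../b"
// to "b" and hide the traversal attempt.
func validateTablePath(tablePath string) error {
	if tablePath == "" {
		return errors.New("table path is empty")
	}
	for _, seg := range strings.Split(tablePath, "/") {
		if seg == "" || seg == "." || seg == ".." {
			return fmt.Errorf("invalid segment %q in table path %q", seg, tablePath)
		}
		if strings.Contains(seg, `\`) {
			return fmt.Errorf("backslash in segment %q of table path %q", seg, tablePath)
		}
	}
	return nil
}

func main() {
	// Print the verdict next to what path.Clean would have produced.
	for _, p := range []string{"ns/table", "ns/../other", "ns//table", `ns\table`} {
		fmt.Printf("%q ok=%v cleaned=%q\n", p, validateTablePath(p) == nil, path.Clean(p))
	}
}
```

Checking before cleaning is the point of the design: once a path has been normalized, a traversal attempt looks the same as a legitimate short path.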
|
|
2d9441726d |
fix(helm): skip s3 ServiceMonitor when only filer.s3 is enabled (#9081)
* fix(helm): skip s3 ServiceMonitor when only filer.s3 is enabled (#9080)
The seaweedfs-s3 Service only exposes a "metrics" port when the standalone s3 gateway is enabled. With filer.s3.enabled=true and s3.enabled=false the Service only has swfs-s3:8333, so the generated ServiceMonitor matched zero targets and fired persistent no-targets alerts. The embedded filer S3 gateway's metrics are already scraped via the filer ServiceMonitor.
* comment: drop issue ref |
||
|
|
c2f5db3a02 |
perf(filer.sync): don't serialize descendants behind dir attribute updates (#9079)
* perf(filer.sync): don't serialize descendants behind dir attribute updates The MetadataProcessor treated every in-flight directory job as a subtree barrier: any active dir job at /foo forced all file events under /foo to wait, and because the admit loop runs on the single stream.Recv() goroutine, a stalled descendant also stalled the whole gRPC stream. For large directories this turned every attribute-only dir event (mtime / xattr / chmod bumps) into a full-subtree pinch point. Classify dir jobs as barrier (create / delete / rename) vs non-barrier (filer_pb.IsUpdate on a directory — same parent and same name, i.e. an in-place attribute update). Only barrier dirs block descendants and get blocked by ancestor barrier dirs. Non-barrier dir updates still bump the ancestor descendantCount, so an incoming barrier dir on an ancestor still waits for them — preserving the "delete /a waits for in-flight /a/b update" safety. Tests cover the loosened cases and the preserved barriers: non-barrier update doesn't block a file descendant, barrier create still does, barrier delete still waits for in-flight descendants, and a barrier ancestor still waits for a non-barrier descendant update. * fix(filer.sync): serialize same-path barrier dir jobs against concurrent ops Review (Gemini) flagged that pathConflicts had latent same-path gaps that predated this PR but deserve fixing alongside the dir-conflict loosening: two barrier dir jobs at the same path could run concurrently (e.g. create /a and delete /a), and a file job at the same path as an in-flight barrier dir wasn't blocked either. 
Tighten pathConflicts so that: - an active barrier dir at p blocks every incoming job at p (file, barrier dir, or non-barrier attribute update) — same-path promotions, renames, and delete/create collisions must serialize; - an active file at p blocks incoming files and barrier dirs at p; - non-barrier dir updates at the same path still overlap with each other (attribute bumps are last-writer-wins, intentional). TestDirVsDirConflict and TestFileUnderActiveDirConflict flip their "same path does not conflict" assertions to match. New TestSamePathBarrierSerialization covers all five same-path cases explicitly. * fix(filer.sync): serialize incoming barrier dir against same-path non-barrier update Bug introduced by the previous same-path tightening commit and caught in review (CodeRabbit, critical): a kindNonBarrierDir at /dir1 was not indexed at its own path, so a later kindBarrierDir at /dir1 saw neither activeBarrierDirPaths["/dir1"] nor descendantCount["/dir1"] (the latter only counts strict descendants) and was admitted concurrently with the in-flight attribute update. That violated the "barrier at p serializes all work at p" rule. Track non-barrier dir jobs in a new activeNonBarrierDirPaths map and check it only from the incoming-barrier-dir branch of pathConflicts. The map is deliberately invisible to the ancestor check, so non-barrier updates still don't serialize file descendants — the loosening this PR is about stays intact. Regression test added in TestSamePathBarrierSerialization covers both the admission conflict and the index cleanup on job completion. |
||
|
|
40c1797f8e |
fix(s3): allow anonymous ListBuckets with prefix-scoped List action (#9073)
* fix(s3): allow anonymous ListBuckets with prefix-scoped List action
An anonymous identity holding a prefix-scoped action such as "List:prefix-*" was denied at the auth middleware before ListBucketsHandler could apply the per-bucket visibility check. The middleware called CanDo with an empty bucket, which never matches a scoped action, so every anonymous ListBuckets request returned 403 even though matching buckets should have been visible.
Defer ListBuckets authorization to the handler for the anonymous identity when it actually carries a List action, mirroring the existing behavior for authenticated users. Anonymous identities with no List action continue to be rejected at the global layer, preserving the secure-by-default posture.
Fixes #9072
* refactor(s3): make hasListAction a method on Identity
Addresses PR review: consistent with the existing CanDo/isAdmin methods, and also treats Admin identities as implicitly having List permission. |
||
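A rough sketch of the gate described in the ListBuckets commit above: decide at the middleware only whether the identity carries any List-style action, and leave per-bucket matching to the handler. The `Identity` struct here is a simplified, hypothetical stand-in, and the action strings `"Admin"`, `"List"`, and the `"List:"` prefix are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Identity is a simplified, hypothetical stand-in for the S3 identity type.
type Identity struct {
	Name    string
	Actions []string
}

// hasListAction reports whether the identity holds any List permission:
// a global "List", a scoped "List:<pattern>", or (assumed) an "Admin" action
// that implies List.
func (id *Identity) hasListAction() bool {
	if id == nil {
		return false
	}
	for _, a := range id.Actions {
		if a == "Admin" || a == "List" || strings.HasPrefix(a, "List:") {
			return true
		}
	}
	return false
}

func main() {
	scoped := &Identity{Name: "anonymous", Actions: []string{"List:prefix-*"}}
	readOnly := &Identity{Name: "anonymous", Actions: []string{"Read"}}
	// The first identity would be deferred to the handler, the second rejected early.
	fmt.Println(scoped.hasListAction(), readOnly.hasListAction())
}
```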
|
|
eaf561e86c |
perf(s3): add optional shared in-memory chunk cache for GET (#9069)
Adds the -s3.cacheCapacityMB flag (default 0, disabled) that attaches
an in-memory chunk_cache.ChunkCacheInMemory to the server-wide
ReaderCache introduced in the previous commit. When enabled,
completed chunks are deposited into the shared cache as they are
downloaded, so concurrent and repeat GETs of the same object hit
memory instead of re-fetching chunks from volume servers.
When 0 (the default) the shared ReaderCache still runs — it just
attaches a nil chunk cache, so behaviour matches the previous commit
exactly. No behaviour change for clusters that don't opt in.
Disk-backed TieredChunkCache was evaluated and rejected: its
synchronous SetChunk writes regressed cold reads ~12x on loopback
because the chunk fetchers block on local disk I/O that is *slower*
than the TCP volume-server fetch it is supposed to accelerate.
Memory-only avoids that.
Flag registered in all four S3 flag sites (s3.go, server.go,
filer.go, mini.go) per the comment on command.S3Options. The chunk
size used to convert CacheSizeMB → entry count is encapsulated in
the s3ChunkCacheChunkSizeMB constant so it's easy to grep and
revisit if the filer default chunk size changes.
Measured on weed mini + 1 GiB random object over loopback, single
curl on a presigned URL:
cacheCapacityMB=0 (off): cold ~2900, warm ~2900 MB/s
cacheCapacityMB=4096: cold ~2790, warm ~5050 MB/s (+70%)
|
||
|
|
7a7f220224 |
feat(mount): cap write buffer with -writeBufferSizeMB (#9066)
* feat(mount): cap write buffer with -writeBufferSizeMB Without a bound on the per-mount write pipeline, sustained upload failures (e.g. volume server returning "Volume Size Exceeded" while the master hasn't yet rotated assignments) let sealed chunks pile up across open file handles until the swap directory — by default os.TempDir() — fills the disk. Reported on 4.19 filling /tmp to 1.8 TB during a large rclone sync. Add a global WriteBufferAccountant shared across every UploadPipeline in a mount. Creating a new page chunk (memory or swap) first reserves ChunkSize bytes; when the cap is reached the writer blocks until an uploader finishes and releases, turning swap overflow into natural FUSE-level backpressure instead of unbounded disk growth. The new -writeBufferSizeMB flag (also accepted via fuse.conf) defaults to 0 = unlimited, preserving current behavior. Reserve drops chunksLock while blocking so uploader goroutines — which take chunksLock on completion before calling Release — cannot deadlock, and an oversized reservation on an empty accountant succeeds to avoid single-handle starvation. * fix(mount): plug write-budget leaks in pipeline Shutdown Review on #9066 caught two accounting bugs on the Destroy() path: 1. Writable-chunk leak (high). SaveDataAt() reserves ChunkSize before inserting into writableChunks, but Shutdown() only iterated sealedChunks. Truncate / metadata-invalidation flows call Destroy() (via ResetDirtyPages) without flushing first, so any dirty but unsealed chunks would permanently shrink the global write budget. Shutdown now frees and releases writable chunks too. 2. Double release with racing uploader (medium). Shutdown called accountant.Release directly after FreeReference, while the async uploader goroutine did the same on normal completion — under a Destroy-before-flush race this could underflow the accountant and let later writes exceed the configured cap. 
Move accounting into SealedChunk.FreeReference itself: the refcount-zero transition is exactly-once by construction, so any number of FreeReference calls release the slot precisely once. Add regression tests for the writable-leak and the FreeReference idempotency guarantee. * test(mount): remove sleep-based race in accountant blocking test Address review nits on #9066: - Replace time.Sleep(50ms) proxy for "goroutine entered Reserve" with a started channel the goroutine closes immediately before calling Reserve. Reserve cannot make progress until Release is called, so landed is guaranteed false after the handshake — no arbitrary wait. - Short-circuit WriteBufferAccountant.Used() in unlimited mode for consistency with Reserve/Release, avoiding a mutex round-trip. * test(mount): add end-to-end write-buffer cap integration test Exercises the full write-budget plumbing with a small cap (4 chunks of 64 KiB = 256 KiB) shared across three UploadPipelines fed by six concurrent writers. A gated saveFn models the "volume server rejecting uploads" condition from the original report: no sealed chunk can drain until the test opens the gate. A background sampler records the peak value of accountant.Used() throughout the run. The test asserts: - writers fill the budget and then block on Reserve (Used() stays at the cap while stalled) - Used() never exceeds the configured cap even under concurrent pressure from multiple pipelines - after the gate opens, writers drain to zero - peak observed Used() matches the cap (262144 bytes in this run) While wiring this up, the race detector surfaced a pre-existing data race on UploadPipeline.uploaderCount: the two glog.V(4) lines around the atomic Add sites read the field non-atomically. Capture the new value from AddInt32 and log that instead — one-liner each, no behavioral change. 
* test(fuse): end-to-end integration test for -writeBufferSizeMB Exercise the new write-buffer cap against a real weed mount so CI (fuse-integration.yml) covers the FUSE→upload-pipeline→filer path, not just the in-package unit tests. Uses a 4 MiB cap with 2 MiB chunks so every subtest's total write demand is multiples of the budget and Reserve/Release must drive forward progress for writes to complete. Subtests: - ConcurrentLargeWrites: six parallel 6 MiB files (36 MiB total, ~18 chunk allocations) through the same mount, verifies every byte round-trips. - SingleFileExceedingCap: one 20 MiB file (10 chunks) through a single handle, catching any self-deadlock when the pipeline's own earlier chunks already fill the global budget. - DoesNotDeadlockAfterPressure: final small write with a 30s timeout, catching budget-slot leaks that would otherwise hang subsequent writes on a still-full accountant. Ran locally on Darwin with macfuse against a real weed mini + mount: === RUN TestWriteBufferCap --- PASS: TestWriteBufferCap (1.82s) * test(fuse): loosen write-buffer cap e2e test + fail-fast on hang On Linux CI the previous configuration (-writeBufferSizeMB=4, -concurrentWriters=4 against a 20 MiB single-handle write) deterministically hung the "Run FUSE Integration Tests" step to the 45-minute workflow timeout, while on macOS / macfuse the same test completes in ~2 seconds (see run 24386197483). The Linux hang shows up after TestWriteBufferCap/ConcurrentLargeWrites completes cleanly, then TestWriteBufferCap/SingleFileExceedingCap starts and never emits its PASS line. Change: - Loosen the cap to 16 MiB (8 × 2 MiB chunk slots) and drop the custom -concurrentWriters override. 
The subtests still drive demand well above the cap (32 MiB concurrent, 12 MiB single-handle), so Reserve/Release is still on every chunk-allocation path; the cap just gives the pipeline enough headroom that interactions with the per-file writableChunkLimit and the go-fuse MaxWrite batching don't wedge a single-handle writer on a slow runner. - Wrap every os.WriteFile in a writeWithTimeout helper that dumps every live goroutine on timeout. If this ever re-regresses, CI surfaces the actual stuck goroutines instead of a 45-minute walltime. - Also guard the concurrent-writer goroutines with the same timeout + stack dump. The in-package unit test TestWriteBufferCap_SharedAcrossPipelines remains the deterministic, controlled verification of the blocking Reserve/Release path — this e2e test is now a smoke test for correctness and absence of deadlocks through a real FUSE mount, which is all it should be. * fix: address PR #9066 review — idempotent FreeReference, subtest watchdog, larger single-handle test FreeReference on SealedChunk now early-returns when referenceCounter is already <= 0. The existing == 0 body guard already made side effects idempotent, but the counter itself would still decrement into the negatives on a double-call — ugly and a latent landmine for any future caller that does math on the counter. Make double-call a strict no-op. test(fuse): per-subtest watchdog + larger single-handle test - Add runSubtestWithWatchdog and wrap every TestWriteBufferCap subtest with a 3-minute deadline. Individual writes were already timeout-wrapped but the readback loops and surrounding bookkeeping were not, leaving a gap where a subtest body could still hang. On watchdog fire, every live goroutine is dumped so CI surfaces the wedge instead of a 45-minute walltime. - Bump testLargeFileUnderCap from 12 MiB → 20 MiB (10 chunks) to exceed the 16 MiB cap (8 slots) again and actually exercise Reserve/Release backpressure on a single file handle. 
The earlier e2e hang was under much tighter params (-writeBufferSizeMB=4, -concurrentWriters=4, writable limit 4); with the current loosened config the pressure is gentle and the goroutine-dump-on-timeout safety net is in place if it ever regresses. Declined: adding an observable peak-Used() assertion to the e2e test. The mount runs as a subprocess so its in-process WriteBufferAccountant state isn't reachable from the test without adding a metrics/RPC surface. The deterministic peak-vs-cap verification already lives in the in-package unit test TestWriteBufferCap_SharedAcrossPipelines. Recorded this rationale inline in TestWriteBufferCap's doc comment. * test(fuse): capture mount pprof goroutine dump on write-timeout The previous run (24388549058) hung on LargeFileUnderCap and the test-side dumpAllGoroutines only showed the test process — the test's syscall.Write is blocked in the kernel waiting for FUSE to respond, which tells us nothing about where the MOUNT is stuck. The mount runs as a subprocess so its in-process stacks aren't reachable from the test. Enable the mount's pprof endpoint via -debug=true -debug.port=<free>, allocate the port from the test, and on write-timeout fetch /debug/pprof/goroutine?debug=2 from the mount process and log it. This gives CI the only view that can actually diagnose a write-buffer backpressure deadlock (writer goroutines blocked on Reserve, uploader goroutines stalled on something, etc). Kept fileSize at 20 MiB so the Linux CI run will still hit the hang (if it's genuinely there) and produce an actionable mount-side dump; the alternative — silently shrinking the test below the cap — would lose the regression signal entirely. * review: constructor-inject accountant + subtest watchdog body on main Two PR-#9066 review fixes: 1. NewUploadPipeline now takes the WriteBufferAccountant as a constructor parameter; SetWriteBufferAccountant is removed. 
In practice the previous setter was only called once during newMemoryChunkPages, before any goroutine could touch the pipeline, so there was no actual race — but constructor injection makes the "accountant is fixed at construction time" invariant explicit and eliminates the possibility of a future caller mutating it mid-flight. All three call sites (real + two tests) updated; the legacy TestUploadPipeline passes a nil accountant, preserving backward-compatible unlimited-mode behavior. 2. runSubtestWithWatchdog now runs body on the subtest main goroutine and starts a watcher goroutine that only calls goroutine-safe t methods (t.Log, t.Logf, t.Errorf). The previous version ran body on a spawned goroutine, which meant any require.* or writeWithTimeout t.Fatalf inside body was being called from a non-test goroutine — explicitly disallowed by Go's testing docs. The watcher no longer interrupts body (it can't), so body must return on its own — which it does via writeWithTimeout's internal 90s timeout firing t.Fatalf on (now) the main goroutine. The watchdog still provides the critical diagnostic: on timeout it dumps both test-side and mount-side (via pprof) goroutine stacks and marks the test failed via t.Errorf. * fix(mount): IsComplete must detect coverage across adjacent intervals Linux FUSE caps per-op writes at FUSE_MAX_PAGES_PER_REQ (typically 1 MiB on x86_64) regardless of go-fuse's requested MaxWrite, so a 2 MiB chunk filled by a sequential writer arrives as two adjacent 1 MiB write ops. addInterval in ChunkWrittenIntervalList does not merge adjacent intervals, so the resulting list has two elements {[0,1M], [1M,2M]} — fully covered, but list.size()==2. IsComplete previously returned `list.size() == 1 && list.head.next.isComplete(chunkSize)`, which required a single interval covering [0, chunkSize). 
Under that rule, chunks filled by adjacent writes never reach IsComplete==true, so maybeMoveToSealed never fires, and the chunks sit in writableChunks until FlushAll/close. SaveContent handles the adjacency correctly via its inline merge loop, so uploads work once they're triggered — but IsComplete is the gate that triggers them. This was a latent bug: without the write-buffer cap, the overflow path kicks in at writableChunkLimit (default 128) and force-seals chunks, hiding the leak. #9066's -writeBufferSizeMB adds a tighter global cap, and with 8 slots / 20 MiB test, the budget trips long before overflow. The writer blocks in Reserve, waiting for a slot that never frees because no uploader ever ran — observed in the CI run 24390596623 mount pprof dump: goroutine 1 stuck in WriteBufferAccountant.Reserve → cond.Wait, zero uploader goroutines anywhere in the 89-goroutine dump. Walk the (sorted) interval list tracking the furthest covered offset; return true if coverage reaches chunkSize with no gaps. This correctly handles adjacent intervals, overlapping intervals, and out-of-order inserts. Added TestIsComplete_AdjacentIntervals covering single-write, two adjacent halves (both orderings), eight adjacent eighths, gaps, missing edges, and overlaps. * test(fuse): route mount glog to stderr + dump mount on any write error Run 24392087737 (with the IsComplete fix) no longer hangs on Linux — huge progress. Now TestWriteBufferCap/LargeFileUnderCap fails with 'close(...write_buffer_cap_large.bin): input/output error', meaning a chunk upload failed and pages.lastErr propagated via FlushData to close(). But the mount log in the CI artifact is empty because weed mount's glog defaults to /tmp/weed.* files, which the CI upload step never sees, so we can't tell WHICH upload failed or WHY. 
Add -logtostderr=true -v=2 to MountOptions so glog output goes to the mount process's stderr, which the framework's startProcess redirects into f.logDir/mount.log, which the framework's DumpLogs then prints to the test output on failure. The -v=2 floor enables saveDataAsChunk upload errors (currently logged at V(0)) plus the medium-level write_pipeline/upload traces without drowning the log in V(4) noise. Also dump MOUNT goroutines on any writeWithTimeout error (not just timeout). The IsComplete fix means we now get explicit errors instead of silent hangs, and the goroutine dump at the error moment shows in-flight upload state (pending sealed chunks, retry loops, etc) that a post-failure log alone can't capture. |
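The blocking Reserve/Release budget described in this commit can be sketched as below. This is a simplified model under stated assumptions, not the mount's WriteBufferAccountant: the type name `budget`, its fields, and the `runWriters` driver are hypothetical, and the `peak` field exists only so the sketch can observe the high-water mark.

```go
package main

import (
	"fmt"
	"sync"
)

// budget is a hypothetical, simplified write-buffer accountant.
type budget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	limit int64 // <= 0 means unlimited
	used  int64
	peak  int64 // highest value of used ever observed
}

func newBudget(limit int64) *budget {
	b := &budget{limit: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// Reserve blocks until n bytes fit under the limit. A reservation on an
// empty budget always succeeds, so one oversized request cannot starve.
func (b *budget) Reserve(n int64) {
	if b.limit <= 0 {
		return
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.used > 0 && b.used+n > b.limit {
		b.cond.Wait()
	}
	b.used += n
	if b.used > b.peak {
		b.peak = b.used
	}
}

// Release returns n bytes to the budget and wakes blocked writers.
func (b *budget) Release(n int64) {
	if b.limit <= 0 {
		return
	}
	b.mu.Lock()
	b.used -= n
	b.mu.Unlock()
	b.cond.Broadcast()
}

// runWriters has each writer reserve a chunk and release it again; the
// Release stands in for an uploader finishing that chunk.
func runWriters(b *budget, writers, chunksEach int, chunkSize int64) (peak, final int64) {
	var wg sync.WaitGroup
	for w := 0; w < writers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := 0; c < chunksEach; c++ {
				b.Reserve(chunkSize)
				b.Release(chunkSize)
			}
		}()
	}
	wg.Wait()
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.peak, b.used
}

func main() {
	b := newBudget(4 * 64 * 1024) // four 64 KiB slots, as in the unit test described above
	peak, final := runWriters(b, 6, 50, 64*1024)
	fmt.Println("peak:", peak, "final:", final)
}
```

The sketch leaves out the lock-ordering concern the commit mentions (dropping chunksLock while blocked); it only models the cap itself.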
||
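The IsComplete fix in the commit above walks the sorted interval list and tracks the furthest covered offset. A minimal sketch of that coverage check, assuming half-open `[start, stop)` intervals held in a plain slice rather than the linked list the commit refers to (`interval` and `isComplete` are illustrative names):

```go
package main

import (
	"fmt"
	"sort"
)

// interval is a half-open written range [start, stop) within one chunk.
type interval struct{ start, stop int64 }

// isComplete reports whether the written intervals cover [0, chunkSize)
// without gaps. Adjacent, overlapping, and out-of-order intervals are fine.
func isComplete(written []interval, chunkSize int64) bool {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]interval(nil), written...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].start < sorted[j].start })

	var covered int64 // everything in [0, covered) is known to be written
	for _, iv := range sorted {
		if covered >= chunkSize {
			return true
		}
		if iv.start > covered {
			return false // gap before this interval
		}
		if iv.stop > covered {
			covered = iv.stop
		}
	}
	return covered >= chunkSize
}

func main() {
	const mib = int64(1) << 20
	// Two adjacent 1 MiB writes filling a 2 MiB chunk, as in the Linux FUSE case.
	fmt.Println(isComplete([]interval{{0, mib}, {mib, 2 * mib}}, 2*mib))
}
```

A check that requires exactly one interval would reject the two-write case here, which is the failure mode the commit describes.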
|
|
228ed25a01 |
perf(s3): route GET through ChunkReadAt + per-request ReaderCache (#9068)
perf(s3): route GET through ChunkReadAt + shared ReaderCache
The S3 GET path previously used filer.PrepareStreamContentWithPrefetch,
which hands chunk bytes from the volume-server fetch goroutine to the
consumer through an io.Pipe. io.Pipe is a synchronous rendezvous, so
the prefetch=4 window only overlapped HTTP connection setup — the
actual data bytes still flowed one pipe at a time.
Switch to the same path WebDAV uses (server/webdav_server.go): build
a filer.ChunkReadAt backed by a server-wide filer.ReaderCache.
ReaderCache prefetches whole chunks into []byte buffers, so the
prefetch window translates into real in-flight bytes and the consumer
copies them out as memcpys.
The ReaderCache is server-wide (not per-request) for two reasons:
1. ChunkReadAt.Close() destroys the ReaderCache's downloader map.
With a per-request cache, the defer on the handler would wait for
background chunk downloads that run on context.Background() — so
a client disconnect would block handler cleanup on downloads that
the client no longer wants, tying up goroutines and memory.
2. Concurrent requests for the same object can share in-flight
downloads through the shared downloader map.
No persistent ChunkCache is added in this commit — the ReaderCache is
constructed with a nil *chunk_cache.TieredChunkCache (all its methods
are nil-receiver safe). A follow-up PR wires in an in-memory chunk
cache for cross-request warm hits.
JWT for volume-server requests is generated internally by
util_http.RetriedFetchChunkData from jwtSigningReadKey, so the new
path remains compatible with JWT-protected clusters — this is the
same mechanism the WebDAV and mount read paths have been using.
Measured on weed mini + 1 GiB random object over loopback, cold
cache, single-stream curl on a presigned URL:
before (io.Pipe): 2100-2200 MB/s
after (ChunkReadAt): 2900-3800 MB/s
|
||
|
|
4bcbe9ded3 |
ci(helm): publish chart on tag push
Trigger the helm release workflow automatically on tag pushes so each software release also publishes the chart to gh-pages and the OCI registry at ghcr.io/seaweedfs. workflow_dispatch is kept as a manual fallback.
Refs #6296 |
||
|
|
9859f5fafc |
build(docker): upgrade all Alpine packages in final image (#9070)
build(docker): apply full apk upgrade in final image to pick up security patches Trivy flagged CVE-2026-28390 (libcrypto3/libssl3) on the published image because the final stage only upgraded zlib. Broaden to `apk upgrade --no-cache` so all Alpine security fixes land at build time. |
||
|
|
ad2aa3135c |
build(deps): bump rand from 0.9.2 to 0.9.4 in /seaweedfs-rdma-sidecar/rdma-engine (#9065)
build(deps): bump rand in /seaweedfs-rdma-sidecar/rdma-engine
Bumps [rand](https://github.com/rust-random/rand) from 0.9.2 to 0.9.4.
- [Release notes](https://github.com/rust-random/rand/releases)
- [Changelog](https://github.com/rust-random/rand/blob/0.9.4/CHANGELOG.md)
- [Commits](https://github.com/rust-random/rand/compare/rand_core-0.9.2...0.9.4)
---
updated-dependencies:
- dependency-name: rand
  dependency-version: 0.9.4
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
34b4ecc631 |
fix(mount): serialize hard-link mutations on HardLinkId (#9064)
* fix(mount): serialize hard-link mutations on HardLinkId syncHardLinkSiblings stamps every sibling of a hard-link to authoritativeEntry.HardLinkCounter, and the caller computes that value as entry.HardLinkCounter - 1 (Unlink) or entry.HardLinkCounter + 1 (Link) from a cached entry read before the filer mutation. With concurrent Unlinks on different links of the same file, both callers observe the same pre-decrement counter, the filer's atomic blob decrement lands correctly, but both then stamp their siblings to counter-1 — leaving the mount metacache one higher than the authoritative blob. Serialize Link and Unlink on string(HardLinkId) via a new hardLinkLockTable on WFS, and re-load the entry under the lock so the second caller sees the updated sibling counter its predecessor just wrote before computing its own delta. First-link races (empty HardLinkId on the source) are a separate pre-existing issue and are not addressed here. Full pjdfstest suite still passes (235 files, 8803 tests). * fix(mount): abort on stale pre-lock entry after HardLinkId lock Review follow-up: if maybeLoadEntry fails after acquiring the hardLinkLockTable lock, the prior revision silently fell back to the pre-lock snapshot, reintroducing the stale-base update the lock is meant to prevent. - Unlink: treat fuse.ENOENT as success (the file was already removed by the thread that held the lock before us) and propagate any other error. - Link: abort with the returned status so we never derive the next HardLinkCounter from a stale source entry. * fix(mount): re-resolve Link source alias under HardLinkId lock Review follow-up: Link resolved oldEntryPath from in.Oldnodeid before waiting on the HardLinkId lock. A concurrent Unlink that held the same lock could remove the specific alias we picked pre-lock while leaving other sibling hard links for the same inode intact. The post-lock maybeLoadEntry then returned ENOENT even though the source inode was still reachable. 
Call GetPath(in.Oldnodeid) again under the lock to pick whichever alias is still active, refresh oldParentPath, and only return ENOENT if no sibling survived. |
||
|
|
300e906330 |
admin: report file and delete counts for EC volumes (#9060)
* admin: report file and delete counts for EC volumes The admin bucket size fix (#9058) left object counts at zero for EC-encoded data because VolumeEcShardInformationMessage carried no file count. Billing/monitoring dashboards therefore still under-report objects once a bucket is EC-encoded. Thread file_count and delete_count end-to-end: - Add file_count/delete_count to VolumeEcShardInformationMessage (proto fields 8 and 9) and regenerate master_pb. - Compute them lazily on volume servers by walking the .ecx index once per EcVolume, cache on the struct, and keep the cache in sync inside DeleteNeedleFromEcx (distinguishing live vs already-tombstoned entries so idempotent deletes do not drift the counts). - Populate the new proto fields from EcVolume.ToVolumeEcShardInformationMessage and carry them through the master-side EcVolumeInfo / topology sync. - Aggregate in admin collectCollectionStats, deduping per volume id: every node holding shards of an EC volume reports the same counts, so summing across nodes would otherwise multiply the object count by the number of shard holders. Regression tests cover the initial .ecx walk, live/tombstoned delete bookkeeping (including idempotent and missing-key cases), and the admin dedup path for an EC volume reported by multiple nodes. * ec: include .ecj journal in EcVolume delete count The initial delete count only reflected .ecx tombstones, missing any needle that was journaled in .ecj but not yet folded into .ecx — e.g. on partial recovery. Expand initCountsLocked to take the union of .ecx tombstones and .ecj journal entries, deduped by needle id, so: - an id that is both tombstoned in .ecx and listed in .ecj counts once - a duplicate .ecj entry counts once - an .ecj id with a live .ecx entry is counted as deleted (not live) - an .ecj id with no matching .ecx entry is still counted Covered by TestEcVolumeFileAndDeleteCountEcjUnion. 
* ec: report delete count authoritatively and tombstone once per delete Address two issues with the previous EcVolume file/delete count work: 1. The delete count was computed lazily on first heartbeat and mixed in a .ecj-union fallback to "recover" partial state. That diverged from how regular volumes report counts (always live from the needle map) and had drift cases when .ecj got reconciled. Replace with an eager walk of .ecx at NewEcVolume time, maintained incrementally on every DeleteNeedleFromEcx call. Semantics now match needle_map_metric: FileCount is the total number of needles ever recorded in .ecx (live + tombstoned), DeleteCount is the tombstones — so live = FileCount - DeleteCount. Drop the .ecj-union logic entirely. 2. A single EC needle delete fanned out to every node holding a replica of the primary data shard and called DeleteNeedleFromEcx on each, which inflated the per-volume delete total by the replica factor. Rewrite doDeleteNeedleFromRemoteEcShardServers to try replicas in order and stop at the first success (one tombstone per delete), and only fall back to other shards when the primary shard has no home (ErrEcShardMissing sentinel), not on transient RPC errors. Admin aggregation now folds EC counts correctly: FileCount is deduped per volume id (every shard holder has an identical .ecx) and DeleteCount is summed across nodes (each delete tombstones exactly one node). Live object count = deduped FileCount - summed DeleteCount. Tests updated to match the new semantics: - EC volume counts seed FileCount as total .ecx entries (live + tombstoned), DeleteCount as tombstones. - DeleteNeedleFromEcx keeps FileCount constant and increments DeleteCount only on live->tombstone transitions. - Admin dedup test uses distinct per-node delete counts (5 + 3 + 2) to prove they're summed, while FileCount=100 is applied once. 
* ec: test fixture uses real vid; admin warns on skewed ec counts - writeFixture now builds the .ecx/.ecj/.ec00/.vif filenames from the actual vid passed in, instead of hardcoding "_1". The existing tests all use vid=1 so behaviour is unchanged, but the helper no longer silently diverges from its documented parameter. - collectCollectionStats logs a glog warning when an EC volume's summed delete count exceeds its deduped file count, surfacing the anomaly (stale heartbeat, counter drift, etc.) instead of silently dropping the volume from the object count. * ec: derive file/delete counts from .ecx/.ecj file sizes seedCountsFromEcx walked the full .ecx index at volume load, which is wasted work: .ecx has fixed-size entries (NeedleMapEntrySize) and .ecj has fixed-size deletion records (NeedleIdSize), so both counts are pure file-size arithmetic. fileCount = ecxFileSize / NeedleMapEntrySize deleteCount = ecjFileSize / NeedleIdSize Rip out the cached counters, countsLock, seedCountsFromEcx, and the recordDelete helper. Track ecjFileSize directly on the EcVolume struct, seed it from Stat() at load, and bump it on every successful .ecj append inside DeleteNeedleFromEcx under ecjFileAccessLock. Skip the .ecj write entirely when the needle is already tombstoned so the derived delete count stays idempotent on repeat deletes. Heartbeats now compute counts in O(1). Tests updated: the initial fixture pre-populates .ecj with two ids to verify the file-size derivation end-to-end, and the delete test keeps its idempotent-re-delete / missing-needle invariants (unchanged externally, now enforced by the early return rather than a cache guard). * ec: sync Rust volume server with Go file/delete count semantics Mirror the Go-side EC file/delete count work in the Rust volume server so mixed Go/Rust clusters report consistent bucket object counts in the admin dashboard. 
- Add file_count (8) and delete_count (9) to the Rust copy of VolumeEcShardInformationMessage (seaweed-volume/proto/master.proto). - EcVolume gains ecj_file_size, seeded from the journal's metadata on open and bumped inside journal_delete on every successful append. - file_and_delete_count() returns counts derived in O(1) from ecx_file_size / NEEDLE_MAP_ENTRY_SIZE and ecj_file_size / NEEDLE_ID_SIZE, matching Go's FileAndDeleteCount. - to_volume_ec_shard_information_messages populates the new proto fields instead of defaulting them to zero. - mark_needle_deleted_in_ecx now returns a DeleteOutcome enum (NotFound / AlreadyDeleted / Tombstoned) so journal_delete can skip both the .ecj append and the size bump when the needle is missing or already tombstoned, keeping the derived delete_count idempotent on repeat or no-op deletes. - Rust's EcVolume::new no longer replays .ecj into .ecx on load. Go's RebuildEcxFile is only called from specific decode/rebuild gRPC handlers, not on volume open, and replaying on load was hiding the deletion journal from the new file-size-derived delete counter. rebuild_ecx_from_journal is kept as dead_code for future decode paths that may want the same replay semantics. Also clean up the Go FileAndDeleteCount to drop unnecessary runtime guards against zero constants — NeedleMapEntrySize and NeedleIdSize are compile-time non-zero. test_ec_volume_journal updated to pre-populate the .ecx with the needles it deletes, and extended to verify that repeat and missing-id deletes do not drift the derived counts. * ec: document enterprise-reserved proto field range on ec shard info Both OSS master.proto copies now note that fields 10-19 are reserved for future upstream additions while 20+ are owned by the enterprise fork. Enterprise already pins data_shards/parity_shards at 20/21, so keeping OSS additions inside 8-19 avoids wire-level collisions for mixed deployments. 
* ec(rust): resolve .ecx/.ecj helpers from ecx_actual_dir ecx_file_name() and ecj_file_name() resolved from self.dir_idx, but new() opens the actual files from ecx_actual_dir (which may fall back to the data dir when the idx dir does not contain the index). After a fallback, read_deleted_needles() and rebuild_ecx_from_journal() would read/rebuild the wrong (nonexistent) path while heartbeats reported counts from the file actually in use — silently dropping deletes. Point idx_base_name() at ecx_actual_dir, which is initialized to dir_idx and only diverges after a successful fallback, so every call site agrees with the file new() has open. The pre-fallback call in new() (line 142) still returns the dir_idx path because ecx_actual_dir == dir_idx at that point. Update the destroy() sweep to build the dir_idx cleanup paths explicitly instead of leaning on the helpers, so post-fallback stale files in the idx dir are still removed. * ec: reset ecj size after rebuild; rollback ecx tombstone on ecj failure Two EC delete-count correctness fixes applied symmetrically to Go and Rust volume servers. 1. rebuild_ecx_from_journal (Rust) now sets ecj_file_size = 0 after recreating the empty journal, matching the on-disk truth. Previously the cached size still reflected the pre-rebuild journal and file_and_delete_count() would keep reporting stale delete counts. The Go side has no equivalent bug because RebuildEcxFile runs in an offline helper that does not touch an EcVolume struct. 2. DeleteNeedleFromEcx / journal_delete used to tombstone the .ecx entry before writing the .ecj record. If the .ecj append then failed, the needle was permanently marked deleted but the heartbeat-reported delete_count never advanced (it is derived from .ecj file size), and a retry would see AlreadyDeleted and early- return, leaving the drift permanent. 
Both languages now capture the entry's file offset and original size bytes during the mark step, attempt the .ecj append, and on failure roll the .ecx tombstone back by writing the original size bytes at the known offset. A rollback that itself errors is logged (glog / tracing) but cannot re-sync the files — this is the same failure mode a double disk error would produce, and is unavoidable without a full on-disk transaction log. Go: wrap MarkNeedleDeleted in a closure that captures the file offset into an outer variable, then pass the offset + oldSize to the new rollbackEcxTombstone helper on .ecj seek/write errors. Rust: DeleteOutcome::Tombstoned now carries the size_offset and a [u8; SIZE_SIZE] copy of the pre-tombstone size field. journal_delete destructures on Tombstoned and calls restore_ecx_size on .ecj append failure. * test(ec): widen admin /health wait to 180s for cold CI TestEcEndToEnd starts master, 14 volume servers, filer, 2 workers and admin in sequence, then waited only 60s for admin's HTTP server to come up. On cold GitHub runners the tail of the earlier subprocess startups eats most of that budget and the wait occasionally times out (last hit on run 24374773031). The local fast path is still ~20s total, so the bump only extends the timeout ceiling, not the happy path. * test(ec): fork volume servers in parallel in TestEcEndToEnd startWeed is non-blocking (just cmd.Start()), so the per-process fork + mkdir + log-file-open overhead for 14 volume servers was serialized for no reason. On cold CI disks that overhead stacks up and eats into the subsequent admin /health wait, which is how run 24374773031 flaked. Wrap the volume-server loop in a sync.WaitGroup and guard runningCmds with a mutex so concurrent appends are safe. startWeed still calls t.Fatalf on failure, which is fine from a goroutine for a fatal test abort; the fail-fast isn't something we rely on for precise ordering. 
* ec: fsync ecx before ecj, truncate on failure, harden rebuild Four correctness fixes covering both volume servers. 1. Durability ordering (Go + Rust). After marking the .ecx tombstone we now fsync .ecx before touching .ecj, so a crash between the two files cannot leave the journal with an entry for a needle whose tombstone is still sitting in page cache. Once the fsync returns, the tombstone is the source of truth: reads see "deleted", delete_count may under-count by one (benign, idempotent retries) but never over-reports. If the fsync itself fails we restore the original size bytes and surface the error. The .ecj append is then followed by its own Sync so the reported delete_count matches the on-disk journal once the write returns. 2. .ecj truncation on append failure. write_all may have extended the journal on disk before sync_all / Sync errors out, leaving the cached ecj_file_size out of sync with the physical length and drifting delete_count permanently after restart. Both languages now capture the pre-append size, truncate the file back via set_len / Truncate on any write or sync failure, and only then restore the .ecx tombstone. Truncation errors are logged — same-fd length resets cannot realistically fail — but cannot themselves re-sync the files. 3. Atomic rebuild_ecx_from_journal (Rust, dead code today but wired up on any future decode path). Previously a failed mark_needle_deleted_in_ecx call was swallowed with `let _ = ...` and the journal was still removed, silently losing tombstones. We now bubble up any non-NotFound error, fsync .ecx after the whole replay succeeds, and only then drop and recreate .ecj. NotFound is still ignored (expected race between delete and encode). 4. Missing-.ecx hardening (Rust). mark_needle_deleted_in_ecx used to return Ok(NotFound) when self.ecx_file was None, hiding a closed or corrupt volume behind what looks like an idempotent no-op. It now returns an io::Error carrying the volume id so callers (e.g. 
journal_delete) fail loudly instead. Existing Go and Rust EC test suites stay green.
* ec: make .ecx immutable at runtime; track deletes in memory + .ecj
Refactors both volume servers so the sealed sorted .ecx index is never mutated during normal operation. Runtime deletes are committed to the .ecj deletion journal and tracked in an in-memory deleted-needle set; read-path lookups consult that set to mask out deleted ids on top of the immutable .ecx record. Mirrors the intended design on both Go and Rust sides.
EcVolume gains a `deletedNeedles` / `deleted_needles` set seeded from .ecj in NewEcVolume / EcVolume::new.
DeleteNeedleFromEcx / journal_delete:
1. Looks the needle up read-only in .ecx.
2. Missing needle -> no-op.
3. Pre-existing .ecx tombstone (from a prior decode/rebuild) -> mirror into the in-memory set, no .ecj append.
4. Otherwise append the id to .ecj, fsync, and only then publish the id into the set. A partial write is truncated back to the pre-append length so the on-disk journal and the in-memory set cannot drift.
FindNeedleFromEcx / find_needle_from_ecx now return TombstoneFileSize when the id is in the in-memory set, even though the bytes on disk still show the original size.
FileAndDeleteCount:
    fileCount = .ecx size / NeedleMapEntrySize (unchanged)
    deleteCount = len(deletedNeedles) (was: .ecj size / NeedleIdSize)
The RebuildEcxFile / rebuild_ecx_from_journal decode-time helpers still fold .ecj into .ecx; that is the one place tombstones land in the physical index, and it runs offline on closed files. Rust's rebuild helper now also clears the in-memory set when it succeeds.
Dead code removed on the Rust side: `DeleteOutcome`, `mark_needle_deleted_in_ecx`, `restore_ecx_size`. Go drops the runtime `rollbackEcxTombstone` path. Neither helper was needed once .ecx stopped being a runtime mutation target.
TestEcVolumeSyncEnsuresDeletionsVisible (issue #7751) is rewritten as TestEcVolumeDeleteDurableToJournal, which exercises the full durability chain: delete -> .ecj fsync -> FindNeedleFromEcx masks via the in-memory set -> raw .ecx bytes are *unchanged* -> Close + RebuildEcxFile folds the journal into .ecx -> raw bytes now show the tombstone, as CopyFile in the decode path expects. |
||
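The final counting rule described in the EC commit above (file count from the .ecx size, delete count from the in-memory deleted set) can be modelled roughly as below. This is a minimal sketch, not the SeaweedFS code: the `ecCounts` type, its fields, and the 16-byte entry size are assumptions made for illustration, and the .ecj append and fsync that precede publishing an id into the set are left out.

```go
package main

import "fmt"

// needleMapEntrySize is an assumed fixed .ecx entry size in bytes (illustration only).
const needleMapEntrySize = 16

// ecCounts is a hypothetical stand-in for the counting state of an EC volume.
type ecCounts struct {
	ecxFileSize    int64               // size of the immutable .ecx index in bytes
	deletedNeedles map[uint64]struct{} // in-memory set of deleted needle ids
}

// markDeleted records a delete and reports whether it was new.
// Repeat deletes are no-ops, so the derived delete count stays idempotent.
func (c *ecCounts) markDeleted(id uint64) bool {
	if _, ok := c.deletedNeedles[id]; ok {
		return false
	}
	c.deletedNeedles[id] = struct{}{}
	return true
}

// fileAndDeleteCount derives both counts in O(1), without walking the index.
func (c *ecCounts) fileAndDeleteCount() (fileCount, deleteCount uint64) {
	return uint64(c.ecxFileSize / needleMapEntrySize), uint64(len(c.deletedNeedles))
}

func main() {
	c := &ecCounts{
		ecxFileSize:    100 * needleMapEntrySize,
		deletedNeedles: map[uint64]struct{}{},
	}
	c.markDeleted(7)
	c.markDeleted(7) // repeat delete, ignored
	c.markDeleted(9)
	files, deletes := c.fileAndDeleteCount()
	fmt.Println(files, deletes, files-deletes) // prints "100 2 98"
}
```

The live object count is simply the file count minus the delete count, which is why a repeat delete must not change either number.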
|
|
64af80c78d |
fix(mount): stop double-applying umask in Mkdir (#9063)
Mkdir was masking in.Mode with wfs.option.Umask on top of the kernel's VFS umask pass, so a caller with umask=0 who requested mkdir(0777) got 0755 (0777 & ~022). Create and Symlink don't apply this second pass — Mkdir was the odd one out. The resulting dirs had fewer write bits than the caller asked for, which broke cross-user rename permission checks (kernel may_delete rejects with EACCES when the parent lacks o+w even though the caller explicitly requested it) and blocked pjdfstest tests/rename/21.t and its cascading checks. Drop the extra umask so Mkdir trusts in.Mode exactly like Create. The CLI -umask flag still covers the internal cache dirs that the mount creates for itself via os.MkdirAll; only the user-facing Mkdir path changes. Unblocks tests/rename/21.t — full pjdfstest suite is now 236 files / 8819 tests, all PASS, and known_failures.txt is empty. |
||
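The arithmetic behind the Mkdir fix above is easy to check by hand. The sketch below only illustrates the bit masking; `applyUmask` is a hypothetical helper, not a function from the mount code.

```go
package main

import (
	"fmt"
	"os"
)

// applyUmask clears the permission bits named in umask from mode.
func applyUmask(mode, umask os.FileMode) os.FileMode {
	return mode &^ umask
}

func main() {
	requested := os.FileMode(0o777)

	// The caller's umask is 0, so the kernel's own pass leaves the mode unchanged.
	afterKernel := applyUmask(requested, 0)

	// A second pass with a mount-level umask of 022 strips the group/other write bits.
	afterSecondPass := applyUmask(afterKernel, 0o022)

	fmt.Printf("%04o %04o\n", uint32(afterKernel), uint32(afterSecondPass)) // prints "0777 0755"
}
```

Dropping the second pass means the directory keeps the 0777 the caller asked for.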
|
|
c8433a19f0 |
fix(mount): propagate hard-link nlink changes to sibling cache entries (#9062)
* fix(mount): propagate hard-link nlink changes to sibling cache entries weed mount serves stat from its local metacache, and the kernel also caches inode attrs from FUSE replies. When a hard link was unlinked or a new link added, the filer updated the shared HardLink blob correctly, but the sibling link entries in the mount's metacache still carried the stale HardLinkCounter and the kernel attr cache on the shared inode was not invalidated. Subsequent lstat on any sibling link returned the old nlink — pjdfstest link/00.t caught this after `unlink n0` and on `link n1 n2` stating n0. Walk every path bound to the hard-linked inode via a new InodeToPath.GetAllPaths, rewrite each cached sibling's HardLinkCounter and ctime to the authoritative new value, and call fuseServer.InodeNotify to invalidate the kernel attr cache for the shared inode. Applied from both Link (bump) and Unlink (decrement). Unblocks tests/link/00.t and tests/unlink/00.t in pjdfstest; full suite (235 files, 8803 tests) passes end-to-end with no regressions. * fix(mount): harden hard-link sibling sync against nil Attributes and id mismatch Review follow-ups: - Unlink: guard entry.Attributes for nil before reading Inode, with a fallback to inodeToPath.GetInode resolved before RemovePath. Fold the duplicated RemovePath into a single call. - syncHardLinkSiblings: skip siblings whose HardLinkId does not match the authoritative entry. The shared-inode invariant normally guarantees a match, but a transient mismatch (e.g. a rename replaced one of the paths) would otherwise stamp an unrelated entry with the wrong counter. Full pjdfstest suite still passes (235 files, 8803 tests). |
||
|
|
f00cbe4a6d |
test(vacuum): fix flaky TestVacuumIntegration across multiple volumes (#9061)
* test(vacuum): fix flaky TestVacuumIntegration across multiple volumes The test assumed all uploaded files landed in a single volume and tracked only the last file's volume id. With -volumeSizeLimitMB 10 and 16x500KB files, the master can spread uploads across volumes, so the tracked id could point to a volume with no deletes and thus 0% garbage — causing verify_garbage_before_vacuum to fail even though vacuum ran correctly on the other volume. Track the set of volumes where deletes actually occurred and verify garbage/cleanup against all of them. Also add a short retry loop on the pre-vacuum check to absorb heartbeat jitter. * test(vacuum): require all dirty volumes ready; retry cleanup check Address review feedback: the pre-vacuum check now waits until every volume in dirtyVolumes reports garbage > threshold (not just the first), and the post-vacuum cleanup check retries per-volume with a deadline instead of relying on a fixed sleep, since vacuum + heartbeat reporting is asynchronous. * test(vacuum): deterministic dirty volumes order, aggregate cleanup failures - Sort dirtyVolumes after building from the set so logs and iteration are stable across runs. - In verify_cleanup_after_vacuum, track per-volume failure reasons in a map and report all still-failing volumes on timeout instead of only the last one that happened to be written to lastErr. |
||
|
|
e0c361ec77 | fix(weed/worker/tasks): log dropped errors (#9057) | ||
|
|
8f2a3d92bb |
docker: upgrade libcrypto3/libssl3 to clear Trivy HIGH (CVE-2026-28390) (#9059)
* docker: upgrade libcrypto3/libssl3 to clear Trivy HIGH Trivy gate on ghcr.io/seaweedfs/seaweedfs:latest-amd64 flagged CVE-2026-28390 in libcrypto3 3.5.5-r0 (fixed in 3.5.6-r0) on the alpine 3.23.3 base. Add libcrypto3/libssl3 to the existing apk upgrade so rebuilt images pick up the patched openssl without waiting for a new alpine base tag. * docker: apk add libcrypto3/libssl3 so they install at patched version Per review, apk upgrade <pkg> is a no-op when the package isn't already installed. libcrypto3/libssl3 come in transitively via curl, so list them in apk add to guarantee installation at the latest (patched) version from the alpine repo. |
||
|
|
ef77df6141 |
admin: include EC volumes in bucket size reporting (#9058)
* admin: include EC volumes in bucket size reporting The Object Store buckets page computed per-collection size by iterating only regular volumes, so once a bucket's data was EC-encoded it silently disappeared from the reported size — breaking usage-based billing. Walk EcShardInfos alongside VolumeInfos in collectCollectionStats: add raw shard bytes to PhysicalSize, and the parity-stripped value (shardBytes * DataShardsCount / TotalShardsCount) to LogicalSize, matching the normalization used by `weed shell` cluster.status. * admin: derive EC logical size from shard bitmap, not constants Use ShardsInfoFromVolumeEcShardInformationMessage + MinusParityShards to sum actual data-shard bytes instead of scaling raw bytes by the DataShardsCount/TotalShardsCount ratio. Keeps the data/parity split encapsulated in the erasure_coding package and is exact when shard sizes differ (e.g. last shard). * admin: regression test for EC shard size aggregation Cover the uneven-tail-shard case (data shard 9 < 1000 bytes) and the empty-collection-name path to pin PhysicalSize/LogicalSize behavior for collectCollectionStats against future changes. |
||
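The difference between the two logical-size calculations in the commit above shows up as soon as shard sizes are uneven. The following sketch assumes a 10+4 shard layout and a plain map of shard id to size; both function names are hypothetical and the real code goes through the erasure_coding package instead.

```go
package main

import "fmt"

// Assumed 10 data + 4 parity layout, for illustration only.
const (
	dataShards  = 10
	totalShards = 14
)

// logicalByRatio scales the raw shard bytes by dataShards/totalShards.
func logicalByRatio(shardSizes map[int]uint64) uint64 {
	var physical uint64
	for _, size := range shardSizes {
		physical += size
	}
	return physical * dataShards / totalShards
}

// logicalByDataShards sums only the shards whose id marks them as data shards.
func logicalByDataShards(shardSizes map[int]uint64) uint64 {
	var logical uint64
	for id, size := range shardSizes {
		if id < dataShards {
			logical += size
		}
	}
	return logical
}

func main() {
	sizes := make(map[int]uint64)
	for id := 0; id < totalShards; id++ {
		sizes[id] = 1000
	}
	sizes[9] = 400 // uneven tail: the last data shard is smaller

	fmt.Println(logicalByRatio(sizes), logicalByDataShards(sizes)) // prints "9571 9400"
}
```

With equal shard sizes the two agree; with a short tail shard the ratio overestimates, which is the case the regression test pins down.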
|
|
50f25bb5cd | 4.20 | ||
|
|
512912cbb8 | Update plugin_templ.go | ||
|
|
8d6c5cbb58 |
build(deps): bump org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2 in /test/kafka/kafka-client-loadtest/tools (#9056)
build(deps): bump org.apache.kafka:kafka-clients Bumps org.apache.kafka:kafka-clients from 3.9.1 to 3.9.2. --- updated-dependencies: - dependency-name: org.apache.kafka:kafka-clients dependency-version: 3.9.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
f3151900e4 |
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 from 1.98.0 to 1.99.0 (#9053)
build(deps): bump github.com/aws/aws-sdk-go-v2/service/s3 Bumps [github.com/aws/aws-sdk-go-v2/service/s3](https://github.com/aws/aws-sdk-go-v2) from 1.98.0 to 1.99.0. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/service/s3/v1.98.0...service/s3/v1.99.0) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/service/s3 dependency-version: 1.99.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
7aaa431bb4 |
s3api: prune bucket-scoped IAM actions on DeleteBucket (#9054)
* s3api: prune bucket-scoped IAM actions on DeleteBucket DeleteBucket removed the bucket directory and collection but left behind any identity actions configured via s3.configure that were scoped to that bucket (e.g. Read:bucket, Write:bucket/prefix), leaving stale auth metadata that users expected to be cleaned up along with the bucket. After a successful delete, strip actions whose resource is exactly the bucket or a prefix under it, save via the credential manager, and let the existing filer metadata subscription fan the reload out to every S3 server. Wildcarded resources and global actions are preserved since they may cover other buckets; static identities are left untouched. Fixes #5310 * s3api: address review feedback on bucket IAM prune - Apply per-identity updates via credentialManager.UpdateUser instead of a full LoadConfiguration/SaveConfiguration round-trip, so the prune no longer clobbers concurrent IAM edits made by s3.configure or the IAM API during a DeleteBucket. - Use a 30s bounded background context for the post-delete cleanup so it survives client disconnect — the bucket is already gone by then and this is best-effort bookkeeping. - Skip static identities via IsStaticIdentity, since the credential store never persists them and UpdateUser would return NotFound. |
||
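The pruning rule from the DeleteBucket commit above can be sketched as a pure filter over action strings. This assumes actions look like `Verb` (global) or `Verb:resource`, as in the `Read:bucket` and `Write:bucket/prefix` examples; `pruneBucketActions` is a hypothetical name, and treating any resource containing `*` as a wildcard to keep is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// pruneBucketActions returns the actions that survive deletion of bucket.
// Global actions and wildcarded resources are kept, since they may cover other buckets.
func pruneBucketActions(actions []string, bucket string) []string {
	kept := make([]string, 0, len(actions))
	for _, action := range actions {
		_, resource, scoped := strings.Cut(action, ":")
		if scoped && !strings.Contains(resource, "*") &&
			(resource == bucket || strings.HasPrefix(resource, bucket+"/")) {
			continue // scoped to the deleted bucket or to a prefix under it
		}
		kept = append(kept, action)
	}
	return kept
}

func main() {
	actions := []string{"Admin", "Read:photos", "Write:photos/raw", "List:photos-archive", "Read:photo*"}
	fmt.Println(pruneBucketActions(actions, "photos")) // prints "[Admin List:photos-archive Read:photo*]"
}
```

Matching on `bucket + "/"` rather than on the bare bucket name keeps `photos-archive` from being pruned when `photos` is deleted.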
|
|
8049fcc516 |
correctly namespace all define calls (#9044)
* correctly namespace all `define` calls * fix unrelated issue: wrong dict call to gen sftp passwd |
||
|
|
06cbd2acdf |
build(deps): bump golang.org/x/net from 0.52.0 to 0.53.0 (#9052)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.52.0 to 0.53.0. - [Commits](https://github.com/golang/net/compare/v0.52.0...v0.53.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-version: 0.53.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
2ee6907c19 |
Update Helm Chart docs with instructions for deploying RocksDB variant (#9006)
* Update documentation for helm chart, with instructions on how to deploy the RocksDB image tag variant. Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev> Nit: Update example to make it clearer that the seaweedfs version needs to be replaced. Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev> * docs(helm): clarify RocksDB variant instructions - Note that filer persistence (enablePVC) is required so RocksDB metadata survives restarts. - Explain why master/volume also use the rocksdb-tagged image. - Tighten wording around WEED_LEVELDB2_ENABLED override. --------- Signed-off-by: Mark McCormick <mark.mccormick@chainguard.dev> Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
cc5b246973 |
build(deps): bump github.com/aws/aws-sdk-go-v2/config from 1.32.13 to 1.32.14 (#9051)
build(deps): bump github.com/aws/aws-sdk-go-v2/config Bumps [github.com/aws/aws-sdk-go-v2/config](https://github.com/aws/aws-sdk-go-v2) from 1.32.13 to 1.32.14. - [Release notes](https://github.com/aws/aws-sdk-go-v2/releases) - [Commits](https://github.com/aws/aws-sdk-go-v2/compare/config/v1.32.13...config/v1.32.14) --- updated-dependencies: - dependency-name: github.com/aws/aws-sdk-go-v2/config dependency-version: 1.32.14 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
36ae7e04b5 |
build(deps): bump github.com/apache/cassandra-gocql-driver/v2 from 2.0.0 to 2.1.0 (#9047)
build(deps): bump github.com/apache/cassandra-gocql-driver/v2 Bumps [github.com/apache/cassandra-gocql-driver/v2](https://github.com/apache/cassandra-gocql-driver) from 2.0.0 to 2.1.0. - [Release notes](https://github.com/apache/cassandra-gocql-driver/releases) - [Changelog](https://github.com/apache/cassandra-gocql-driver/blob/trunk/CHANGELOG.md) - [Commits](https://github.com/apache/cassandra-gocql-driver/compare/v2.0.0...v2.1.0) --- updated-dependencies: - dependency-name: github.com/apache/cassandra-gocql-driver/v2 dependency-version: 2.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
46c0e56bb8 |
build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 from 3.125.3 to 3.134.0 (#9048)
build(deps): bump github.com/ydb-platform/ydb-go-sdk/v3 Bumps [github.com/ydb-platform/ydb-go-sdk/v3](https://github.com/ydb-platform/ydb-go-sdk) from 3.125.3 to 3.134.0. - [Release notes](https://github.com/ydb-platform/ydb-go-sdk/releases) - [Changelog](https://github.com/ydb-platform/ydb-go-sdk/blob/master/CHANGELOG.md) - [Commits](https://github.com/ydb-platform/ydb-go-sdk/compare/v3.125.3...v3.134.0) --- updated-dependencies: - dependency-name: github.com/ydb-platform/ydb-go-sdk/v3 dependency-version: 3.134.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
baa65c3823 |
build(deps): bump docker/build-push-action from 7.0.0 to 7.1.0 (#9049)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 7.0.0 to 7.1.0. - [Release notes](https://github.com/docker/build-push-action/releases) - [Commits](https://github.com/docker/build-push-action/compare/v7...v7.1.0) --- updated-dependencies: - dependency-name: docker/build-push-action dependency-version: 7.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
f4bfe60549 |
build(deps): bump softprops/action-gh-release from 2 to 3 (#9050)
Bumps [softprops/action-gh-release](https://github.com/softprops/action-gh-release) from 2 to 3. - [Release notes](https://github.com/softprops/action-gh-release/releases) - [Changelog](https://github.com/softprops/action-gh-release/blob/master/CHANGELOG.md) - [Commits](https://github.com/softprops/action-gh-release/compare/v2...v3) --- updated-dependencies: - dependency-name: softprops/action-gh-release dependency-version: '3' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
67a2810d2d |
Export start_time_seconds metrics on both master & volume servers. (#9046)
These are to be used to track uptimes. See https://github.com/seaweedfs/seaweedfs/issues/8535 for details. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch> |
||
|
|
80db692728 | fix(weed/util/chunk_cache): fix dropped errors (#9042) | ||
|
|
ae08e77979 |
fix(scheduler): give worker tasks a real per-attempt execution deadline (#9041)
* fix(scheduler): give worker tasks a real per-attempt execution deadline The plugin scheduler derived the per-attempt execution deadline as DetectionTimeoutSeconds * 2, which capped every worker task at twice the cluster-scan budget regardless of actual work. For volume_balance batches this was 240s — far too short for 20 large volume copies, so every attempt died at "context deadline exceeded" and all in-flight sub-RPCs surfaced as "context canceled". Retries restarted from move 1 and hit the same wall. Add an explicit ExecutionTimeoutSeconds field to the plugin proto and make each handler declare its own baseline (1800s for vacuum, balance, EC; 3600s for iceberg). Size-aware handlers also emit an estimated_runtime_seconds parameter on each proposal so the scheduler extends the per-attempt deadline based on actual workload: - volume_balance batch: max(largest single move, total / concurrency) at 5 min/GB, so a skewed batch with one big volume isn't averaged away. - volume_balance single, vacuum (already), erasure_coding (10 min/GB), ec_balance (5 min/GB): per-volume budgets. admin_script and iceberg keep the configurable handler default since their workloads are opaque to the detector. * fix(scheduler): apply descriptor defaults to existing persisted configs The previous commit added execution_timeout_seconds to the proto and each handler's descriptor defaults, but two paths still left existing deployments broken: 1. deriveSchedulerAdminRuntime returned stored AdminRuntime configs as-is. Persisted configs from older versions have no execution_timeout_seconds, so the scheduler fell back to the 90s default — worse than the prior 240s behavior. Overlay descriptor defaults for any zero numeric fields when loading. 2. The admin form did not round-trip execution_timeout_seconds, so a normal save would clear it back to zero. 
Add the input field, the fillAdminSettings/collectAdminSettings hooks, and as defense in depth reapply descriptor defaults in UpdatePluginJobTypeConfigAPI before persisting so a stale form can never silently clobber a baseline. * fix(volume_balance): account for partial scheduling rounds in batch estimate With N moves and C slots, the busiest slot processes ceil(N/C) moves, not N/C. Dividing total seconds by C underestimates wall-clock time whenever N is not a multiple of C — e.g. 6 moves at concurrency 5 needs 2 rounds, not 1.2. Use avg * ceil(N/C) so partial rounds are counted as full ones. * fix(volume_balance): scale minBudget per wave instead of per move Orchestration overhead (setup/teardown for the parallel move runner) happens once per wave, not once per move. Use numRounds*60 as the floor instead of len(moves)*60 so the minimum doesn't inflate linearly with batch size when individual moves are tiny. |
||
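The batch estimate described in the volume_balance commits above combines three ideas: count partial rounds as full ones, never estimate below the largest single move, and apply a floor of 60 seconds per round. A rough sketch follows; how the real handler combines these terms is an assumption here, as are the function name and the whole-GB sizes.

```go
package main

import "fmt"

// secondsPerGB reflects the 5 min/GB budget quoted for volume_balance.
const secondsPerGB = 5 * 60

// estimateBatchSeconds is a hypothetical estimator for a batch of moves.
func estimateBatchSeconds(moveSizesGB []int64, concurrency int) int64 {
	if len(moveSizesGB) == 0 {
		return 0
	}
	if concurrency < 1 {
		concurrency = 1
	}
	var total, largest int64
	for _, gb := range moveSizesGB {
		seconds := gb * secondsPerGB
		total += seconds
		if seconds > largest {
			largest = seconds
		}
	}
	n, c := int64(len(moveSizesGB)), int64(concurrency)
	numRounds := (n + c - 1) / c // ceil(N/C): the busiest slot runs this many moves
	estimate := (total / n) * numRounds
	if largest > estimate {
		estimate = largest // a skewed batch is bounded by its biggest volume
	}
	if minBudget := numRounds * 60; estimate < minBudget {
		estimate = minBudget // per-round orchestration floor
	}
	return estimate
}

func main() {
	fmt.Println(estimateBatchSeconds([]int64{2, 2, 2, 2, 2, 2}, 5)) // 6 moves, 2 rounds: prints 1200
	fmt.Println(estimateBatchSeconds([]int64{1, 1, 1, 1, 30}, 5))   // skewed batch: prints 9000
}
```

Dividing the total by the concurrency would give 720 seconds for the first batch, which is less than the two sequential 600-second moves the busiest slot actually has to run.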
|
|
28d1ef24ec |
fix(admin): allow control chars in file paths when browsing filer (#9043)
* fix(admin): allow control chars in file paths when browsing filer The admin UI rejected any path containing \x00, \r, or \n as "path contains invalid characters". These bytes are legal in S3 object keys, so objects created through the S3 API (or replicated via filer.sync) could exist on the filer but be unreachable from the admin UI — browse, download, and upload all failed with "Invalid file path". Drop the control-character rejection and instead URL-escape the path when constructing filer request URLs, so that such bytes cannot inject into the HTTP request target. Path traversal protection via path.Clean is unchanged. * test(admin): strengthen file path tests with byte-preserving checks Assert full expected output for validateAndCleanFilePath so silent stripping of control characters would fail the test, and cover \r and \x00 escaping in filerFileURL in addition to \n and space. |
||
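The approach in the admin commit above, cleaning the path and then escaping it when building the request URL, can be illustrated with the standard library alone. `buildFilerURL` is a hypothetical helper and the plain `http` scheme is an assumption; the real admin code may construct its URLs differently.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// buildFilerURL cleans filePath and returns a URL whose path is percent-encoded,
// so control characters cannot leak into the HTTP request target.
func buildFilerURL(filerAddress, filePath string) string {
	cleaned := path.Clean("/" + filePath) // keeps traversal protection, preserves bytes
	u := url.URL{Scheme: "http", Host: filerAddress, Path: cleaned}
	return u.String()
}

func main() {
	fmt.Println(buildFilerURL("localhost:8888", "/buckets/b/a\nb c"))
	// prints "http://localhost:8888/buckets/b/a%0Ab%20c"
	fmt.Println(buildFilerURL("localhost:8888", "/buckets/../../etc"))
	// prints "http://localhost:8888/etc"
}
```

Because the escaping happens at URL construction time, the stored object key keeps its original bytes and only the wire representation changes.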
|
|
edf7d2a074 |
fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9039)
* fix(filer): eliminate redundant disk reads causing memory/CPU regression (#9035) Since 4.18, LocalMetaLogBuffer's ReadFromDiskFn was set to readPersistedLogBufferPosition, causing LoopProcessLogData to call ReadPersistedLogBuffer on every 250ms health-check tick when a subscriber encounters ResumeFromDiskError. Each call creates an OrderedLogVisitor (ListDirectoryEntries on the filer store), spawns a readahead goroutine with a 1024-element channel, finds no data, and returns — 4 times per second even on an idle filer. This is redundant because SubscribeLocalMetadata already manages disk reads explicitly with its own shouldReadFromDisk / lastCheckedFlushTsNs tracking in the outer loop. Set ReadFromDiskFn back to nil for LocalMetaLogBuffer. When LoopProcessLogData encounters ResumeFromDiskError with nil ReadFromDiskFn, the HasData() guard returns ResumeFromDiskError to the caller (SubscribeLocalMetadata), which blocks efficiently on listenersCond.Wait() instead of polling. * fix(filer): add gap detection for slow consumers after disk-read stall When a slow consumer falls behind and LoopProcessLogData returns ResumeFromDiskError with no flush or read-position progress, there may be a gap between persisted data and in-memory data (e.g. writes stopped while consumer was still catching up). Without this, the consumer would block on listenersCond.Wait() forever. Skip forward to the earliest in-memory time to resume progress, matching the gap-handling pattern already used in the shouldReadFromDisk path. * fix(filer): clear stale ResumeFromDiskError after gap-skip to avoid stall The gap-detection block added in the previous commit skips lastReadTime forward to GetEarliestTime() and continues the outer loop. On the next iteration, shouldReadFromDisk becomes true (currentReadTsNs > lastDiskReadTsNs), the disk read returns processedTsNs == 0, and the existing gap handler at the top of the loop runs its own gap check. 
That check uses readInMemoryLogErr == ResumeFromDiskError as the entry condition, but readInMemoryLogErr is still the stale error from two iterations ago. GetEarliestTime() now equals lastReadTime.Time (we already advanced to it), so earliestTime.After(lastReadTime.Time) is false and the handler falls into listenersCond.Wait(): stuck. Clear readInMemoryLogErr at the gap-skip point, matching the existing pattern at the earlier gap handler that already clears it for the same reason.
* fix(log_buffer): GetEarliestTime must include sealed prev buffers
GetEarliestTime previously returned only logBuffer.startTime (the active buffer's first timestamp). That is narrower than ReadFromBuffer's tsMemory, which is the min across active + prev buffers. Callers using GetEarliestTime for gap detection after ResumeFromDiskError (the SubscribeLocalMetadata outer loop's disk-read path, the new gap-skip in the in-memory ResumeFromDiskError handler, and MQ HasData) saw a time that was *newer* than the real earliest in-memory data.
Impact in SubscribeLocalMetadata's slow-consumer path:
- tsMemory = earliest prev buffer time (T_prev)
- GetEarliestTime() = active startTime (T_active, later than T_prev)
- Consumer position = T1, with T_prev < T1 < T_active
- ReadFromBuffer returns ResumeFromDiskError (T1 < tsMemory)
- Gap detect: GetEarliestTime().After(T1) = T_active.After(T1) = true
- Skip forward to T_active -- silently drops the prev-buffer data
- And when T_active happens to equal the stuck position, gap detect evaluates false, and the subscriber stalls on listenersCond.Wait()
This reproduces the TestMetadataSubscribeSlowConsumerKeepsProgressing failure in CI where the consumer stalled at 10220/20000 after writing stopped -- the buffer still had data in prev[0..3], but gap detection was comparing against the active buffer's startTime.
Fix: scan all sealed prev buffers under RLock, return the true minimum startTime. Matches the min-of-buffers logic in ReadFromBuffer.
* test(log_buffer): make DiskReadRetry test deterministic The previous test added the message via AddToBuffer + ForceFlush and relied on a race: the second disk read had to happen before the data was delivered through the in-memory path. Under the race detector or on a slow CI runner, the reader is woken by AddToBuffer's notification, finds the data in the active buffer or its prev slot, and returns after exactly one disk read — failing the >= 2 disk reads assertion even though the loop behaved correctly. Reproduced on master with race detector (2/5 failures). Rewrite the test to deliver the data exclusively through the disk-read path: no AddToBuffer, no ForceFlush. The test waits until the reader has issued at least one no-op disk read, then atomically flips a "dataReady" flag. The reader's next iteration through readFromDiskFn returns the entry. This deterministically exercises the retry-loop behavior the test was originally written to protect, and removes the in-memory delivery race entirely. |
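The GetEarliestTime fix in #9035 comes down to taking the minimum start time across the active buffer and every sealed prev buffer. The following is a minimal sketch of that idea only; `logBuffer`, `memBuffer`, and `earliestTsNs` are hypothetical names, and timestamps are modelled as plain nanosecond integers rather than the types used in the real code.

```go
package main

import (
	"fmt"
	"sync"
)

// memBuffer is a hypothetical stand-in for one in-memory log segment.
// A zero startTsNs means the segment holds no data.
type memBuffer struct {
	startTsNs int64
}

// logBuffer is a simplified model: one active segment plus sealed prev segments.
type logBuffer struct {
	mu     sync.RWMutex
	active memBuffer
	prev   []memBuffer
}

// earliestTsNs returns the smallest non-zero start timestamp across the
// active and sealed segments, or 0 when nothing is buffered.
func (b *logBuffer) earliestTsNs() int64 {
	b.mu.RLock()
	defer b.mu.RUnlock()
	earliest := b.active.startTsNs
	for _, p := range b.prev {
		if p.startTsNs == 0 {
			continue // empty slot, nothing to compare
		}
		if earliest == 0 || p.startTsNs < earliest {
			earliest = p.startTsNs
		}
	}
	return earliest
}

func main() {
	b := &logBuffer{
		active: memBuffer{startTsNs: 30},
		prev:   []memBuffer{{startTsNs: 10}, {}},
	}
	// Looking only at the active segment would report 30 and hide the
	// sealed data that starts at 10.
	fmt.Println(b.earliestTsNs()) // prints 10
}
```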
||
|
|
10e7f0f2bc |
fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy Instead of erroring when the user already exists, the command now creates the policy and attaches it to the existing user via UpdateUser. Credentials are only generated and displayed for newly created users. * fix(shell): skip duplicate policy attachment in s3.user.provision Check if the policy is already attached before appending and calling UpdateUser, making repeated runs idempotent. * fix(shell): generate service account ID in s3.serviceaccount.create The command built a ServiceAccount proto without setting Id, which was rejected by credential.ValidateServiceAccountId on any real store. Now generates sa:<parent>:<uuid> matching the format used by the admin UI. * test(s3): integration tests for s3.* shell commands Adds TestShell* integration tests covering ~40 previously untested shell commands: user, accesskey, group, serviceaccount, anonymous, bucket, policy.attach/detach, config.show, and iam.export/import. Switches the test cluster's credential store from memory to filer_etc because the memory store silently drops groups and service accounts in LoadConfiguration/SaveConfiguration. * fix(shell): rollback policy on key generation failure in s3.user.provision If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after the policy was persisted, the policy would be left orphaned. Extracts the rollback logic into a local closure and invokes it on all failure paths after policy creation for consistency. 
* address PR review feedback for s3 shell tests and serviceaccount - s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for the service account UUID instead of 4 bytes to eliminate collision risk - s3.serviceaccount.create: print the actual ID and drop the outdated "server-assigned" note (the ID is now client-generated) - tests: guard createdAK in accesskey rotate/delete subtests so sibling failures don't run invalid CLI calls - tests: requireContains/requireNotContains use t.Fatalf to fail fast - tests: Provision subtest asserts the "Attached policy" message on the second provision call for an existing user - tests: update extractServiceAccountID comment example to match the sa:<parent>:<uuid> format - tests: drop redundant saID empty-check (extractServiceAccountID fatals) * test(s3): use t.Fatalf for precondition check in serviceaccount test |
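The service account ID format mentioned in #9040 (`sa:<parent>:<uuid>`, with 16 random bytes hex-encoded) can be sketched as below. The helper name `newServiceAccountID` is hypothetical and this is not the code from the commit; it only illustrates the stated format.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newServiceAccountID builds an ID of the form sa:<parent>:<32 hex chars>,
// using 16 bytes from the system's cryptographic random source.
func newServiceAccountID(parent string) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return "sa:" + parent + ":" + hex.EncodeToString(buf), nil
}

func main() {
	id, err := newServiceAccountID("alice")
	if err != nil {
		fmt.Println("failed to generate ID:", err)
		return
	}
	fmt.Println(id)
}
```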
||
|
|
9cae95d749 |
fix(filer): prevent data corruption during graceful shutdown (#9037)
* fix: wait for in-flight uploads to complete before filer shutdown Prevents data corruption when SIGTERM is received during active uploads. The filer now waits for all in-flight operations to complete before calling the underlying shutdown logic. This affects all deployment types (Kubernetes, Docker, systemd) and fixes corruption issues during rolling updates, certificate rotation, and manual restarts. Changes: - Add FilerServer.Shutdown() method with upload wait logic - Update grace.OnInterrupt hook to use new shutdown method Fixes data corruption reported by production users during pod restarts. * fix: implement graceful shutdown for gRPC and HTTP servers, ensuring in-flight uploads complete * fix: address review comments on graceful shutdown - Add 10s timeout to gRPC GracefulStop to prevent indefinite blocking from long-lived streams (falls back to Stop on timeout) - Reduce HTTP/HTTPS shutdown timeout from 25s to 15s to fit within Kubernetes default 30s termination grace period - Move fs.Shutdown() (database close) after Serve() returns instead of a separate hook to eliminate race where main goroutine exits before the shutdown hook runs * fix: shut down all HTTP servers before filer database close Address remaining review comments: - Shut down auxiliary HTTP servers (Unix socket, local listener) during graceful shutdown so they can't serve write traffic after the main server stops - Register fs.Shutdown() as a grace.OnInterrupt hook to guarantee it completes before os.Exit(0), fixing the race between the grace goroutine and the main goroutine - Use sync.Once to ensure fs.Shutdown() runs exactly once regardless of whether shutdown is signal-driven or context-driven (MiniCluster) --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
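Two patterns from #9037 are a graceful stop that falls back to a hard stop after a timeout, and a `sync.Once` guard so the final shutdown step runs exactly once. Here is a rough sketch of both; the `stopper` interface and `fakeServer` type are hypothetical stand-ins used so the example runs without a real gRPC server.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stopper is a hypothetical interface covering the two stop styles.
type stopper interface {
	GracefulStop()
	Stop()
}

// stopWithTimeout tries a graceful stop and falls back to a hard stop when
// the timeout elapses. It reports whether the graceful stop finished in time.
func stopWithTimeout(s stopper, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		s.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		s.Stop()
		return false
	}
}

// fakeServer simulates a server; a non-nil hang channel makes GracefulStop
// block until Stop closes it, like a long-lived stream would.
type fakeServer struct {
	hang    chan struct{}
	stopped bool
}

func (f *fakeServer) GracefulStop() {
	if f.hang != nil {
		<-f.hang
	}
}

func (f *fakeServer) Stop() {
	f.stopped = true
	if f.hang != nil {
		close(f.hang)
	}
}

func main() {
	slow := &fakeServer{hang: make(chan struct{})}
	fmt.Println("graceful:", stopWithTimeout(slow, 50*time.Millisecond))

	// sync.Once keeps the final close idempotent, whichever path triggers it.
	var once sync.Once
	closeDB := func() { once.Do(func() { fmt.Println("database closed") }) }
	closeDB()
	closeDB() // second call is a no-op
}
```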
||
|
|
e8a8449553 |
feat(mount): pre-allocate file IDs in pool for writeback cache mode (#9038)
* feat(mount): pre-allocate file IDs in pool for writeback cache mode When writeback caching is enabled, chunk uploads no longer block on a per-chunk AssignVolume RPC. Instead, a FileIdPool pre-allocates file IDs in batches using a single AssignVolume(Count=N, ExpectedDataSize=ChunkSize) call and hands them out instantly to upload workers. Pool size is 2x ConcurrentWriters, refilled in background when it drops below ConcurrentWriters. Entries expire after 25s to respect JWT TTL. Sequential needle keys are generated from the base file ID returned by the master, so one Assign RPC produces N usable IDs. This cuts per-chunk upload latency from 2 RTTs (assign + upload) to 1 RTT (upload only), with the assign cost amortized across the batch. * test: add benchmarks for file ID pool vs direct assign Benchmarks measure: - Pool Get vs Direct AssignVolume at various simulated latencies - Batch assign scaling (Count=1 through Count=32) - Concurrent pool access with 1-64 workers Results on Apple M4: - Pool Get: constant ~3ns regardless of assign latency - Batch=16: 15.7x more IDs/sec than individual assigns - 64 concurrent workers: 19M IDs/sec throughput * fix(mount): address review feedback on file ID pool 1. Fix race condition in Get(): use sync.Cond so callers wait for an in-flight refill instead of returning an error when the pool is empty. 2. Match default pool size to async flush worker count (128, not 16) when ConcurrentWriters is unset. 3. Add logging to UploadWithAssignFunc for consistency with UploadWithRetry. 4. Document that pooled assigns omit the Path field, bypassing path-based storage rules (filer.conf). This is an intentional tradeoff for writeback cache performance. 5. Fix flaky expiry test: widen time margin from 50ms to 1s. 6. Add TestFileIdPoolGetWaitsForRefill to verify concurrent waiters. 
* fix(mount): use individual Count=1 assigns to get per-fid JWTs The master generates one JWT per AssignResponse, bound to the base file ID (master_grpc_server_assign.go:158). The volume server validates that the JWT's Fid matches the upload exactly (volume_server_handlers.go:367). Using Count=N and deriving sequential IDs would fail this check. Switch to individual Count=1 RPCs over a single gRPC connection. This still amortizes connection overhead while getting a correct per-fid JWT for each entry. Partial batches are accepted if some requests fail. Remove unused needle import now that sequential ID generation is gone. * fix(mount): separate pprof from FUSE protocol debug logging The -debug flag was enabling both the pprof HTTP server and the noisy go-fuse protocol logging (rx/tx lines for every FUSE operation). This makes profiling impractical as the log output dominates. Split into two flags: - -debug: enables pprof HTTP server only (for profiling) - -debug.fuse: enables raw FUSE protocol request/response logging * perf(mount): replace LevelDB read+write with in-memory overlay for dir mtime Profile showed TouchDirMtimeCtime at 0.22s — every create/rename/unlink in a directory did a LevelDB FindEntry (read) + UpdateEntry (write) just to bump the parent dir's mtime/ctime. Replace with an in-memory map (same pattern as existing atime overlay): - touchDirMtimeCtimeLocal now stores inode→timestamp in dirMtimeMap - applyInMemoryDirMtime overlays onto GetAttr/Lookup output - No LevelDB I/O on the mutation hot path The overlay only advances timestamps forward (max of stored vs overlay), so stale entries are harmless. Map is bounded at 8192 entries. * perf(mount): skip self-originated metadata subscription events in writeback mode With writeback caching, this mount is the single writer. All local mutations are already applied to the local meta cache (via applyLocalMetadataEvent or direct InsertEntry). 
The filer subscription then delivers the same event back, causing redundant work: proto.Clone, enqueue to apply loop, dedup ring check, and sometimes redundant LevelDB writes when the dedup ring misses (deferred creates). Check EventNotification.Signatures against selfSignature and skip events that originated from this mount. This eliminates the redundant processing for every self-originated mutation. * perf(mount): increase kernel FUSE cache TTL in writeback cache mode With writeback caching, this mount is the single writer — the local meta cache is authoritative. Increase EntryValid and AttrValid from 1s to 10s so the kernel doesn't re-issue Lookup/GetAttr for every path component and stat call. This reduces FUSE /dev/fuse round-trips which dominate the profile at 38% of CPU (syscall.rawsyscalln). Each saved round-trip eliminates a kernel→userspace→kernel transition. Normal (non-writeback) mode retains the 1s TTL for multi-mount consistency. |
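The in-memory directory mtime overlay described in #9038 only ever moves timestamps forward and is bounded in size. A minimal sketch of that behaviour follows; the type and method names are hypothetical, and clearing the map when it is full is an assumption of this sketch, since the commit message does not say how the bound is enforced.

```go
package main

import (
	"fmt"
	"sync"
)

// dirMtimeOverlay is a hypothetical bounded map of inode -> mtime (ns).
type dirMtimeOverlay struct {
	mu    sync.Mutex
	limit int
	byIno map[uint64]int64
}

func newDirMtimeOverlay(limit int) *dirMtimeOverlay {
	return &dirMtimeOverlay{limit: limit, byIno: make(map[uint64]int64)}
}

// touch records a newer timestamp for the inode; older values are ignored.
func (o *dirMtimeOverlay) touch(inode uint64, tsNs int64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	if cur, ok := o.byIno[inode]; ok {
		if tsNs > cur {
			o.byIno[inode] = tsNs
		}
		return
	}
	if len(o.byIno) >= o.limit {
		// Assumption: drop everything when full. Safe because the overlay
		// is only an optimisation on top of the stored value.
		o.byIno = make(map[uint64]int64)
	}
	o.byIno[inode] = tsNs
}

// apply returns the later of the stored timestamp and the overlay value.
func (o *dirMtimeOverlay) apply(inode uint64, storedNs int64) int64 {
	o.mu.Lock()
	defer o.mu.Unlock()
	if ts, ok := o.byIno[inode]; ok && ts > storedNs {
		return ts
	}
	return storedNs
}

func main() {
	o := newDirMtimeOverlay(8192)
	o.touch(42, 100)
	fmt.Println(o.apply(42, 50))  // prints 100
	fmt.Println(o.apply(42, 200)) // prints 200
}
```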
||
|
|
b37bbf541a |
feat(master): drain pending size before marking volume readonly (#9036)
* feat(master): drain pending size before marking volume readonly When vacuum, volume move, or EC encoding marks a volume readonly, in-flight assigned bytes may still be pending. This adds a drain step: immediately remove from writable list (stop new assigns), then wait for pending to decay below 4MB or 30s timeout. - Add volumeSizeTracking struct consolidating effectiveSize, reportedSize, and compactRevision into a single map - Add GetPendingSize, waitForPendingDrain, DrainAndRemoveFromWritable, DrainAndSetVolumeReadOnly to VolumeLayout - UpdateVolumeSize detects compaction via compactRevision change and resets effectiveSize instead of decaying - Wire drain into vacuum (topology_vacuum.go) and volume mark readonly (master_grpc_server_volume.go) * fix: use 2MB pending size drain threshold * fix: check crowded state on initial UpdateVolumeSize registration * fix: respect context cancellation in drain, relax test timing - DrainAndSetVolumeReadOnly now accepts context.Context and returns early on cancellation (for gRPC handler timeout/cancel) - waitForPendingDrain uses select on ctx.Done instead of time.Sleep - Increase concurrent heartbeat test timeout from 10s to 15s for CI * fix: use time-based dedup so decay runs even when reported size is unchanged The value-based dedup (same reportedSize + compactRevision = skip) prevented decay from running when pending bytes existed but no writes had landed on disk yet. The reported size stayed the same across heartbeats, so the excess never decayed. Fix: dedup replicas within the same heartbeat cycle using a 2-second time window instead of comparing values. This allows decay to run once per heartbeat cycle even when the reported size is unchanged. 
Also confirmed finding 1 (draining re-add race) is a false positive: - Vacuum: ensureCorrectWritables only runs for ReadOnly-changed volumes - Move/EC: readonlyVolumes flag prevents re-adding during drain * fix: make VolumeMarkReadonly non-blocking to fix EC integration test timeout The DrainAndSetVolumeReadOnly call in VolumeMarkReadonly gRPC blocked up to 30s waiting for pending bytes to decay. In integration tests (and real clusters during EC encoding), this caused timeouts because multiple volumes are marked readonly sequentially and heartbeats may not arrive fast enough to decay pending within the drain window. Fix: VolumeMarkReadonly now calls SetVolumeReadOnly immediately (stops new assigns) and only logs a warning if pending bytes remain. The drain wait is kept only for vacuum (DrainAndRemoveFromWritable) which runs inside the master's own goroutine pool. Remove DrainAndSetVolumeReadOnly as it's no longer used. * fix: relax test timing, rename test, add post-condition assert * test: add vacuum integration tests with CI workflow Full-cluster integration test for vacuum, modeled on the EC integration tests. Starts a real master + 2 volume servers, uploads data, deletes entries to create garbage, runs volume.vacuum via shell command, and verifies garbage cleanup and data integrity. Test flow: 1. Start cluster (master + 2 volume servers) 2. Upload 10 files to create volume with data 3. Delete 5 files to create ~50% garbage 4. Verify garbage ratio > 10% 5. Run volume.vacuum command 6. Verify garbage cleaned up 7. Verify remaining 5 files are still accessible CI workflow runs on push/PR to master with 15-minute timeout. Log collection on failure via artifact upload. 
* fix: use 500KB files and delete 75% to exceed vacuum garbage threshold * fix: add shell lock before vacuum command, fix compilation error * fix: strengthen vacuum integration test assertions - waitForServer: use net.DialTimeout instead of grpc.NewClient for real TCP readiness check - verify_garbage_before_vacuum: t.Fatal instead of warning when no garbage detected - verify_cleanup_after_vacuum: t.Fatal if no server reported the volume or cleanup wasn't verified - verify_remaining_data: read actual file contents via HTTP and compare byte-for-byte against original uploaded payloads * fix: use http.Client with timeout and close body before retry |
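The drain step in #9036 waits until pending bytes fall below a threshold, or gives up when the context is cancelled or times out. The sketch below shows one way to write such a loop with `select` on `ctx.Done()`. The function names, the `pending` callback, and the polling interval are all assumptions of this example, not the signatures used in the master.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPendingDrain polls pending() until it drops below threshold or ctx
// is done. It reports whether the drain completed.
func waitForPendingDrain(ctx context.Context, pending func() int64, threshold int64, poll time.Duration) bool {
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		if pending() < threshold {
			return true
		}
		select {
		case <-ctx.Done():
			return false // cancelled or timed out
		case <-ticker.C:
		}
	}
}

// drainWithTimeout is a convenience wrapper that applies a deadline.
func drainWithTimeout(pending func() int64, threshold int64, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForPendingDrain(ctx, pending, threshold, time.Millisecond)
}

func main() {
	remaining := int64(8 << 20)
	// Simulate pending bytes halving on every check.
	halving := func() int64 { remaining /= 2; return remaining }
	fmt.Println("drained:", drainWithTimeout(halving, 2<<20, time.Second))
}
```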
||
|
|
10b0bdce02 |
feat: pass expected_data_size from clients for size-aware assignment (#9032)
* feat: pass expected_data_size from clients for size-aware assignment Add expected_data_size field to AssignRequest (master proto) and AssignVolumeRequest (filer proto) so clients can hint how large the data will be. The master uses this instead of the 1MB default when tracking pending volume sizes for weighted assignment. - Add expected_data_size to master.proto AssignRequest - Add expected_data_size to filer.proto AssignVolumeRequest - Wire through filer AssignVolume handler - Wire through HTTP submit handler (uses actual upload size) - Add ExpectedDataSize to VolumeAssignRequest in operation package - Topology.PickForWrite accepts optional expectedDataSize parameter * fix: guard integer conversions in expected_data_size path - common.go: clamp OriginalDataSize to non-negative before uint64 cast - topology.go: cap expectedDataSize at math.MaxInt64 before int64 cast * fix: parse dataSize hint in HTTP /dir/assign and test non-zero expectedDataSize - HTTP /dir/assign now parses optional "dataSize" query parameter and passes it to PickForWrite instead of hardcoded 0 - Add test assertion for PickForWrite with non-zero expectedDataSize |
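The integer-conversion guards mentioned in #9032 (clamp to non-negative before a `uint64` cast, cap at `math.MaxInt64` before an `int64` cast) can be illustrated as follows. The helper names are hypothetical; the commit applies these checks inline rather than necessarily through functions like these.

```go
package main

import (
	"fmt"
	"math"
)

// clampToUint64 converts a signed size to uint64, treating negatives as 0.
func clampToUint64(n int64) uint64 {
	if n < 0 {
		return 0
	}
	return uint64(n)
}

// clampToInt64 converts an unsigned size to int64, capping at MaxInt64 so
// the result never wraps to a negative number.
func clampToInt64(n uint64) int64 {
	if n > math.MaxInt64 {
		return math.MaxInt64
	}
	return int64(n)
}

func main() {
	fmt.Println(clampToUint64(-5))             // prints 0
	fmt.Println(clampToInt64(math.MaxUint64))  // prints 9223372036854775807
}
```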
||
|
|
e2c79af6ec |
feat(master): size-aware volume assignment with weighted selection (#9031)
* feat(master): size-aware volume assignment with weighted selection PickForWrite now selects volumes proportional to remaining capacity instead of uniform random, so emptier volumes receive more writes. - Add vid2size map to VolumeLayout tracking effective volume sizes - Weighted pick via random sampling (k=3) for O(1) cost - RecordAssign tracks estimated pending bytes between heartbeats - Exponential decay on heartbeat: halve excess each cycle - Proactive crowded detection using effective size - Zero extra heap allocations on the unconstrained hot path Benchmark (20 writable volumes, unconstrained): Before: 36 ns/op, 32 B/op, 2 allocs/op After: 85 ns/op, 32 B/op, 2 allocs/op * fix: address review feedback on size-aware assignment - RecordAssign: use write lock (Lock) instead of read lock (RLock) since it mutates vid2size map and crowded set - RegisterVolume: clear crowded flag when heartbeat decay drops effective size below the threshold - pickWeightedByRemaining: fix misleading Fisher-Yates comment, simplify to plain random sampling (duplicates are harmless) - ShouldGrowVolumesByDcAndRack: read vid2size under RLock * fix: decay once per heartbeat cycle, not per replica RegisterVolume is called once per replica of a volume. For replicated volumes, the pending size decay was running multiple times per heartbeat cycle, reducing the excess by 75% instead of 50% (for 2 replicas). Fix: track vid2reportedSize and only run decay when the heartbeat- reported size actually changes. A second replica reporting the same size in the same cycle is a no-op. Also fix CodeQL alert: cap count*EstimatedNeedleSizeBytes to avoid uint64→int64 overflow in RecordAssign call. 
* Potential fix for pull request finding 'CodeQL / Incorrect conversion between integer types' Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> * fix: fail fast in test setup on JSON errors - setupWithLimit now takes testing.TB and calls t.Fatalf on unmarshal errors or type assertion failures instead of printing and continuing - benchSetup removed; benchmarks reuse setupWithLimit directly * fix: run size decay on every heartbeat, not just new volumes RegisterVolume is only called for newly discovered volumes, not on every heartbeat. The pending size decay was never running in production. - Extract decay logic into UpdateVolumeSize(), called from SyncDataNodeRegistration for every reported volume on every heartbeat - RegisterVolume only initializes vid2size for brand-new volumes - Constrained PickForWrite: scan from random offset, collect up to pickSampleSize matches in a stack array (no append allocation) - Tests now exercise UpdateVolumeSize directly instead of RegisterVolume to match the production heartbeat path * fix: compute pending bytes in uint64 to satisfy CodeQL --------- Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> |
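Two ideas from #9031 are halving the pending excess on each heartbeat and picking a volume from a small random sample weighted by remaining capacity. The sketch below is one possible reading of those ideas. All names are hypothetical, and the exact way the k samples are weighted is an assumption, since the commit message does not spell it out.

```go
package main

import (
	"fmt"
	"math/rand"
)

// decayPending halves the excess of the effective size over the size
// reported by the heartbeat. It never returns less than the reported size.
func decayPending(effective, reported uint64) uint64 {
	if effective <= reported {
		return reported
	}
	return reported + (effective-reported)/2
}

// pickWeightedByRemaining draws k random candidates (duplicates allowed) and
// chooses one with probability proportional to its remaining capacity.
// It returns -1 when there are no volumes or k is not positive.
func pickWeightedByRemaining(remaining []uint64, k int, rng *rand.Rand) int {
	if len(remaining) == 0 || k <= 0 {
		return -1
	}
	samples := make([]int, k)
	var total uint64
	for i := range samples {
		samples[i] = rng.Intn(len(remaining))
		total += remaining[samples[i]]
	}
	if total == 0 {
		return samples[0] // every sampled volume is full
	}
	target := rng.Uint64() % total
	for _, idx := range samples {
		if target < remaining[idx] {
			return idx
		}
		target -= remaining[idx]
	}
	return samples[k-1]
}

// pickWithSeed is a deterministic wrapper using k=3.
func pickWithSeed(remaining []uint64, seed int64) int {
	return pickWeightedByRemaining(remaining, 3, rand.New(rand.NewSource(seed)))
}

func main() {
	fmt.Println(decayPending(1000, 200)) // prints 600
	fmt.Println(pickWithSeed([]uint64{10, 500, 90}, 1))
}
```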
||
|
|
388cc018ab |
fix(mount): reduce unnecessary filer RPCs across all mutation operations (#9030)
* fix(mount): reduce filer RPCs for mkdir/rmdir operations 1. Mark newly created directories as cached immediately. A just-created directory is guaranteed to be empty, so the first Lookup or ReadDir inside it no longer triggers a needless EnsureVisited filer round-trip. 2. Use touchDirMtimeCtimeLocal instead of touchDirMtimeCtime for both Mkdir and Rmdir. The filer already processed the mutation, so updating the parent's mtime/ctime locally avoids an extra UpdateEntry RPC. Net effect: mkdir goes from 3 filer RPCs to 1. * fix(mount): eliminate extra filer RPCs for parent dir mtime updates Every mutation (create, unlink, symlink, link, rename) was calling touchDirMtimeCtime after the filer already processed the mutation. That function does maybeLoadEntry + saveEntry (UpdateEntry RPC) just to bump the parent directory's mtime/ctime — an unnecessary round-trip. Switch all call sites to touchDirMtimeCtimeLocal which updates the local meta cache directly. Remove the now-unused touchDirMtimeCtime. Affected operations: Create (Mknod path), Unlink, Symlink, Link, Rename. Each saves one filer RPC per call. * fix(mount): defer RemoveXAttr for open files, skip redundant existence check 1. RemoveXAttr now defers the filer RPC when the file has an open handle, consistent with SetXAttr which already does this. The xattr change is flushed with the file metadata on close. 2. Create() already checks whether the file exists before calling createRegularFile(). Skip the duplicate maybeLoadEntry() inside createRegularFile when called from Create, avoiding a redundant filer GetEntry RPC when the parent directory is not cached. * fix(mount): skip distributed lock when writeback caching is enabled Writeback caching implies single-writer semantics — the user accepts that only one mount writes to each file. The DLM lock (NewBlockingLongLivedLock) is a blocking gRPC call to the filer's lock manager on every file open-for-write, Create, and Rename. 
This is unnecessary overhead when writeback caching is on. Skip lockClient initialization when WritebackCache is true. All DLM call sites already guard on `wfs.lockClient != nil`, so they are automatically skipped. * fix(mount): async filer create for Mknod with writeback caching With writeback caching, Mknod now inserts the entry into the local meta cache immediately and fires the filer CreateEntry RPC in a background goroutine, similar to how Create defers its filer RPC. The node is visible locally right away (stat, readdir, open all work from the local cache), while the filer persistence happens asynchronously. This removes the synchronous filer RPC from the Mknod hot path. * fix(mount): address review feedback on async create and DLM logging 1. Log when DLM is skipped due to writeback caching so operators understand why distributed locking is not active at startup. 2. Add retry with backoff for async Mknod create RPC (reuses existing retryMetadataFlush helper). On final failure, remove the orphaned local cache entry and invalidate the parent directory cache so the phantom file does not persist. * fix(mount): restore filer RPC for parent dir mtime when not using writeback cache The local-only touchDirMtimeCtimeLocal updates LevelDB but lookupEntry only reads from LevelDB when the parent directory is cached. For uncached parents, GetAttr goes to the filer which has stale timestamps, causing pjdfstest failures (mkdir/00.t, rmdir/00.t, unlink/00.t, etc.). Introduce touchDirMtimeCtimeBest which: - WritebackCache mode: local meta cache only (no filer RPC) - Normal mode: filer UpdateEntry RPC for POSIX correctness The deferred file create path keeps touchDirMtimeCtimeLocal since no filer entry exists yet. * fix(mount): use touchDirMtimeCtimeBest for deferred file create path The deferred create path (Create with deferFilerCreate=true) was using touchDirMtimeCtimeLocal unconditionally, but this only updates the local LevelDB cache. 
Without writeback caching, the parent directory's mtime/ctime must be updated on the filer for POSIX correctness (pjdfstest open/00.t). * test: add link/00.t and unlink/00.t to pjdfstest known failures These tests fail nlink assertions (e.g. expected nlink=2, got nlink=3) after hard link creation/removal. The failures are deterministic and surfaced by caching changes that affect the order in which entries are loaded into the local meta cache. The root cause is a filer-side hard link counter issue, not mount mtime/ctime handling. |
||
|
|
41ff105f47 |
object_store_users: fix specific bucket admin permission (#9014)
Fix an issue where selecting Specific Buckets with Admin permission while creating/editing an object store user would grant Admin permission on all buckets |