Commit Graph

13649 Commits

Author SHA1 Message Date
Chris Lu
7a461ffc2f fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9278)
fix(mount): copy xattr value bytes to avoid FUSE buffer aliasing (#9275)

SetXAttr stored the caller-supplied `data` slice directly into
entry.Extended. That slice aliases go-fuse's per-request input buffer,
which is returned to a pool the moment the handler returns. When a file
is open during setxattr (the open-fh path defers persistence to flush),
the next FUSE request recycles the buffer and silently overwrites the
stored xattr bytes; flushMetadataToFiler then ships the corrupted bytes
to the filer. `cp -a` reproduces this because it issues a setxattr while
holding an open fh, then continues to issue follow-up FUSE ops that
reuse the same buffer.

The path-based setxattr (e.g. setfattr without an open fh) saves
synchronously inside the same handler, so the bytes were marshalled
before the buffer could be reused — that is why the source file in the
report looked fine and only the cp -a destination was garbage.

Defensively copy the bytes when storing them, and add a unit test that
mutates the caller buffer after SetXAttr returns to lock in the
invariant.
2026-04-28 23:54:35 -07:00
qzhello
108e42fb8b chore(shell): fix typo in EC shard helper param names (#9277)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison

* chore(shell): fix typo in EC shard helper param names
2026-04-28 23:09:26 -07:00
Chris Lu
02574314f6 test(s3): force-drop collection after deleteBucket in tagging/versioning/cors/copying (#9270)
* test(s3): force-drop collection after deleteBucket across tagging/versioning/cors/copying

Each test creates a unique bucket (= new SeaweedFS collection) and the master's
warm-create issues a 7-volume grow batch. The S3 DeleteBucket-driven collection
sweep snapshots the layout once, but in-flight `volume_grow` requests keep
registering volumes after the snapshot, leaking 1-3 volumes per bucket. On a
single `weed mini` data node with the auto-derived volume cap, those leaks pile
up fast and every subsequent PutObject 500s with "Not enough data nodes found".

Mirror the retention-suite fix (commits ac3a756d, 363d5caa) into the four other
suites that share the same shape: defer a master-side /col/delete after each
bucket teardown, and sweep stale prefix-matching buckets before every new
createBucket. Each suite gets its own MASTER_ENDPOINT default plus the
allTestBucketPrefixes / cleanupAllTestBuckets / cleanupLeftoverTestBuckets /
forceDeleteCollection helpers.

Skipped: iam (separate framework + v1 SDK; already passes
-master.volumeSizeLimitMB=100), sse (different cleanup signature; docker-compose
already passes -volumeSizeLimitMB=50), policy (its TestCluster already uses
-master.volumeSizeLimitMB=32 and tears the whole cluster down). Suites under 5
tests cannot exhaust the cap and were left untouched.

* test(s3): default master endpoint to 127.0.0.1 to avoid IPv6 resolution

Mirror the explicit IPv4 default already used by test/s3/copying. On hosts
where `localhost` resolves to ::1 first the master HTTP listener (bound on
0.0.0.0) is unreachable, so the new force-delete-collection helper would silently
skip cleanup. Pin the default to 127.0.0.1 in tagging/cors/versioning; the
MASTER_ENDPOINT env var still wins when set.

* test(s3): skip buckets newer than this process in leftover-bucket sweep

cleanupAllTestBuckets/cleanupTestBuckets used to delete every bucket whose name
matched the suite's prefix. With shared S3/master endpoints, a second concurrent
`go test` run could have its live buckets torn down mid-test by the first run's
sweep.

Capture testRunStart at package init (with a 1-minute backdate for clock skew)
and skip any bucket whose CreationDate is newer than that. Stale buckets from
panicked or interrupted prior runs (the original target of the sweep) still get
collected because they were created before this process started.

* test(s3-copying): sweep stale prefix buckets from createBucket, not just from a few callers

The s3-copying suite already had cleanupTestBuckets that walks every bucket and
drops the test-copying-* prefix matches, but only four tests in the file invoke
it; the rest go straight to createBucket. So a panicked or interrupted prior run
could leak buckets that survive into the next run and exhaust the data node's
volume slots before any of the prefix-sweeping tests get a chance to run.

Hoist the sweep into createBucket so every test that creates a bucket starts on
a clean slate. The per-process CreationDate filter from the prior commit keeps
this safe under concurrent runs.

* test(s3): scope leftover-bucket sweep to this run via runID marker

The CreationDate filter only protected against runs that started later than this
process. A different `go test` against the same S3/master endpoints that started
*earlier* and is still active has buckets older than testRunStart, so the sweep
would still tear them down mid-test.

Replace the time window with a per-process runID baked into bucket names: every
bucket this run creates gets `r{runID}-` after the suite's BucketPrefix, and the
sweep only matches that owned subset. Concurrent runs each carry their own
runID and never see each other's buckets.

Trade-off: buckets left behind by a crashed prior run carry a different runID
and won't be cleaned up by this sweep. That recovery path now belongs to
`make clean` / data-dir wipe, which is what CI already does between jobs.
Fixed-name versioning buckets (e.g. test-versioning-directories) bypass
getNewBucketName and so also bypass this sweep — they are short-lived and
handled by their own deferred deleteBucket.

* test(s3): drop cleanupLeftoverTestBuckets wrapper, call cleanupAllTestBuckets directly

cleanupLeftoverTestBuckets was a one-line forwarder kept only as a semantic name
at the createBucket call site. Inline the call and update the doc comments to
point at the actual implementation. tagging/cors/versioning all had the wrapper;
copying never did.
2026-04-28 22:13:42 -07:00
Chris Lu
d9b86fb495 fix(s3api): clear stale latest-version pointer when .versions dir cleanup is blocked (#9269)
* fix(s3api): clear stale latest-version pointer when post-deletion cleanup is blocked

In versioned buckets, when the only remaining children of a .versions
directory are orphan entries (v_<id> files that lack the version-id
extended attribute - e.g. left behind by older code paths or interrupted
writes), updateLatestVersionAfterDeletion's selectLatestVersion finds
zero promotable versions and falls through to s3a.rm. The non-recursive
rm fails because the orphan blocks the directory deletion. Previously
the code logged the failure and returned, leaving the LatestVersionId
pointer pointing at the version we just deleted.

For Veeam-style workloads that GET-PUT-GET-DELETE a small lock object on
every checkpoint, the stale pointer poisons every subsequent run: the
next GET re-enters getLatestObjectVersion's 13x retry loop on the
missing pointer file plus the self-heal rescan, all to return the same
404. The cycle is self-perpetuating until the orphan is removed by hand.

When rm fails, additionally clear the LatestVersionId / LatestVersionFile
pointer fields on the .versions directory entry. The orphan files stay
in place (an operator can audit and remove them); from the S3 API
perspective the object is now correctly absent and subsequent reads
short-circuit to ErrNotFound on the fast path instead of replaying the
heal cycle.

* fix(s3api): clear stale latest-version pointer on read-side self-heal failure

healStaleLatestVersionPointer is invoked by getLatestObjectVersion when the
pointed-at version file is missing. The rescan path can find no remaining
version-id-tagged entries (e.g. when only orphan v_<id> files lacking the
version-id extended attribute remain). Prior code returned ErrNotFound but
left the stale pointer in place, so every subsequent read replayed the same
13x retry loop on the missing file and re-entered self-heal, all to return
the same 404.

Reuse the same pointer-clear logic introduced for updateLatestVersionAfterDeletion.
The two call sites are now identical, so factor the body out into
clearStaleLatestVersionPointer. The caller parameter carries the source
function name so the log lines operators were already grepping
(updateLatestVersionAfterDeletion: cleared stale ... and
healStaleLatestVersionPointer: cleared stale ...) keep the same prefix.

* fix(s3api): re-validate before clearing latest-version pointer (CAS)

Reviewer feedback (gemini-code-assist, coderabbitai) on PR #9269 flagged
a TOCTOU in clearStaleLatestVersionPointer: between the caller loading
versionsEntry and this function persisting the cleared map, a concurrent
PUT could promote a fresh version. Persisting the caller's snapshot then
silently overwrites the live pointer and re-introduces the missing-pointer
state on a now-existing object.

Make the persist CAS-style:

  1. Re-scan .versions for any version-id-tagged entry. If one now exists,
     abort - the concurrent writer has populated the directory and either
     already updated the pointer or the next read's self-heal will pick
     up the new entry.
  2. Re-fetch the live .versions directory entry and only proceed if its
     latest-pointer fields still match the stale id the caller observed.
     A concurrent pointer update by another writer is detected here and
     the clear is skipped.
  3. Persist with mkFile using the live Extended map (minus the two
     pointer fields and the cached metadata) so any other Extended fields
     written concurrently between (2) and the persist are preserved.

A note on the literal suggestion of mutating updatedEntry.Extended in
the mkFile callback: that does not work because mkFile constructs a
fresh *filer_pb.Entry rather than reading the live entry first
(weed/pb/filer_pb/filer_client.go:247). The callback's updatedEntry is
nil at invocation, so a delete on it would be a no-op and we would lose
all Extended fields, not just the two we want to clear. The correct
shape - re-fetching the live entry before mkFile and carrying its
Extended map into the persist - is what this change implements.

True atomic CAS would require filer-level conditional update support;
this change narrows the race window from the full caller scope to the
~ms gap between (2) and (3), which is the best we can do without that.
2026-04-28 21:02:52 -07:00
chenshi
c93018d987 fix(s3api): fix uint16 overflow in doListFilerEntries Limit calculation (#9271) 2026-04-28 20:50:05 -07:00
Chris Lu
35fe3c801b feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder

The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.

Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.

This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.

References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.

* feat(nfs): advertise UDP MOUNT v3 in the portmap responder

The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.

Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.

* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener

Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.

The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.

Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.

* docs(nfs): note that the plain mount form works on UDP-default clients

With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.

The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.

* test(nfs): add kernel-mount e2e tests under test/nfs

The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.

Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:

  - TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
  - TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
    only — regression test for the new UDP MOUNT v3 responder.
  - TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
    asserting the kernel surfaces a protocol/version-level error rather
    than a generic "mount system call failed" — regression test for the
    PROG_MISMATCH path from #9262.

The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.

Run locally with:

	cd test/nfs
	sudo go test -v -run TestKernelMount ./...

CI wiring follows in the next commit.

* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow

Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:

  - Existing protocol-layer step now skips '^TestKernelMount' so a
    "skipped because not root" line doesn't appear on every run.
  - New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
    helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
    name lookups need to resolve `tcp`/`udp`).
  - New privileged step runs only the kernel-mount tests under sudo,
    preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
    caches so the second `go test` invocation reuses already-built
    test binaries instead of redownloading modules under root.

The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
2026-04-28 14:06:35 -07:00
Chris Lu
735e94f6ba mount: expose -fuse.maxBackground and -fuse.congestionThreshold flags (closes #9258) (#9268)
* mount: expose `-fuse.maxBackground` flag (closes #9258)

The Linux FUSE driver caps in-flight async requests via
`/sys/fs/fuse/connections/<id>/max_background` (and a derived
`congestion_threshold = 3/4 * max_background`). Heavy upload workloads
need this raised, but the cap currently lives only in `/sys`, so it
resets on reboot/remount.

`weed mount` was hardcoding `MaxBackground: 128`. Promote it to a flag,
default unchanged. Setting `-fuse.maxBackground=2048` reproduces the
manual `echo 2048 > .../max_background` (and gives 1536 for
congestion_threshold automatically) persistently across remounts.

`congestion_threshold` is not exposed as a separate flag because
go-fuse derives it as 3/4 of MaxBackground in InitOut and offers no
hook to override; users wanting a different ratio can still write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount.

* mount: add `-fuse.congestionThreshold` flag, bump go-fuse to v2.9.3

go-fuse v2.9.3 exposes CongestionThreshold as a separate MountOption,
so we can now let users override the kernel's default 3/4-of-max_background
ratio at mount time instead of having to write
/sys/fs/fuse/connections/<id>/congestion_threshold post-mount on every
remount/reboot.

Default 0 preserves existing behavior (kernel derives it as
3/4 * max_background). Non-zero is sent to the kernel verbatim; the
kernel clamps it to max_background if higher.
2026-04-28 13:42:58 -07:00
Chris Lu
08d59750ef rust(volume): export Prometheus metrics for scrubbing operations (#9266)
* rust(volume): export Prometheus metrics for scrubbing operations

Mirrors #9264 in the Rust volume server. Adds three metrics that match
the Go names so the same dashboards/alerts work against either binary:

  - SeaweedFS_volumeServer_scrub_last_time_seconds (gauge)
  - SeaweedFS_volumeServer_scrub_volume_failures   (counter)
  - SeaweedFS_volumeServer_scrub_shard_failures    (counter)

Metrics are aggregated at the volume / EC shard level, labelled by
VolumeScrubMode (UNKNOWN/INDEX/FULL/LOCAL) to match Go's
req.GetMode().String().

* rust(volume): record scrub metrics before post-scrub error check

Address PR feedback:
  - Move metric emission before the mark_broken_volumes_readonly error
    check so scrub failures are persisted even when the follow-up
    mark-readonly admin action fails (matches Go's volume_grpc_scrub.go).
  - Extract the duplicated metric block into emit_scrub_metrics() shared
    by both ScrubVolume and ScrubEcVolume. The shard-failures family
    stays untouched on regular volume scrubs to mirror Go.
2026-04-28 13:29:32 -07:00
Lisandro Pin
3f3aaa7cc8 Export Prometheus metrics for scrubbing operations. (#9264)
This PR introduces three new metrics...

  - `scrub_last_time_seconds`
  - `scrub_volume_failures`
  - `scrub_shard_failures`

...capturing overall volume scrub results, and allowing to construct alerts
and dashboards to monitor scrubbing progress.

Note that these metrics are aggregated at the volume/EC shard level, and not
intended for fine-grained tracking of scrubbing operations.
2026-04-28 12:34:02 -07:00
Chris Lu
294f7c3d04 shell: expand ~ in local file path arguments (#9265)
* shell: expand `~` in local file path arguments

The weed shell parses commands itself instead of going through an OS
shell, so a path like `~/Downloads/foo.meta` was passed verbatim to
`os.Open`, which fails because no `~` directory exists. Users had to
spell out absolute home paths in every command.

Add an `expandHomeDir` helper that resolves a leading `~` or `~/...` to
the user's home directory, and run user-supplied local file paths in
the affected shell commands through it:

  fs.meta.load          (positional file)
  fs.meta.save          (-o)
  fs.meta.changeVolumeId (-mapping)
  s3.iam.export         (-file)
  s3.iam.import         (-file)
  s3.policy             (-file)
  s3tables.bucket       (-file)
  s3tables.table        (-file, -metadata)
  volume.fsck           (-tempPath)

Filer-namespace path flags (`-dir`, `-path`, `-locationPrefix`, etc.)
are unaffected; they live in the filer, not on the local FS.

* shell: reuse util.ResolvePath instead of a new helper

util.ResolvePath already does tilde expansion; drop the local
expandHomeDir helper and route every shell call site through it.
2026-04-28 12:30:13 -07:00
Chris Lu
e2c8791441 fix(nfs): reject NFSv4 calls with PROG_MISMATCH so clients fall back to v3 (#9262)
* feat(nfs): add NFSv3-only RPC version filter

The upstream willscott/go-nfs library dispatches RPC calls by (program,
procedure) only — it does not validate the program version. A client
sending NFSv4 (prog 100003 vers 4 proc 1 COMPOUND) lands on the same
handler map as NFSv3 and gets routed to v3 SETATTR, which parses the
COMPOUND args as SETATTR3args and writes a malformed reply. The kernel
then returns EPROTONOSUPPORT and mount.nfs prints "requested NFS version
or transport protocol is not supported" without retrying v3.

This commit adds a listener wrapper that peeks the first RPC frame on
each new TCP connection. If the program is NFS or MOUNT and the version
is not 3, it writes a protocol-correct PROG_MISMATCH reply (supported
range 3..3, per RFC 5531) directly to the socket and closes the
connection. v3 frames are replayed unchanged via a bufio reader so go-nfs
sees the original bytes. Unknown programs pass through so go-nfs's own
PROG_UNAVAIL handling stays in charge.

The filter is not yet wired into the server; the next commit activates
it. Tests cover NFSv4 reject, MOUNTv4 reject, NFSv3 pass-through, and
unknown-program pass-through.

* fix(nfs): wire NFSv3 version filter into the listener chain

Place the version filter after the optional client allowlist so that
unauthorized peers are still rejected first by IP/CIDR before we look at
RPC content. With the filter active, a Linux client doing the default
v4-first probe gets a clean PROG_MISMATCH reply pointing at v3, which
lets mount.nfs (and the in-kernel client) skip v4 and reuse the same v3
mountOptions that already work for rclone serve nfs against this
deployment.

* test(nfs): exercise MOUNT v4 in the v4-rejection test, not v1

TestVersionFilterRejectsMOUNTv4WithProgMismatch was sending
mountProgramID with version 1, so the test never actually covered the
"reject MOUNT v4" path it claims to exercise. The filter does reject any
non-v3 version uniformly, so the test still passed, but a future change
that tightened the version check (for example, only rejecting v4) would
let this test silently lie about coverage. Bump the call to version 4 so
the name matches what is actually exercised.

* refactor(nfs): reuse package RPC constants and io.ReadFull in version filter

The RPC numeric constants (msg_type=CALL/REPLY, MSG_ACCEPTED, PROG_MISMATCH,
AUTH_NONE, the NFS/MOUNT program numbers) are already named in
portmap.go alongside the portmap responder. Reuse them here instead of
defining a parallel set in rpc_version_filter.go: keeping one source of
truth per package means a future correction in one spot can't drift away
from the other. The filter-only constants (peek timeout, peek length,
supportedNFSVer) stay local because they have no portmap analog.

In the test, drop the bespoke readFull loop in favor of io.ReadFull.
The custom version was a near-identical reimplementation that did not
return io.ErrUnexpectedEOF on short reads, so the standard library is
both shorter and more diagnostic-friendly.

* fix(nfs): move RPC peek off the Accept path

The previous wrapper called filterFirstRPCFrame inline inside
versionFilterListener.Accept, which meant a single slow or idle TCP
connect could hold rpcVersionFilterPeekTimeout (10s) of head-of-line
blocking against every other accept: gonfs.Serve calls Accept serially,
so each in-flight peek stalled the next legitimate client until the
deadline expired. An attacker who simply opens a TCP connection without
sending any RPC payload could trivially throttle accept throughput.

Restructure the wrapper so a background goroutine drives the inner
Accept loop and hands each raw conn to its own short-lived goroutine
that runs the peek. Validated conns are sent on a buffered-once channel,
which the wrapper's Accept reads from; rejected conns finish their
PROG_MISMATCH reply and disappear without ever reaching the channel.
This means N concurrent slow clients only block themselves, not the
N+1th fast client that connects after them.

Add Close coordination — sync.WaitGroup for the accept loop and per-conn
peek goroutines, plus a closed channel so Accept unblocks immediately on
shutdown — so the wrapper now satisfies the full net.Listener contract
instead of relying on the embedded listener.

Add a regression test that opens a slow conn (TCP only, never writes)
and a fast conn (sends a v3 frame) and asserts the fast conn reaches
the inner accept handler well below the peek timeout.

* test(nfs): assert io.EOF (not just any error) after PROG_MISMATCH close

The post-rejection check was only failing when conn.Read succeeded; any
error — including a deadline timeout because the server kept the socket
open — let the test pass. That defeats the point of the assertion: a
regression where the filter replies but forgets to close would slip
through silently.

Match against io.EOF explicitly. The TCP semantics are deterministic
here: the server writes PROG_MISMATCH, calls conn.Close(), the client
reads what's left in flight and then sees a clean FIN, which surfaces
as io.EOF on the next zero-byte read.

* fix(nfs): reject short first fragments before parsing RPC header fields

bufio.Reader.Peek(28) is willing to read across record boundaries to
satisfy the requested length, so a final fragment whose body is shorter
than the 24-byte fixed RPC CALL header (xid + msg_type + rpcvers + prog
+ vers + proc) leaves the trailing peek bytes pointing at the next
RPC's framing or whatever bytes happen to follow on the wire. Indexing
hdr[16:24] for prog/vers in that state can spuriously reject (or pass
through) traffic based on data that doesn't belong to the request being
classified.

Drop those frames out of the filter early: if the first fragment can't
possibly hold a full CALL header, pass the connection straight to
go-nfs, which has its own framing-error handling for malformed input.

Add a regression test that crafts a 12-byte first fragment whose
trailing peek bytes are deliberately shaped like an NFSv4 CALL — without
the length check the filter sends a PROG_MISMATCH; with it, the conn
passes through silently. Verified by stashing the production-code change
and running the test in isolation: it fails as expected without the fix.

* fix(nfs): retry transient Accept() errors instead of treating any error as terminal

acceptLoop previously exited on the first error returned by the inner
listener's Accept(). That conflates two very different failure modes:
permanent shutdown (the listener was Close()d, OS-level fatal failure)
and transient resource pressure (EMFILE, EAGAIN, ECONNABORTED on
accept). The transient case should not take the entire NFS server down
— a single fd-table-full event would leave the deployment offline until
restart.

Classify the error: errors.Is(err, net.ErrClosed) is the permanent
signal we already wanted to surface to Accept(); everything else is
transient. Log at V(1) and back off rpcVersionFilterAcceptBackoff
(50ms, mirroring portmap.go's portmapRetryBackoff) before retrying. The
backoff sleep is interruptible via the closed channel so Close() still
shuts the loop down promptly.

Add a regression test that wraps a real listener with one that injects
3 fake transient errors before delegating, and asserts Accept() still
delivers the next real connection. Verified the test fails on the old
"any error is terminal" loop and passes with this change.

* fix(nfs): only synthesize PROG_MISMATCH for ONC RPC v2 traffic

The filter was rejecting any CALL-shaped record with prog=100003 or
100005 and vers!=3, regardless of the rpcvers field. If the caller is
speaking some other protocol that happens to share the port — or just
sending garbled bytes — pretending to be an NFSv3 server replying
PROG_MISMATCH is misleading at best, and at worst fabricates a coherent
RPC reply for traffic we don't actually understand.

Add an rpcvers==2 check between the msg_type and prog/vers parses. Any
non-v2 record now passes through to go-nfs, whose RFC 5531 §9
RPC_MISMATCH handling is the correct place to reject mis-versioned RPC.

Regression test takes a normal v3 NFS CALL frame, overwrites the rpcvers
field with 99, and asserts no PROG_MISMATCH-shaped reply lands on the
client and that the conn is delivered to the inner accept handler.
Verified the test fails on the previous code (filter still rejected on
prog/vers alone) and passes with the guard in place.

* fix(nfs): bound Close() latency by evicting in-flight prefilter conns

Close() does wg.Wait() to drain handleConn goroutines, but each of those
goroutines can be parked inside filterFirstRPCFrame's bufio.Peek for up
to rpcVersionFilterPeekTimeout (10s) waiting for the very first RPC
header. A client that completes the TCP handshake but never sends a
byte therefore stretched shutdown by 10s per such conn — a real
regression for stop/restart paths and for tests that just want to tear
the listener down.

Track raw (pre-peek) conns in versionFilterListener.inFlight as
handleConn enters, untrack on exit, and have Close() forcibly close
every tracked conn before wg.Wait. Closing the underlying conn breaks
its Peek immediately, so handleConn returns within a single scheduler
hop. trackInFlight also short-circuits if shutdown has already started,
so a conn accepted after signalClose can't slip past the eviction.

Black-box regression test opens 4 idle TCP-handshake-only conns, lets
their handleConn goroutines settle into Peek, and asserts Close()
returns under 2s. Verified: same test fails on the previous code with
Close taking ~9.9s; passes here at ~100ms.
2026-04-28 12:17:54 -07:00
Chris Lu
0fa0a56a5a filer(mysql): TLS hostname/SNI knobs + MariaDB upsert documentation (#9260)
* refactor(filer/mysql): set tls.Config per-instance via Connector instead of global registry

Replace the use of `mysql.RegisterTLSConfig("mysql-tls", ...)` and the
`&tls=mysql-tls` DSN suffix with a per-instance setup that assigns the
`*tls.Config` directly to `mysql.Config.TLS` and opens the database via
`mysql.NewConnector` + `sql.OpenDB`.

The driver's TLS-config registry is process-wide; if a second `MysqlStore`
were ever initialized with different TLS settings (e.g., a filer plus a
separately configured store) the second registration would silently
overwrite the first. The connector pattern keeps the TLS configuration
attached to the connector and avoids that global side effect.

Behavior is otherwise unchanged: TLS is enabled when `enable_tls=true`,
the same `ca_crt`/`client_crt`/`client_key` knobs are honored, and the
TLS minimum version remains 1.2.

* filer(mysql): use system root CAs when ca_crt is empty

Previously, enabling `enable_tls=true` without setting `ca_crt` returned an
unhelpful empty-path read error. Many managed MySQL/MariaDB providers serve
certificates that chain to a public CA already in the host's trust store, so
requiring an explicit CA bundle adds friction with no security benefit.

Leave `RootCAs` unset when `ca_crt` is empty so Go's `tls.Config` falls back
to the system trust store, matching the standard behavior of `mysql --ssl`.
Existing setups with `ca_crt` configured are unaffected.

Also wraps the CA read/parse errors with the file path for easier diagnosis.

* filer(mysql): fail loudly when client_crt / client_key are unreadable

The previous implementation called `tls.LoadX509KeyPair` and silently
discarded any error, falling back to a non-mTLS connection. A typo or
permissions problem in `client_crt` / `client_key` therefore appeared as a
confusing server-side handshake error rather than as a config error,
because the server was expecting a client cert that the filer never sent.

Treat the keypair as required when either path is set, and surface the
underlying load error with both filenames so the misconfiguration is
obvious. The default (both paths empty) is unchanged: no client cert is
sent.

* filer(mysql): add tls_insecure_skip_verify and tls_server_name knobs

When the filer connects to a MySQL/MariaDB cluster whose server
certificate's SAN does not match the connection address (common with
internal load balancers, IP-only connection strings, or self-signed
cluster certs), the TLS handshake fails with `x509: certificate is valid
for X, not Y`. There was previously no way to fix this short of reissuing
the cert.

Expose two new optional knobs on `[mysql]`:

- `tls_server_name` overrides the SNI / cert hostname used for
  verification — the standard fix when the cert SAN is correct but the
  connection address is not.
- `tls_insecure_skip_verify` disables verification entirely as an escape
  hatch for testing or for clusters with no usable SAN.

Both default to off, so existing configurations continue to verify the
server certificate against the connection address as before.

* docs(scaffold/filer.toml): document mysql TLS knobs and MariaDB upsert override

- Document the new `tls_insecure_skip_verify` and `tls_server_name` options.
- Update the `ca_crt` comment to reflect that it is optional and that the
  system trust store is used when the path is empty (matches the runtime
  behavior in mysql_store.go).
- Reword the client cert comments to make the mTLS pairing requirement
  explicit (both `client_crt` and `client_key` must be set together).
- Add a commented-out MariaDB / MySQL 5.7 alternative for `upsertQuery`,
  noting that the default (`AS new` row alias) requires MySQL 8.0.19+.

* filer(mysql): drop redundant blank import of go-sql-driver/mysql

The package was imported twice: once with the `mysql` alias (used for
`mysql.MySQLError`, `mysql.Config`, `mysql.NewConnector`, etc.) and once
as `_` to register the driver. The named import already triggers
`init()` and registers the driver, so the blank import is dead weight.
2026-04-28 01:29:41 -07:00
Chris Lu
135af25b55 fix(grpc): require host match before routing dials to local Unix socket (#9254) (#9257)
* fix(grpc): require host match before routing dials to local Unix socket (#9254)

resolveLocalGrpcSocket keyed the Unix-socket hijack on port alone, so a
remote peer reusing a local gRPC port (e.g. a standalone `weed volume`
defaulting port.grpc=17334 against a `weed server` whose in-process
volume socket is also on 17334) had its inbound RPCs silently rerouted
to the local socket. In a cross-DC replication=100 cluster this surfaced
as persistent volume-grow failure: both AllocateVolume RPCs landed on
the local volume server, the second returned "Volume Id N already
exists!", the grow rolled back, and S3 PUTs to the new collection
returned 500.

Track a per-port set of host strings that count as "this machine" and
require the dial host to be in that set before redirecting. Loopback
aliases (localhost, 127.0.0.1, ::1, "") are always included so
same-process dials via loopback still take the socket fast path.

* test(grpc): cover empty-host bare-port dial in local-socket regression test

The empty alias is registered explicitly so SplitHostPort outputs like
":17334" (which can occur when a caller dials a bare port) take the
local-socket fast path. Add a case so that path is exercised.
2026-04-27 23:10:36 -07:00
Parviz Miriyev
e2f96687ff fix(admin): use protocol-relative URLs for component links so HTTPS clusters don't break clicks (#9256)
* fix(admin): use protocol-relative URLs for component links

Hardcoded http:// in admin UI templates breaks browser-initiated clicks
to master / volume / filer / EC shard / Iceberg REST URLs whenever the
target component runs HTTPS-only via security.toml [https.X] sections.
The browser sends plain HTTP to a TLS-only endpoint and gets 400
"client sent an HTTP request to an HTTPS server".

Same root pattern as #9227 (admin's own backend /dir/status fetch);
this PR is the browser-facing equivalent.

Replace fmt.Sprintf("http://%s...") with fmt.Sprintf("//%s...") and the
JS-string '<a href="http://' with '<a href="//' so the browser uses the
same scheme as the page hosting the link. Backwards compatible:
  - HTTPS-only deployments: links now work
  - HTTP-only deployments: identical behavior to before
  - Mixed: edge case, addressed by future per-component public-URL work

Affected templates (9 files), each kept in lockstep with its generated
_templ.go sibling so reviewers don't need to run templ generate:

- weed/admin/view/app/admin.templ
- weed/admin/view/app/cluster_filers.templ
- weed/admin/view/app/cluster_masters.templ (Go templ + JS modal)
- weed/admin/view/app/cluster_volume_servers.templ (Go templ + JS modal)
- weed/admin/view/app/cluster_volumes.templ
- weed/admin/view/app/ec_volume_details.templ
- weed/admin/view/app/volume_details.templ
- weed/admin/view/app/iceberg_catalog.templ
- weed/admin/view/app/s3tables_buckets.templ

17 link constructions total, +32/-32 lines.

* fix(admin): protocol-relative URLs in iceberg + s3tables JS overrides

Per Gemini code review on this PR: the JS scripts in iceberg_catalog
and s3tables_buckets templates overwrite the href attribute of the
"Open Iceberg REST" links after page load, replacing the
protocol-relative URL set by the templ render with a hardcoded
http://<host>:<port>/v1/config.

Apply the same protocol-relative fix to the JS template literals so
they don't undo the templ-side change. Browser uses the page scheme
(http or https) to fill in the protocol.

Mirrored in iceberg_catalog_templ.go and s3tables_buckets_templ.go.

* fix(admin): displayed Iceberg endpoint scheme follows page protocol

Per CodeRabbit review on this PR: the on-page guidance text in iceberg
and s3tables templates still showed a literal `http://` even after the
clickable link was switched to a protocol-relative URL. In HTTPS-only
deployments operators see `http://host:8181/v1` as the suggested
endpoint, copy it, and get a broken connection.

Wrap the scheme in <span id="iceberg-protocol"> (and the s3tables
counterpart) and have the existing inline script set its innerText to
window.location.protocol minus the trailing colon. Same pattern as the
existing dynamic host substitution. Mirrored in *_templ.go so reviewers
do not need templ generate.

SQL/JSON code-block examples (CREATE EXTERNAL TABLE ... ENDPOINT
'http://...', "uri": "http://..." ) are intentionally left as-is —
they are starter snippets users adapt to their environment, not
clickable or copy-paste-into-runtime values. Happy to follow up with
server-side scheme threading if requested.
2026-04-27 23:10:11 -07:00
Chris Lu
363d5caa85 test(s3-retention): purge stale buckets before each create to avoid volume exhaustion
The WORM suite creates one bucket per test, each backed by ~3 reserved
volumes on the data node. With ~30 tests and the default `weed mini`
volume cap, the data node runs out of slots midway through the run and
every PutObject after that fails with InternalError.

Hook a sweep of every test-prefix bucket into the create helpers so a
panicked or interrupted prior test cannot leak buckets into the next.
2026-04-27 22:14:20 -07:00
Chris Lu
e4a635a04d feat(docker): default CMD to mini -dir=/data for service-container use (#9255)
* feat(docker): default CMD to `mini -dir=/data` for service-container use

GitHub Actions service containers cannot pass arguments to the image
entrypoint, so `chrislusf/seaweedfs` is currently unusable as a service
because it requires a `weed` subcommand. Set a sensible default CMD so
the image starts a complete single-process cluster (master, volume,
filer, S3 on :8333, admin UI) out of the box, while still being
overridable by passing any other subcommand at `docker run` /
compose time.

Also add a `mini` case to entrypoint.sh so its logs go to stderr,
matching the existing master/volume/server cases.

Closes #9247

* fix(docker): make `isArgPassed` match `--flag` as well as `-flag`

The Go fla9 library accepts both `-flag` and `--flag` syntax, but
`isArgPassed` only matched the single-dash form. That meant a user
passing `--dir=/foo` to `weed mini` (or `--max=5` to `volume`,
`--volume.max=5` to `server`) would not suppress the entrypoint's
default, and the duplicate flag was silently appended to the command
line — relying on last-wins parsing for correctness. Match double-dash
explicitly so the override is detected for every case in the file.
2026-04-27 21:21:58 -07:00
Chris Lu
d92c5e057a test(iceberg): cross-engine regression coverage for deterministic table locations (#9074) (#9253)
* test(trino): regression for unique-table-location=false (#9074)

With #9246's namespace-location property, Trino's REST catalog can
resolve table locations even when the connector is configured with
`iceberg.unique-table-location=false`, and CREATE/CTAS lands at the
deterministic <namespace-location>/<tableName> path with no
UUID-suffixed sibling. Lock that in:

- writeTrinoConfig now parameterizes the unique-table-location flag via
  a withDeterministicTableLocation() option.
- setupTrinoTest forwards config options.
- TestTrinoDeterministicLocationCTAS exercises a fresh CREATE TABLE +
  CTAS with the flag flipped off and asserts the on-disk layout has
  no UUID-suffixed sibling under the namespace dir, proving each table
  occupies a single dir.

Refs #9074

* test(spark): regression for CTAS without explicit location (#9074)

iceberg-spark has no equivalent of Trino's unique-table-location flag —
its REST catalog interactions always produce the deterministic
<namespace-location>/<tableName> path. Without #9246's namespace-location
property, Spark cannot resolve a table location for a CREATE TABLE that
omits an explicit LOCATION clause; with it, the operation succeeds and
the table lands at the expected single-dir-per-table layout.

TestSparkDeterministicLocationCTAS walks the same scenario as the Trino
test: CREATE TABLE without LOCATION, INSERT, CTAS, SELECT count, then
asserts via S3 ListObjectsV2 that no UUID-suffixed sibling directory
appears under the namespace.

Refs #9074

* test(duckdb): read table at deterministic location via REST catalog (#9074)

DuckDB's iceberg extension is a read-only consumer in this flow — there
is no client-side UUID-suffixing toggle to test. The relevant question
post-#9246 is whether DuckDB can ATTACH the REST catalog and resolve a
table at a deterministic <bucket>/<namespace>/<tableName> path produced
by writers that don't suffix UUIDs (iceberg-spark, pyiceberg, Trino with
unique-table-location=false).

TestDuckDBDeterministicLocationRead creates a namespace + minimal table
via direct REST API calls (so no client-side UUID is added), confirms
the catalog returns a deterministic location URL, then runs DuckDB
through ATTACH ... TYPE 'ICEBERG' and DESCRIBE on the table. Asserting
DESCRIBE succeeds proves DuckDB walked the catalog → metadata → schema
chain against the deterministic on-disk path.

The test skips gracefully when the DuckDB image lacks the iceberg
extension or the ATTACH-iceberg syntax, so older base images don't fail
the suite.

Refs #9074
2026-04-27 20:14:48 -07:00
Chris Lu
fe50da4934 test(fuse): stream verify-phase diagnostics from writeback stress test
The 45m suite alarm fires on TestWritebackCacheStressSmallFiles with no
output from the test, since t.Logf is buffered until the test completes
and the alarm panic skips that flush. Add streaming stderr progress, an
explicit verify-phase budget that t.Fatalf's with a goroutine dump on
overrun, and per-retry/per-failure logging so the next hang shows which
file(s) the mount could not read back.
2026-04-27 19:44:26 -07:00
Chris Lu
9d6d068f41 feat(seaweed-volume): cross-disk EC shard reconciliation (#9212) (#9252)
* fix(seaweed-volume): fall back to idx dir when reading .vif

EcVolume::new and read_ec_shard_config only looked for .vif at the
data dir. With the cross-disk reconcile path (where shards live on
one disk and .ecx / .ecj / .vif live on a sibling disk —
seaweedfs/seaweedfs#9212 / #9244), this would either write a stub
.vif on the shard disk and lose the real EC config + dat_file_size
or fall back to default ratios despite a perfectly good .vif being
present elsewhere on the same volume server.

Add a small `locate_vif_path` helper that prefers the data dir and
falls back to the idx dir when it differs, and thread the data dir
+ idx dir pair through `read_ec_shard_config`. Three call sites in
grpc_server.rs (VolumeEcShardsGenerate, VolumeEcShardsRebuild, scrub)
updated; the scrub path passes the same dir for both args because
`find_ec_dir` is the only locator there.

* feat(seaweed-volume): primitives for cross-disk EC shard reconcile

Adds the three small helpers the reconcile pass needs:

- DiskLocation::mount_ec_shards_with_idx_dir — mounts shards on this
  disk while pointing the EcVolume at a sibling disk's idx dir for
  .ecx / .ecj / .vif. Mirrors loadEcShardsWithIdxDir in
  weed/storage/disk_location_ec.go. The existing mount_ec_shards is
  kept as a thin wrapper over it.

- EcVolume::has_shard — `pub` accessor over the internal Vec<Option>
  shard slot so the reconcile pass can skip shards that are already
  registered.

- pub(crate) re-exports of parse_collection_volume_id and
  parse_ec_shard_extension under names parse_collection_volume_id_pub
  and is_ec_shard_extension so the reconcile module can call them
  without re-implementing the parsers.

No behaviour change. Reconciliation logic in the next commit.

* feat(seaweed-volume): cross-disk EC shard reconciliation (#9212)

Closes the loader half of seaweedfs/seaweedfs#9212 on the Rust side,
mirroring the Go fix in seaweedfs/seaweedfs#9244. With the auto-load
in feat/rust-load-all-ec-shards-9212 in place, the only remaining gap
is shards that landed on a disk without their `.ecx` — for example
when ec.balance / ec.rebuild moved them onto a destination node's
second disk while leaving the index files on the disk that already
held the volume. Without this, those orphan shards stay invisible to
the master and ec.rebuild reports the volume as unrepairable.

After every DiskLocation has finished its per-disk EC scan, sweep the
store for shards that live on a disk without local index files and
load them by reaching across to a sibling disk's `.ecx` / `.ecj` /
`.vif`:

  - Store::reconcile_ec_shards_across_disks walks each disk for
    orphan `.ec??` files (present on disk, not yet registered to an
    EcVolume) and matches them against an `(collection, vid) ->
    EcxOwnerInfo` map of which disk owns each `.ecx`.
  - Each matched group is mounted on its physical disk's ec_volumes
    map (so heartbeat reporting carries the right disk_id per shard)
    via `mount_ec_shards_with_idx_dir`, pointing the EcVolume at the
    sibling's idx dir.
  - `index_ecx_owners` records the directory each `.ecx` was found in
    (IdxDirectory or Directory) so the loader doesn't ENOENT when the
    legacy "written before -dir.idx was set" layout puts `.ecx` in
    the data dir. This mirrors the PR #9244 review fix from
    @gemini-code-assist / @coderabbitai (see Go commit af57cc652).
  - True orphans (no `.ecx` anywhere on this server) log a warning
    and stay on disk untouched — operator can restore the index later.

Wired into Store::add_location and Store::load_new_volumes so a fresh
restart and any later disk additions both pick up cross-disk shards.

Tests cover all four behaviour shapes:
- shards on dir0 + .ecx on dir1 → reconciled to dir0's ec_volumes
- .ecx in owner's data dir (legacy layout) → reconciled correctly
- self-contained disks → reconcile is a no-op
- truly-orphan shards (no .ecx anywhere) → left on disk, logged

* fix(seaweed-volume): propagate EcVolume::new errors instead of unwrap

mount_ec_shards_with_idx_dir built the missing EcVolume inside an
entry().or_insert_with() closure, which can't return a Result — so
any EcVolume::new failure (e.g. .ecx open error, .ecj create error,
malformed .vif) panicked the volume server via unwrap(). The
constructor already returns Result<>, so propagate it as
VolumeError::Io instead.

Reported in PR #9252 review by @gemini-code-assist (high) and
@coderabbitai (critical).

* perf(seaweed-volume): use DirEntry::metadata in collect_orphan_ec_shards

Replaced the extra fs::metadata(&path) lookup with ent.metadata() so
we don't pay an additional stat syscall per directory entry beyond
what read_dir already returned. Drops the now-unused std::path::Path
import alongside.

Reported in PR #9252 review by @gemini-code-assist.

* fix(seaweed-volume): scrub uses EcVolume's real dir_idx for split-disk volumes

After cross-disk reconciliation an EcVolume can legitimately have
ecv.dir != ecv.dir_idx (shards on one disk, .ecx / .ecj / .vif on a
sibling). The scrub path collapsed both args to find_ec_dir's single
answer, so read_ec_shard_config fell back to the wrong .vif location
for exactly the split-disk layout this PR loads — skewing
shard-count detection and verification results.

Use ecv.dir / ecv.dir_idx directly so scrub reads the metadata from
where the volume's index files actually live.

Reported in PR #9252 review by @coderabbitai.

* feat(seaweed-volume): primitives for split-disk EC volume operations

Reconciliation can mount the same `vid` on multiple DiskLocations
with disjoint shard subsets. The existing first-match `find_ec_volume`
isn't enough for read/unmount/delete/decode paths that need to act on
a specific shard or aggregate across the whole volume — they have to
walk every location and find the right home for each shard.

Add the small Store-level lookup primitives Go's findEcShard /
CollectEcShards already provide:

- `Store::find_ec_shard_location(vid, shard_id)` — returns the index
  of the location that has `(vid, shard_id)` mounted, if any.
- `Store::find_ec_volume_with_shard(vid, shard_id)` — same idea but
  returns the EcVolume directly.
- `Store::collect_ec_shard_dirs(vid, max_shard_count)` — returns
  the EcVolume to use for metadata plus per-shard data dirs (None
  when the shard isn't mounted on any disk). Mirrors
  `Store.CollectEcShards` in `weed/storage/store_ec.go`.

And the EcVolume accessors callers need:

- `EcVolume::has_shard(shard_id)` — was already added for the cross-
  disk reconcile but is now a load-bearing primitive for placement
  decisions on a per-shard basis. Pulled into the dedicated commit.
- `EcVolume::ecx_actual_dir()` — exposes the directory the `.ecx`
  was actually opened from. The decoder needs it for the .ecx
  lookup when shards are split across data dirs and `.ecx` lives on
  a sibling idx dir.

Plus a small defensive change to `DiskLocation::unmount_ec_shards`:
only decrement the per-shard gauge for shards that were actually
mounted. Without this, the upcoming `Store::unmount_ec_shards`
fan-out to every location would underflow the metric whenever a
shard is requested for unmount on a sibling disk that doesn't have
it.

No behaviour change at the call sites yet — wiring follows in the
next commits.

* fix(seaweed-volume): unmount_ec_shards visits every location with the vid

Store::unmount_ec_shards and Store::unmount_ec_shard returned after
the first DiskLocation with the volume id, even if that location did
not contain the requested shard. With reconciled split-disk volumes
(shards 0/12 on disk 0, shard 1 on disk 1 — the issue #9212 layout
this PR loads), VolumeEcShardsUnmount for a later-disk shard became a
silent no-op and Store::delete_ec_shards could remove the shard file
while leaving an in-memory shard + open file handle stale on the
later location.

Walk all locations that have the EcVolume and ask each to unmount
whatever subset of `shard_ids` it actually has — the
`DiskLocation::unmount_ec_shards` defensive guard from the previous
commit makes the fan-out safe (no metric underflow when a sibling
disk is asked to unmount a shard it doesn't hold).

* fix(seaweed-volume): VolumeEcShardRead reads from the shard's home disk

VolumeEcShardRead resolved the EcVolume via first-match
`find_ec_volume(vid)` and then looked up the requested shard on that
single EcVolume. With reconciled split-disk volumes (the layout
seaweedfs/seaweedfs#9212 produces — shards 0/12 on disk 0, shard 1
on disk 1), a request for shard 1 hit disk 0 first and returned
"shard 1 not mounted" even though it was happily mounted on disk 1.

Switch to `find_ec_volume_with_shard(vid, shard_id)` so the lookup
walks every location and returns the EcVolume whose disk actually
holds the shard. The deleted-needle check still works because every
per-disk EcVolume for the same vid points at the same `.ecx` file
(post-reconcile, both disks open the same sealed index).

* fix(seaweed-volume): VolumeEcShardsToVolume aggregates shards across disks

VolumeEcShardsToVolume resolved a single EcVolume via
`find_ec_volume(vid)` and then checked `ec_vol.shards[i]` for each
data shard. With reconciled split-disk volumes that's the wrong
view: the first-match EcVolume only carries the shards on its disk,
so the presence check would either reject the request as
"missing shard" or — if shards happened to be on the first disk —
fall through to `write_dat_file_from_shards(&dir, ...)` which only
reads from the EcVolume's single dir.

Mirror Go's CollectEcShards by aggregating per-shard data dirs
across every location with the volume:

- Add `Store::collect_ec_shard_dirs` (in the previous primitives
  commit) returning the EcVolume to use for metadata + per-shard
  dir slots.
- Extend `find_dat_file_size` and `write_dat_file_from_shards` with
  `_with_dirs` variants that take the `.ec00` dir and per-shard
  dirs separately, so a decoded volume whose shards live on
  several disks can still be reconstructed. The original signatures
  delegate to the new ones with the same dir for all shards, so
  every existing caller keeps working unchanged.
- Rewire VolumeEcShardsToVolume through the helpers — presence
  check sees the union, dat_file_size reads `.ec00` from the right
  disk and `.ecx` from the EcVolume's actual idx dir, the decoder
  reads each shard from its own home dir.

* test(seaweed-volume): split-disk read / unmount / delete / collect

Five tests exercising the four behaviour shapes the PR #9252 review
flagged on multi-location EC volumes. Each builds the cross-disk
split layout from issue #9212 (shards 0 and 12 on disk 0, shard 1 +
.ecx on disk 1) via the new `build_split_disk_store` helper and
asserts:

- `find_ec_shard_location` / `find_ec_volume_with_shard` route to
  the disk that actually holds each shard (not first-match).
- `Store::unmount_ec_shards([1])` reaches disk 1 and removes shard 1
  while leaving disk 0's unrelated shards mounted (used to be a
  silent no-op).
- `Store::unmount_ec_shard(vid, 1)` ditto for the single-shard
  variant.
- `Store::delete_ec_shards` removes both the on-disk file and the
  in-memory mount on the right disk; previously deletion could
  remove the file while the in-memory shard with its open file
  handle survived on a different location.
- `collect_ec_shard_dirs` reports the right per-shard data dir for
  each location and `None` for unmounted shards.

* fix(seaweed-volume): retry same-disk legacy .ecx layout in reconcile

The unconditional `owner.location == loc_idx` skip missed the layout
where `idx_directory` is configured but the owner's `.ecx` / `.ecj` /
`.vif` still live in `loc.directory` (the legacy "written before
-dir.idx was set" shape). In that case the per-disk loader's
mount_ec_shards used `loc.idx_directory` and ENOENT'd, then this
branch suppressed the only recovery path — the owner disk's own
shards stayed unloaded after startup.

Tighten the skip so it only fires when the discovered owner dir is
already `loc.idx_directory` (the loader-already-tried-and-failed
case). When `owner.idx_dir` differs (legacy data-dir layout), queue
a same-disk retry through `mount_ec_shards_with_idx_dir(...,
&owner.idx_dir)` so reconcile becomes the recovery path.

Reported in PR #9252 review by @coderabbitai.

* fix(seaweed-volume): roll back partial mounts on cross-disk reconcile failure

mount_ec_shards_with_idx_dir adds shards one at a time and
increments the `ec_shards` gauge per shard that successfully attaches.
A mid-loop failure (e.g. an EcVolumeShard::open error after the
first few shards already attached) used to leave the EcVolume
half-mounted with stale metric increments — the warn!() branch only
logged the error.

Mirror DiskLocation::handle_found_ecx_file's recovery path: drive
the cleanup through `loc.unmount_ec_shards(vid, &shard_ids)` after
a failed mount. The defensive change in #9251 makes
unmount_ec_shards only decrement the gauge for shards that were
actually mounted and drops the EcVolume when it reaches zero
shards, so the rollback is safe even though some of `shard_ids`
never attached.

Reported in PR #9252 review by @coderabbitai.

* test(seaweed-volume): cover the two reconcile fixes from PR #9252 review

Two new tests in store_ec_reconcile:

- test_reconcile_recovers_same_disk_legacy_ecx_layout — sets up the
  layout where idx_directory is configured but the owner's .ecx
  lives in loc.directory. The per-disk loader's mount_ec_shards
  uses loc.idx_directory and fails; reconcile should retry on the
  same disk with the owner's actual idx_dir and the owner's own
  shards must come back online.

- test_reconcile_rolls_back_partial_mounts_on_failure — sabotages
  one of the orphan shard files (replaces it with a directory of
  the same name) so EcVolumeShard::open errors out partway through
  mount_ec_shards_with_idx_dir. Asserts the post-condition that no
  EcVolume entry retains a "shard mounted" claim that doesn't
  correspond to a real shard file.
2026-04-27 19:01:30 -07:00
Chris Lu
49e83a26cb feat(seaweed-volume): auto-load EC shards on startup (#9212) (#9251)
* feat(seaweed-volume): auto-load EC shards on startup

The Rust volume server's load_existing_volumes only scanned .dat
files; EC shards on disk stayed invisible until something explicitly
issued VolumeEcShardsMount. Strict superset of the issue
seaweedfs/seaweedfs#9212 reports for Go: after a fresh restart, every
local EC shard was missing from the master's view.

Port loadAllEcShards from weed/storage/disk_location_ec.go:

- DiskLocation::load_all_ec_shards walks Directory (and IdxDirectory
  if separate) sorted, groups .ec?? shard files by (collection, vid),
  validates and mounts each group when its matching .ecx is found.
- handle_found_ecx_file: validate_ec_volume + mount_ec_shards path,
  with cleanup when .dat exists and validation fails (incomplete
  encoding) or load fails.
- check_orphaned_shards: cleans up shard remnants whose .ecx never
  arrived AND whose stale .dat is still present (interrupted
  encoding); leaves them on disk otherwise so cross-disk
  reconciliation / operator recovery can find them.
- check_dat_file_exists / parse_collection_volume_id /
  parse_ec_shard_extension: small helpers mirroring Go's checkDatFileExists,
  parseCollectionVolumeId, and the `\.ec\d{2,3}` regex.
- Wire through load_existing_volumes after the .dat scan; failures
  log but don't fail the disk's startup.

Tests:
- test_parse_ec_shard_extension covers .ec00–.ec255 and the rejection
  of .ec0, .ec999, .ecx, .ecj, .dat, and missing leading dot.
- test_load_all_ec_shards_mounts_pairs_with_ecx: shards + .ecx + .vif
  on disk get mounted into ec_volumes after load_existing_volumes.
- test_load_all_ec_shards_keeps_orphan_shards_when_no_dat: orphan
  shards (no .ecx, no .dat) stay on disk untouched
  (distributed-EC scenario).
- test_load_all_ec_shards_cleans_orphan_shards_when_dat_exists:
  orphan shards alongside a stale .dat get cleaned up
  (interrupted-encoding scenario).

Prerequisite for porting the cross-disk orphan-shard reconciliation
in seaweedfs/seaweedfs#9244 to Rust.

* fix(seaweed-volume): dedupe filenames when scanning data + idx dirs

load_all_ec_shards scans both `directory` and `idx_directory` (when
they differ) so the loop can pair `.ec??` shards with their `.ecx`
regardless of which dir owns the index. If the same filename is
present in both — possible in idempotent legacy layouts that
pre-date `-dir.idx` — the previous implementation processed it
twice. mount_ec_shards increments the per-shard `ec_shards` metric
inside the loop, so a duplicated `.ec??` entry would double-count
the gauge.

Use a HashSet<String> while accumulating entries so each filename
is processed exactly once.

Reported in PR #9251 review by @gemini-code-assist.

* fix(seaweed-volume): drive partial-mount cleanup through unmount_ec_shards

handle_found_ecx_file calls mount_ec_shards which adds shards one at
a time. mount_ec_shards increments the `ec_shards` gauge per shard
that successfully attaches. If mount fails halfway, plain
ec_volumes.remove(vid) drops the EcVolume but leaves the gauge
incremented for whatever did mount.

Drive the cleanup branches through unmount_ec_shards instead — it
mirror-decrements the gauge per shard and only then drops the
EcVolume. Same shape applied to both .dat-exists and distributed-EC
fallbacks.

Reported in PR #9251 review by @gemini-code-assist.

* docs(seaweed-volume): clarify parse_ec_shard_extension shard-id range

Doc previously said `.ec00`–`.ec999` but the implementation rejects
any shard id > 255 (matches the `EcVolumeShard` u8 typed shard id
and Go's `strconv.ParseInt(... 10, 64)` + `> 255` guard). Fix the
doc to say `.ec00`–`.ec255` and explain why the 3-digit form is
still recognised.

Reported in PR #9251 review by @coderabbitai.
2026-04-27 16:41:46 -07:00
Chris Lu
933ae6e386 fix(seaweed-volume): port EC shard placement fix to Rust (#9212, mirrors #9245) (#9250)
* feat(seaweed-volume): add DiskLocation::has_ecx_file_on_disk

Mirrors `DiskLocation.HasEcxFileOnDisk` from the Go side
(seaweedfs/seaweedfs#9245). Reports whether this disk has a sealed
.ecx index file for (collection, vid) by stat'ing the IdxDirectory
first, then falling back to Directory if different — covers the
legacy "written before -dir.idx was set" layout. Skips entries that
are directories so a stray dir named `<col>_<vid>.ecx` doesn't
register as a present index file.

Unlike has_ec_volume() this does not require the EC volume to be
mounted in memory, which makes it the right primitive for placement
decisions during ec.balance / ec.rebuild flows where shards may
arrive before any VolumeEcShardsMount has happened on the receiving
disk.

Wiring + tests in follow-up commits.

* feat(seaweed-volume): add Store::find_ec_shard_target_location

Mirrors `Store.FindEcShardTargetLocation` from the Go side
(seaweedfs/seaweedfs#9245). Single canonical placement primitive for
new EC shard / index files. Selection order:

  1. a disk that already has the EC volume mounted (in-memory),
  2. a disk that owns the .ecx file on disk (volume not yet mounted),
  3. any HDD with free space,
  4. any disk with free space.

Step 2 is the missing primitive that pinned subsequent shards to the
first-shard disk during ec.rebuild — rebuild only sets
CopyEcxFile=true on the first shard, then relies on auto-select to
land later shards on the same disk. Without an on-disk check
has_ec_volume returns false (no mount yet) and the fallback picked
"any HDD with free space," splitting shards from their .ecx across
disks of the same node and producing the orphan-shard layout
seaweedfs/seaweedfs#9212 reports.

Implementation walks store.locations once with tier scoring; the
highest-tier disk wins, ties broken by free count. The earlier
4-pass waterfall in find_free_location_predicate would have
re-acquired locks per pass.

ec_free_shard_count returns the free count in shard slots (not
volume-equivalent slots). The pre-existing find_free_location*
helpers divide by DATA_SHARDS_COUNT at the end; that truncation can
exclude a disk that has room for several individual shards
(MaxVolumeCount=1, EcShardCount=1, dsc=10 → reports 0 despite 9
free slots), which would re-route subsequent shards off the
.ecx-owning disk and re-introduce the orphan layout. Keep the result
in shard slots throughout. The unlimited-disk branch
(MaxVolumeCount==0) reports a synthetic large free count
decremented by current usage so unlimited disks stay eligible and
tie-breaks still prefer the less-loaded one.

data_shard_count is taken as a parameter rather than read from
DATA_SHARDS_COUNT so custom-ratio builds can swap the default
without touching this helper.

Tests cover: pinning to .ecx on disk, mounted-wins-over-stray-.ecx,
HDD fallback, MaxVolumeCount=0 unlimited handling, and the
tight-provisioning truncation case.

* fix(seaweed-volume): route EC shard auto-select through new helper

VolumeEcShardsCopy and the ReceiveFile EC branch both used a 3-tier
inline waterfall: in-memory has_ec_volume → any HDD → any disk. That
checked in-memory state only and missed disks that own the .ecx on
disk but haven't been mounted yet — the orphan-shard placement
hazard from seaweedfs/seaweedfs#9212.

Replace both with a single call to
Store::find_ec_shard_target_location, which adds the .ecx-on-disk
tier between mounted and HDD, and accounts for free space in shard
slots so tight-provisioning configurations don't incorrectly skip a
disk that still has room for individual shards.

Pass DATA_SHARDS_COUNT as the data-shard count for free-slot maths;
the helper takes it as a parameter so custom-ratio builds can swap
the default without touching this file.

* fix(seaweed-volume): grow UNLIMITED_FREE budget and saturate the math

ec_free_shard_count's unlimited branch (MaxVolumeCount=0) used to
clamp to a constant `1` once usage exceeded `1 << 30 ≈ 1e9` shard
slots. With several unlimited disks all past that threshold, every
placement decision among them tied at 1 — tie-break degraded to
"first eligible disk."

Bump the synthetic budget to `1 << 60 ≈ 1.15e18` and use
saturating arithmetic so even pathological usage never wraps i64.
Clamp the return value to `≥ 1` so the disk stays eligible for
placement at any load. Tie-breaks among unlimited disks now keep
preferring the less-loaded one across all realistic deployments.

Reported in PR #9250 review by @gemini-code-assist.
2026-04-27 16:40:39 -07:00
Chris Lu
f50917224a fix(iceberg): default namespace location so fresh CTAS does not race metadata write (#9074) (#9246)
* fix(iceberg): advertise default namespace location for REST clients

Trino's REST catalog has two code paths for CREATE TABLE depending on
whether a table location can be resolved before the catalog call:

    // TrinoRestCatalog.newCreateTableTransaction (Trino 479)
    if (location.isEmpty()) {
        return tableBuilder.create().newTransaction();   // EAGER: REST POST now
    }
    return tableBuilder.withLocation(...).createTransaction(); // DEFERRED

`tableLocation` resolves to null when the user does not pass
`location = '...'` AND `defaultTableLocation` returns null. The latter
happens whenever the namespace's `loadNamespaceMetadata` response has no
`location` property — and our handler returned exactly that.

In the eager branch Trino calls REST POST /v1/.../tables immediately, so
our handleCreateTable persists `<location>/metadata/v1.metadata.json` to
the filer. Trino's IcebergMetadata.beginCreateTable then runs
`fileSystem.listFiles(location).hasNext()` on the same path, finds the
metadata file we just wrote, and throws "Cannot create a table on a
non-empty location" — even on a brand-new schema and table name.

Synthesize a default `location` of `s3://<bucket>/<flattened-namespace>`
on Get/Create namespace responses when one is not stored. With a non-
null `defaultTableLocation`, Trino takes the deferred branch, picks a
unique `<namespace-location>/<table>-<UUID>` path (UUID added by the
standard `iceberg.unique-table-location=true` setting), and the empty-
location check passes. The actual REST POST is deferred to commit time,
so our metadata write lands alongside the data files Trino has already
produced.

Existing namespaces with an explicit `location` property are untouched —
the synthesis only kicks in when the property is absent.

Refs #9074

* test(trino): regression for fresh CREATE TABLE without explicit location

Exercises the follow-up scenario reported on issue #9074: a CREATE TABLE
on a brand-new schema and brand-new table name with NO `location = '...'`
clause, followed by a CTAS on top — exactly the SQL pattern from the
report. Before advertising a default namespace location, the first
CREATE TABLE failed with

    Cannot create a table on a non-empty location:
    s3://iceberg-tables/<schema>/<table>, set
    'iceberg.unique-table-location=true' in your Iceberg catalog
    properties to use unique table locations for every table.

even though the bucket was genuinely empty. The bug surfaced because our
handleCreateTable persists `<location>/metadata/v1.metadata.json` during
Trino's eager-create branch, and Trino's post-create listFiles
emptiness check then trips on the metadata file we just wrote.

The test asserts the CTAS succeeds AND the resulting table contains the
source rows, since the reporter saw the table get created with empty
data when the query failed.

Refs #9074
2026-04-27 16:37:33 -07:00
Chris Lu
cba2f7b1dd fix(volume_server): load orphan EC shards across disks on startup (#9212) (#9244)
* fix(volume_server): load orphan EC shards across disks on startup (#9212)

When ec.balance / ec.rebuild copies an EC shard onto a destination node
without also pinning subsequent shards to the disk that holds .ecx, the
shard ends up on a different physical disk than its index files. The
per-disk loadAllEcShards has no visibility into other DiskLocations on
the same store, so those orphan shards were silently left out of
ecVolumes and never reported to master — volume.list showed partial
counts, and ec.rebuild reported the volume as unrepairable even though
all shards were physically present.

After every DiskLocation finishes its initial pass, sweep the store for
shard files that are on disk but not yet in any EcVolume, look up the
.ecx-owning sibling disk, and load each shard against its physical disk
with dirIdx pointing at the sibling. Each shard is still registered on
its own disk's ecVolumes map so heartbeat reporting carries the right
DiskId per shard (master fix #9219 already aggregates per-disk
messages correctly).

Also fall back to dirIdx for .vif lookup when dir != dirIdx, so the
reconciliation path doesn't write a stub .vif on the shard disk and
lose the real EC config and datFileSize.

* fix(volume_server): track actual .ecx dir in cross-disk reconcile

indexEcxOwners scans both IdxDirectory and Directory to find each
volume's .ecx — the second scan covers the legacy case where index
files were written into the data dir before -dir.idx was configured
(removeEcVolumeFiles already accounts for this in disk_location_ec.go).
But the returned map dropped which directory matched, and reconcile
unconditionally passed owner.IdxDirectory to loadEcShardsWithIdxDir.

When the owner's .ecx is in Directory and IdxDirectory != Directory
(server later re-configured with -dir.idx pointing at a fresh path),
NewEcVolume opens IdxDirectory/.ecx → ENOENT, retries the same-disk
fallback at dataBaseFileName+.ecx — but dataBaseFileName uses the
*orphan* disk's data dir, not the owner's, so it ENOENTs again and the
orphan shards stay unloaded.

Track which scan dir matched in indexEcxOwners and pass it through.
Adds TestLoadEcShardsWhenOwnerEcxIsInDataDir as the regression.

Reported in PR #9244 review by @gemini-code-assist and @coderabbitai.

* refactor(storage): thread dataShardCount as a parameter into calculateExpectedShardSize

The helper used erasure_coding.DataShardsCount directly, but tests in
store_ec_orphan_shard_test.go save .vif with a local dataShards=10
constant. If the package default ever diverged from 10 (e.g. an
enterprise build), the test would write a .vif for one layout while
sizing shard files for another and silently break.

Take dataShardCount as a parameter. Existing callers
(validateEcVolume + size-validation tests + real-world tests) pass
erasure_coding.DataShardsCount unchanged. The orphan-shard tests pass
the same dataShards local they save into .vif, so the persisted shape
and the on-disk shape stay consistent.

Reported in PR #9244 review by @coderabbitai.
2026-04-27 16:01:10 -07:00
Chris Lu
5fbe39320c fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212) (#9245)
* fix(volume_server): pin EC shard auto-select to the .ecx-owning disk (#9212)

ec.rebuild only sets CopyEcxFile=true on the first shard sent to the
rebuilder; subsequent shards rely on VolumeEcShardsCopy / ReceiveFile
auto-select to land on the same disk. The old auto-select used
FindEcVolume (in-memory) to detect the "already has this volume" case.
Mid-rebuild, no EC volume has been mounted yet on the destination, so
FindEcVolume returns nothing and the fallback picks "any HDD with free
space" — which can split shards from their .ecx across disks of the
same node and feed the orphan-shard layout reported in #9212 / fixed
on the loader side in #9244.

Add Store.FindEcShardTargetLocation as the canonical placement
primitive: prefer a mounted EC volume, then a disk that has the .ecx
on disk, then any HDD, then any disk. DiskLocation.HasEcxFileOnDisk is
the new on-disk check, and it looks at IdxDirectory first with a
fallback to Directory to handle .ecx written before -dir.idx was
configured.

Both VolumeEcShardsCopy and ReceiveFile now route through the new
helper, dropping their duplicated 4-level fallback ladder. No protocol
changes; explicit DiskId callers are unaffected.

* fix(volume_server): treat directories named *.ecx as no-match in HasEcxFileOnDisk

os.Stat(".ecx") succeeds for both files and directories. If something
happens to leave a directory named X.ecx in the data or idx folder,
HasEcxFileOnDisk would currently report true and FindEcShardTargetLocation
would route shards to that disk — where NewEcVolume's eventual
OpenFile(O_RDWR) on the same path errors out.

Add a !info.IsDir() check on both stat sites. Cheap and conservative.

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): collapse EC placement helper to a single pass

FindEcShardTargetLocation called FindFreeLocation up to four times. Each
call iterates s.Locations and acquires VolumesLen / EcShardCount RLocks
per disk — for a typical 4-disk node that's 32 RLock cycles per
placement decision.

Walk s.Locations once, score each disk by tier (mounted > .ecx-on-disk
> HDD > any-disk), break ties by free count. The free-slot math is
factored into a small helper that mirrors FindFreeLocation's formula
without re-entering the location's locks. Behaviour is unchanged: each
existing tier still wins over later tiers, and within a tier the disk
with the most free count still wins, matching the original max-tracking
in FindFreeLocation.

Suggested in PR #9245 review by @gemini-code-assist.

* refactor(volume_server): thread dataShardCount as a parameter through EC placement

ecFreeShardCount and FindEcShardTargetLocation referenced
erasure_coding.DataShardsCount directly. Take it as a parameter so
custom-ratio builds (e.g. enterprise) can swap the default without
touching the helper itself, and so unit tests can pin a specific ratio
independent of the package constant. Default callsites in
VolumeEcShardsCopy and ReceiveFile now pass the package default
explicitly; tests pass a literal 10 for clarity.

* fix(volume_server): treat MaxVolumeCount=0 as unlimited in EC placement

ecFreeShardCount computed `MaxVolumeCount - VolumesLen()` and went
negative when MaxVolumeCount was 0 — the "unlimited disk" sentinel
already honoured by Store.hasFreeDiskLocation and friends. With a
negative free count, FindEcShardTargetLocation's `freeCount <= 0`
guard skipped the disk entirely, so unlimited disks could never receive
EC shards via the placement helper.

Special-case MaxVolumeCount<=0: report a synthetic large free count
that decrements with current usage, so unlimited disks are eligible
and tie-breaks still prefer the less-loaded one. Added
TestFindEcShardTargetLocation_HonoursUnlimitedDisk as the regression.

Reported in PR #9245 review by @gemini-code-assist.

* fix(volume_server): account in shard slots, not volume slots, in ecFreeShardCount

FindFreeLocation in store.go ends with `free /= DataShardsCount`,
converting "shard slots free" back to "volume-equivalent slots." The
truncation is harmless there, but my new ecFreeShardCount inherited
the same final divide and re-introduced exactly the orphan-shard
hazard #9245 was meant to prevent: with MaxVolumeCount=1,
VolumesLen=0, EcShardCount=1 the formula reports 0 even though the
disk has room for 9 more shards, so subsequent shards route off the
.ecx-owning disk into the HDD-fallback tier.

Drop the trailing divide and return the count directly in shard slots.
Same shape, finer granularity; tie-breaks still order by free count.
The unlimited branch's "used" calculation is updated to match (mix
volume-slots and shard-slots in shard units). Added
TestFindEcShardTargetLocation_TightProvisioningKeepsEcxDisk as the
regression.

Reported in PR #9245 review by @coderabbitai.
2026-04-27 15:59:57 -07:00
Lars Lehtonen
1add6cbbca chore(weed/topology): prune unused functions (#9249) 2026-04-27 15:54:52 -07:00
dependabot[bot]
4d8ddd8ded build(deps): bump aquasecurity/trivy-action from 0.35.0 to 0.36.0 (#9248) 2026-04-27 14:03:15 -07:00
Lisandro Pin
2c404f66bc Export file_read_invalid_needles metric for REST read requests on invalid file IDs. (#9241)
Provides a straightforward metric to count read requests with incorrect file/needle IDs,
which can indicate client issues.

Note that the metric does not cover gRPC calls, as the current proto service API
does not support seeking files by ID.
2026-04-27 12:22:42 -07:00
Chris Lu
ac3a756dae test(s3-retention): force-drop collection after deleteBucket to free volumes
COMPLIANCE-mode retention leaves objects that BypassGovernanceRetention cannot
clear, so the test's DeleteBucket keeps returning BucketNotEmpty and the
underlying SeaweedFS collection (with its 7 reserved volumes) leaks. After a
few leaks on the single-node `weed mini` server, the master logs "Not enough
data nodes found" and every subsequent PutObject 500s, timing the suite out.

Call the master's /col/delete admin endpoint from deleteBucket so the
collection's volumes are reclaimed even when S3-level cleanup is blocked.
2026-04-27 12:12:36 -07:00
Chris Lu
3f5b4814b7 fix(kafka): evict expired group members on rejoin to unblock fast restart (#9243)
fix(kafka): evict expired group members on JoinGroup and during cleanup

Phantom members lingering past their session timeout were blocking
TestOffsetManagement/ConsumerGroupResumption: the cleanup goroutine
runs every 30s but session timeouts can be as short as 6s, so a lost
LeaveGroup or ungraceful disconnect could leave a dead member holding
the leader slot or a partition until the next cleanup tick. New
consumers rejoining within that window saw stale group composition
and burned all 5 test-level join retries (~118s, generation churned to
22 in the failing CI run).

Add EvictExpiredMembersLocked on GroupCoordinator and call it from
handleJoinGroup so each rejoin sees a fresh member list. Refactor
performCleanup to share the helper, and on either path move surviving
members to PreparingRebalance, bump generation, clear cached
assignments (so the non-leader SyncGroup path returns
REBALANCE_IN_PROGRESS instead of serving stale partitions), re-elect
a leader from survivors when needed, and rebuild SubscribedTopics so
topics only the evicted member subscribed to are dropped.
2026-04-27 09:43:55 -07:00
Chris Lu
fa492a9eed fix(admin): wrap plugin URLs with basePath for subdir deployments
Two more spots that broke under a subdirectory deployment:

- plugin.templ pluginRequest() called fetch(url) with relative API
  paths from 14+ callers; wrap once inside the helper so they all
  honor window.__BASE_PATH__.
- plugin_lane.templ generated <a href="/plugin/configuration?job=...">
  with an absolute path; wrap with basePath() so the link stays
  inside the deployment prefix.

Follow-up to a6adf530c.
2026-04-27 09:04:12 -07:00
Chris Lu
3ea489d013 fix(admin): wrap plugin lane fetch URL with basePath
Plugin lane page fetches API endpoints with raw absolute URLs, breaking
deployments under a subdirectory. Wrap the fetch URL with basePath() so
window.__BASE_PATH__ is honored, matching other admin pages.

Addresses https://github.com/seaweedfs/seaweedfs/issues/9240
2026-04-27 09:00:29 -07:00
Chris Lu
7f770b1553 fix(filer): return 503 + Retry-After when remote object not cached yet (#9236)
* extend cache-not-ready handling to filer HTTP path

Mirror the s3api change for the native filer HTTP handlers. When the
filer GET hits a remote-only object whose cache fill hasn't completed,
return 503 Service Unavailable with Retry-After: 5 instead of 500
Internal Error, and treat client disconnects as silent cancellations
rather than logging them as errors.

Adds an ErrCacheNotReady sentinel and a small helper used at the
prepareWriteFn-error sites in ProcessRangeRequest, so the same
classification (cancel / not-ready / other) applies to plain GETs,
single-range, and multi-range requests.

* clear Content-Range on prepareWriteFn error

The single-range path sets Content-Range before calling prepareWriteFn.
If prepareWriteFn fails, http.Error is about to write a fresh body for
503 or 500, but the stale Content-Range header would still go out and
no longer match. Drop it alongside Content-Length in the shared helper
so all current and future callers are covered.

* strip success-path headers and forward NotFound on prepareWriteFn error

When ProcessRangeRequest writes an error response, the previously-set
success headers (Content-Disposition, ETag, Last-Modified, in addition
to Content-Length/Content-Range) shouldn't ride along on the new body.
With ?dl=1 a stale Content-Disposition would even cause browsers to
save the error message under the object's filename. Strip them all in
the shared helper.

Also forward filer_pb.ErrNotFound through the cache-failure branch so a
mid-cache entry deletion surfaces as 404, not as a 503 retry-loop.
Permanent upstream cloud errors (403/404 from the cloud SDK) still come
back as opaque wrapped strings via FetchAndWriteNeedle and remain
mapped to 503; distinguishing those would need a wider refactor.
2026-04-27 01:58:33 -07:00
Chris Lu
4c4d53ce23 fix(seaweed-volume): accept redb aliases for --index (#9237)
fix(seaweed-volume): accept redb aliases for --index and rename kinds

The Rust volume server's disk-backed index uses redb internally
(see RedbNeedleMap), but --index only accepted the legacy `leveldb`
spellings, contradicting the wiki and forcing users to read source to
figure out what value to pass.

- --index now accepts memory|redb|redbMedium|redbLarge as the canonical
  names, with leveldb/leveldbMedium/leveldbLarge kept as aliases.
- Rename NeedleMapKind variants LevelDb*->Redb* so the in-tree names
  match the actual backend.
- Update help text and add a parse-table test covering both names.

Refs #9234.
2026-04-27 01:44:40 -07:00
Chris Lu
045ace29d5 fix(seaweed-volume): parse host:port.grpcPort in master address (#9235)
The Go ServerAddress format encodes an optional explicit gRPC port as
host:port.grpcPort. The Rust heartbeat client only handled host:port
(falling back to port+10000), so feeding it host:port.grpcPort yielded
a malformed gRPC target like "host:port.grpcPort", which manifests as
checkWithMaster transport errors.

Mirror pb.ServerToGrpcAddress(): if the part after the last ':' contains
a '.' followed by a valid u16, treat that suffix as the explicit gRPC
port; otherwise keep the +10000 default.

Refs #9234.
2026-04-27 01:44:11 -07:00
Chris Lu
a2639b533e fix(s3api): return 503 + Retry-After when remote object not cached yet (#9233)
* fix(s3api): return 503 with Retry-After when remote object not cached yet

When a GET hits a remote-only object whose cache fill timed out or was
canceled, the handler returned 500 InternalError. SDK clients treat 500
as a server bug and surface it as a fatal error (boto3
S3DownloadFailedError), even though the cache is often still filling in
the background and the next request would succeed.

Return 503 ServiceUnavailable with Retry-After: 5 instead, matching
AWS S3's "try again later" semantics. AWS SDKs already classify 503 as
retryable and apply exponential backoff transparently, so clients
recover without changes.

Refs https://github.com/seaweedfs/seaweedfs/discussions/9174

* treat client cancel as cancellation, not 503

If r.Context() is already canceled when the cache attempt returns no
chunks, the cache failure is almost certainly a side-effect of the
client disconnecting, not real backpressure. Surface the context error
so GetObjectHandler logs at V(3) and skips writing a response, instead
of synthesizing a 503 that nobody will read.

Addresses Gemini review feedback on #9233.

* simplify comments
2026-04-27 01:30:39 -07:00
Chris Lu
503b6f2744 fix(seaweed-volume): ceil EC shard slots in maybe_adjust_volume_max (#9232)
Mirrors the volume-server side of seaweedfs/seaweedfs#9196: compute the
EC-shard contribution to maxVolumeCount with proper ceiling division
((N + D - 1) / D) instead of (N + D) / D, which over-counts by one slot
whenever the per-location EC-shard count is zero or an exact multiple of
DataShardsCount (10). The most common case -- a location with no EC
shards -- silently inflated maxVolumeCount by 1 on every recalculation.

The matching low-disk effective_max_count path in heartbeat.rs already
uses the correct ceiling form, and the master-side topology changes from
that PR have no Rust counterpart.
2026-04-26 22:31:56 -07:00
qzh
21fadf5582 fix(shell): correct volume.list -writable filter unit and comparison (#9231)
* fix(shell): correct volume.list -writable filter unit and comparison

* fix(shell): correct volume.list -writable filter unit and comparison
2026-04-26 22:20:46 -07:00
Chris Lu
0b3cc8d121 4.22 4.22 2026-04-26 21:06:39 -07:00
Chris Lu
6cbcdf488c chore(mount,fuse-test): diagnostics for FUSE ConcurrentReadWrite ENOENT flake
PR #9230 attempt 1 hit an intermittent
TestConcurrentFileOperations/ConcurrentReadWrite failure where stat
returned ENOENT for a path all writers had just succeeded against, and
the captured mount.log carried no signal about which layer dropped the
entry because the relevant lookup logged at V(4).

Two diagnostic-only changes (no behavior change on the happy path):

- weed/mount/weedfs.go: in lookupEntry, when filer GetEntry returns
  ErrNotFound for a path whose inode is still tracked locally with no
  in-flight create or flush, log Warningf with inode + dirtyHandle +
  pendingFlush + localCache + dirCached. This surfaces layer-by-layer
  state at the moment of the suspicious ENOENT.

- test/fuse_integration/framework_test.go: on AssertFileExists failure,
  dump five 100ms-spaced stat retries, a parent ReadDir, and a direct
  O_RDONLY open before failing. Triangulates kernel dentry caching vs
  mount lookup vs filer state.
2026-04-26 16:57:37 -07:00
Chris Lu
c934b5dab6 fix(credential/postgres,s3api/iam): rename safety + pgxutil follow-ups to #9226 (#9230)
* refactor(util): extract pgx OpenDB + DSN builder into shared pgxutil

The postgres filer store had OpenPGXDB plus duplicated key=value DSN
assembly across postgres/ and postgres2/. Move the connection helper to
weed/util/pgxutil and add BuildDSN so the credential postgres store can
land on the same code path.

filer/postgres/pgx_conn.go keeps OpenPGXDB as a thin alias so postgres2
keeps building unchanged.

* refactor(credential/postgres): use shared pgxutil for connection setup

Replace the bespoke fmt.Sprintf DSN + sql.Open("pgx", ...) path with
pgxutil.BuildDSN + pgxutil.OpenDB so the credential store mirrors the
postgres filer store. This also drops the leaky RegisterConnConfig-style
init in favor of stdlib.OpenDB(*config), which doesn't accumulate
entries in the global pgx config map.

Adds parity knobs the filer store already exposes: sslcrl, and
configurable connection_max_idle / connection_max_open /
connection_max_lifetime_seconds (with the previous hardcoded 25/5/5min
as defaults). Also moves the jsonbParam helper here so other store
files can reuse it. (Helper is also referenced by postgres_identity.go,
which is migrated to it in the next commit.)

* refactor(credential/postgres): use jsonbParam helper across all writers

Consolidate JSONB write handling on the new pgxutil-adjacent helper
jsonbParam(b []byte) interface{}, which returns nil (driver writes SQL
NULL) when the marshaled JSON is empty and string(b) otherwise.

postgres_identity.go: replace the inline 'var fooParam any' /
'fooParam = string(b)' pattern with the helper. Same in CreateUser
and UpdateUser.

postgres_inline_policy.go, postgres_policy.go, postgres_service_account.go,
postgres_group.go: every JSONB writer was still passing []byte. Under
pgx simple_protocol (pgbouncer_compatible=true), []byte is encoded as
bytea and Postgres rejects that against a JSONB column with "invalid
input syntax for type json". Route them through jsonbParam too.

* fix(credential/postgres): rework SaveConfiguration to handle rename + UNIQUE access keys

The IAM rename path (s3api UpdateUser) renames an identity in place
and keeps its access keys. With the previous flow — upsert each user,
then per-user delete-and-insert credentials, then prune absent users —
the renamed user's access keys were still owned by the old row when
the INSERT for the new name ran, tripping credentials.access_key's
global UNIQUE constraint and failing every rename of a user with
credentials.

Reorder the SaveConfiguration body so the prune step runs BEFORE the
credential replace. CASCADE on the old user releases its access keys
in the same transaction, and the new name can then claim them.

While here:
- Replace the per-user loop DELETE FROM users WHERE username = $1 with
  a single DELETE ... WHERE username = ANY($1), one round trip instead
  of N inside the transaction.
- Surface inline-policy CASCADE losses: count user_inline_policies for
  the prune set and emit a Warningf when the count is non-zero so
  rename-driven drops are visible in operator logs (the structural
  fix for renames lives at the IAM layer in a follow-up commit).
- Two-pass credential replace: clear credentials for every user we are
  about to rewrite first, then insert, so an access key can be moved
  between two users in the same SaveConfiguration call.
- credErr := credRows.Err() before credRows.Close() in
  LoadConfiguration — Err() is documented as safe after Close, but
  the leading-capture pattern matches the rest of the file.

* fix(s3api/iam): preserve inline policies when renaming a user

EmbeddedIamApi.UpdateUser renames an identity in place and the caller
persists via SaveConfiguration, which prunes the old username and
CASCADE-drops its rows from user_inline_policies. GetUserPolicy and
ListUserPolicies then return nothing under the new name even though
the API reported success — silent data loss.

Before flipping sourceIdent.Name, list the user's stored inline
policies and re-attach each one under the new name. The subsequent
SaveConfiguration prune still CASCADE-removes the old-name rows; only
the duplicates we just wrote under the new name survive. Adds a
regression test that puts a policy on the old name, renames, and
asserts the policy is readable under the new name.

* perf(credential/postgres): batch the credential clear in SaveConfiguration

The two-pass credential replace was clearing each incoming user's
credentials with its own DELETE statement — N round-trips inside the
transaction. Match the pattern already used for the user prune and
issue a single DELETE FROM credentials WHERE username = ANY($1)
instead.

* refactor(s3api/iam): plumb context through UpdateUser

UpdateUser was synthesizing a fresh context.Background() inside the
inline-policy migration block, which discards the request deadline,
cancellation, and tracing carried by the caller. Add ctx as the first
parameter and pass r.Context() in via the ExecuteAction dispatcher,
mirroring the signature already used by CreatePolicy /
AttachUserPolicy / DetachUserPolicy.

* fix(util/pgxutil): quote DSN values per libpq rules

BuildDSN was concatenating values directly, so any password / cert path
/ database name with a space, single quote, or backslash produced a
malformed connection string and pgx.ParseConfig either errored or
mis-parsed the remainder. Critical now that the helper is shared with
the credential store: mTLS deployments routinely sourcing passwords or
secret-mounted cert paths from a vault are exactly the case where
spaces and quotes show up.

Add quoteDSNValue: empty values and values containing whitespace, `'`,
or `\` are wrapped in single quotes with `'` and `\` escaped per
PostgreSQL libpq rules; plain alphanumeric values pass through
unchanged. Apply it to every variable field in BuildDSN.

Adds a test that round-trips a password containing spaces, quotes and
backslashes through pgx.ParseConfig and confirms the parsed Config
matches the input.

* fix(credential,s3api/iam): atomic UserRenamer to avoid FK violation on rename

The previous IAM rename path called PutUserInlinePolicy(newName, ...)
before SaveConfiguration created the new users row. user_inline_policies
has a non-deferrable FOREIGN KEY (username) REFERENCES users(username),
which Postgres validates at statement time, so every rename of a user
that owned at least one inline policy failed with an FK violation. The
existing memory-store regression test missed it because the memory
backend has no FK enforcement.

Add an optional credential.UserRenamer interface plus a
CredentialManager.RenameUser thin shim that returns (supported, err).

Implement it on PostgresStore as an atomic in-transaction migration:
INSERT the new users row by SELECT-copying from the old, UPDATE
credentials.username and user_inline_policies.username to the new
name (FK satisfied because both rows now exist), then DELETE the old
row. ErrUserNotFound / ErrUserAlreadyExists are surfaced cleanly.

Implement it on MemoryStore by re-binding store.users / store.accessKeys
/ store.inlinePolicies under the new name. Also fixes a small leak in
DeleteUser, which was forgetting to drop the user's inline-policy
bucket.

EmbeddedIamApi.UpdateUser now calls RenameUser first; if the store
implements the interface, that's the whole migration. If it doesn't
(stores without FK enforcement), fall back to the previous
list / get / put copy.

Adds a focused test for MemoryStore.RenameUser that asserts the
identity, the access-key index, and the inline policies all land
under the new name.
2026-04-26 16:31:53 -07:00
Chris Lu
4f628ff4e5 fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908) (#9228)
* fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908)

buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY
chunk upfront, then walked them with io.MultiReader. For a multipart
SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob
with 25+ chunks), N volume-server bodies sat live at once; chunks
1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent
load the volume server's keep-alive logic closed those idle responses
mid-flight, and the S3 client saw `unexpected EOF` partway through the
GET. Truncated bytes hash to the wrong SHA-256, which is exactly the
"Digest did not match" symptom Docker Registry reports in #8908 (and
which persisted even after the per-chunk metadata fix in #9211 and the
completion backfill in #9224).

Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk,
wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for
the SSE-specific decryption setup. Per-chunk metadata is still
validated UPFRONT so a malformed chunk fails fast without opening any
HTTP connection -- the eager validation contract callers and tests
rely on is preserved. The volume-server GET and the SSE-specific
decrypt wrap, however, fire LAZILY: at most one chunk body is live at
any time, regardless of object size.

This commit applies the new pattern to buildMultipartSSES3Reader only;
the SSE-KMS and SSE-C multipart readers retain their eager form for
now and will be migrated in follow-up commits, since the same shape
exists there too.

Tests:
  - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract:
    zero chunks opened at construction, peak liveness == 1, all closed
    after drain.
  - TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch
    (replaces ClosesAppendedOnError) asserts a malformed chunk in
    position N causes zero fetches for chunks 0..N -- the previous
    test pinned a weaker contract (cleanup after eager open).
  - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same
    reason: the fetch callback must NOT be invoked at all on a bad-IV
    chunk.
  - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts
    encrypted the way putToFiler writes them (shared DEK + baseIV,
    partOffset=0, post-completion global offsets) and walks them
    through buildMultipartSSES3Reader.

* fix(s3api): stream multipart SSE-KMS chunks lazily

Apply the same fix as the previous commit to
createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata
is validated upfront, but volume-server GETs fire lazily through
lazyMultipartChunkReader. At most one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The wire format on disk is unchanged (same per-chunk metadata, same
encrypted bytes, same object Extended attributes). Existing SSE-KMS
multipart objects read back identically -- only when the volume-server
GETs fire changes.

* fix(s3api): stream multipart SSE-C chunks lazily

Apply the same fix as the previous two commits to
createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is
validated upfront (IV decode, IV length check, non-negative
PartOffset), but the volume-server GET and CreateSSECDecryptedReader-
WithOffset wrap fire lazily through lazyMultipartChunkReader. At most
one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-C multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The pre-existing TODO note about CopyObject SSE-C PartOffset handling
is preserved verbatim. The wire format on disk is unchanged (same
per-chunk metadata, same encrypted bytes); existing SSE-C multipart
objects read back identically.

After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS,
SSE-C) share lazyMultipartChunkReader as their streaming engine.

* test(s3): add Docker Registry-shape multipart SSE-S3 GET regression

Pin the end-to-end fix for #8908 with a test that mirrors what Docker
Registry actually does on pull: a 25-part * 5MB upload with bucket-
default SSE-S3, then a full GET, then SHA-256 over the streamed body
must match SHA-256 over the uploaded bytes.

The eager-multipart-reader bug was specifically a streaming truncation
under load: the response status was 200 with a Content-Length matching
the object size, but the body short-circuited mid-stream because
later chunks' volume-server connections had already been closed by
keepalive. The hash check is the symptom Docker Registry surfaces
("Digest did not match"), so this is the most faithful regression we
can pin without spinning up a registry.

uploadAndVerifyMultipartSSEObject already byte-compares the GET body,
but hashing on top is intentionally explicit -- it documents WHY the
test exists, and matches the failure mode reported in the issue.

* test(s3): add range-read coverage matrix across SSE modes and sizes

Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB)
single-chunk objects, with one ad-hoc range case per SSE mode and one
129-byte boundary-crossing case in TestSSEMultipartUploadIntegration.
Nothing exercised:

  - Range reads on single-PUT objects whose content crosses the 8MB
    internal chunk boundary (medium size class).
  - Range reads on multipart objects whose parts each span multiple
    internal chunks (large size class) -- the shape #8908 originally
    surfaced for full-object GETs and the most likely site of any
    future regression in per-chunk IV / PartOffset plumbing for
    partial reads.
  - A consistent range-pattern set applied uniformly across SSE modes,
    so any divergence between modes (SSE-C uses random IV + PartOffset;
    SSE-S3/KMS use base IV + offset) is comparable at a glance.

TestSSERangeReadCoverageMatrix introduces a parameterized matrix:

  modes:     no_sse, sse_c, sse_kms, sse_s3
  sizes:     small (256KB single chunk),
             medium (12MB single PUT crossing one internal boundary),
             large (5x9MB multipart, ~10 internal chunks, every part
                    itself spans an 8MB boundary)
  ranges:    single byte at 0, prefix 512B, single byte at last,
             suffix bytes=-100, open-ended bytes=N-, whole object,
             AES-block boundary 15-31, mid straddling one internal
             boundary (medium+large), mid spanning many internal
             boundaries (large only)

Per case it asserts: body bytes equal the expected slice, Content-Length
matches the range length, Content-Range matches start-end/total, and the
SSE response headers match the mode.

The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's
the remaining sse_kms subtests with a clear reason if the local server
has no KMS provider configured -- the default `weed mini` setup lacks
one; the Makefile target `test-with-kms` provides one via OpenBao. Other
modes always run.

Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against
weed mini, sse_kms cleanly skipped.

* test(s3): conform new test names to TestSSE*Integration so CI runs them

The two tests added in the previous commits had names that did NOT match
the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml
use to discover SSE integration tests:

  - test/s3/sse/Makefile `test` target:           TestSSE.*Integration
  - test/s3/sse/Makefile `test-multipart`:        TestSSEMultipartUploadIntegration
  - .github/workflows/s3-sse-tests.yml:           ...|.*Multipart.*Integration|.*RangeRequestsServerBehavior

Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and
the Docker-Registry-shape multipart regression in
TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible
to CI even though the underlying test setup (start-seaweedfs-ci using
s3-config-template.json with the embedded `local` KMS provider) already
has SSE-KMS configured.

Renames:

  TestSSERangeReadCoverageMatrix              -> TestSSERangeReadIntegration
  TestSSES3MultipartManyChunks_...            -> TestSSEMultipartManyChunksIntegration

Both names now match `TestSSE.*Integration` (Makefile `test` target) and
TestSSEMultipartManyChunksIntegration additionally matches
`.*Multipart.*Integration` (CI's comprehensive subset). No behavior
change; only the function names move.

Verified locally against `weed mini` with s3-config-template.json:
TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes
(none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns,
all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only
fires for ad-hoc local setups that don't load any KMS provider; the
project's standard test setup loads the local provider, so CI has full
SSE-KMS range coverage.

* fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch

Addresses CodeRabbit review on PR #9228: in
createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata
was deserialized in the prep loop but the IV length was only validated
later, inside CreateSSEKMSDecryptedReader, which runs from the wrap
closure -- AFTER the chunk's volume-server fetch has already started.
That weakens the new "reject malformed chunks before any fetch" contract
for SSE-KMS specifically: a chunk with a missing/short/long IV would
fire its HTTP GET, then fail mid-stream during decrypt.

The fix moves the existing ValidateIV check into the prep loop, matching
the SSE-S3 and SSE-C paths.

Drive-by: extract the SSE-KMS prep loop into a free
buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader,
so the new contract is unit-testable without an S3ApiServer. The
exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a
thin caller, so behavior for production callers is unchanged.

New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract:

  - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers
    missing IV, empty IV, short IV, long IV. Each case asserts both
    that an error is returned AND that the fetch callback is never
    invoked.
  - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch
    pins the analogous behavior when SseMetadata is nil on a chunk in
    position N: chunks 0..N-1 must not be fetched (the earlier eager
    implementation depended on a closeAppendedReaders cleanup path; the
    new contract is stronger -- nothing is opened in the first place).
  - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch
    covers the JSON-unmarshal failure branch.
  - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the
    documented sort-by-offset contract by recording the order in which
    fetch is invoked.

All four pass under `go test ./weed/s3api/`. Existing weed/s3api unit
suite + the SSE integration suite (with the local KMS provider enabled
via s3-config-template.json) continue to pass.

* test(s3): address CodeRabbit nitpicks on range coverage matrix

Three small follow-ups on the range-read coverage matrix from the
previous commit, per CodeRabbit nitpicks on PR #9228:

1. Promote the body-length check from `assert.Equal` to `require.Equal`
   so a truncation regression -- the canonical #8908 failure mode --
   aborts the subtest immediately. Previously the assertion logged a
   length mismatch and then `assertDataEqual` ran on differently-sized
   slices, producing a noisy byte-diff on top of the actual symptom.
   The redundant trailing `t.Fatalf` block becomes dead and is removed.

2. Broaden the SSE-KMS probe-skip heuristic. The probe previously
   produced the friendly "KMS provider not configured" message only
   for 5xx responses; KMS-misconfig surfaces also include 501
   NotImplemented, 4xx KMS.NotConfigured, and error messages
   containing "KMS.NotConfigured" / "NotImplemented" /
   "not configured". The behaviour change is purely cosmetic (the
   caller t.Skip's on any non-empty reason either way) but the new
   diagnostic is more useful in CI logs.

3. Add `t.Parallel()` at the mode and size-class levels of the matrix.
   Each (mode, size) writes an independent object key under the shared
   bucket, with no cross-talk, so parallel execution is safe. Local
   wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%);
   the savings scale with chunk count and CI machine concurrency.

Verified locally against `weed mini` with s3-config-template.json:
  - go test ./weed/s3api/ -count=1                   PASS
  - TestSSERangeReadIntegration -v                   112 PASS, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.           PASS

* fix(s3api): tighten lazy reader error path; unify SSE IV validation

Three CodeRabbit nitpicks on PR #9228:

1. lazyMultipartChunkReader: mark finished on non-EOF Read errors

   The Read loop's three earlier failure paths (chunk index past end,
   fetch error, wrap error) all set l.finished = true before returning.
   The non-EOF Read path -- where l.current.Read itself errors mid-chunk
   -- did not, leaving l.current/l.closer set and l.finished = false. A
   caller that retried Read after an error would re-enter the same
   broken stream instead of advancing or giving up. Set l.finished =
   true on non-EOF Read error so post-error state is consistent across
   all four failure sites; Close() (which the GetObjectHandler defers)
   still releases the chunk body.

2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths

   The previous commit moved SSE-KMS to the shared ValidateIV helper
   but left SSE-S3 and SSE-C with bespoke inline `len(...) !=
   AESBlockSize` checks. All three are enforcing the same invariant;
   inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to
   ValidateIV too, with the same `<algo> chunk <fileId> IV` name
   convention. Error message wording shifts from "<algo> chunk X has
   invalid IV length N (expected 16)" to ValidateIV's "invalid <algo>
   chunk X IV length: expected 16 bytes, got N". The substring
   "IV length" is preserved across both, so the existing
   TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion
   is loosened to match either form.

3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering

   The test previously drove Read() to observe fetch-call order, but
   CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap
   the encrypted DEK -- unavailable in unit tests -- so the wrap
   closure failed on the first chunk and only one fetch was ever
   recorded. The test asserted only fetchOrder[0] == "c0", which is
   weaker than the comment promised.

   Switch to a static check: type-assert the returned reader to
   *lazyMultipartChunkReader (same package so unexported fields are
   accessible) and inspect the prepared chunks slice directly. This
   pins the entire [c0, c1, c2] sort order in one place, doesn't
   depend on KMS, and runs in zero fetch calls. The fetch closure
   now asserts it is never invoked during preparation.

All weed/s3api unit tests pass; integration suite (with KMS provider
configured via s3-config-template.json) passes.

* test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe

Two CodeRabbit comments on PR #9228, both about
test/s3/sse/s3_sse_range_coverage_test.go:

1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration

   The test creates one bucket up front, then runs subtests that call
   t.Parallel() at the mode and size levels (added in 058cbf27 to cut
   wall time). t.Parallel() pauses each subtest and yields back to the
   parent. The parent's for loop finishes scheduling, the function
   returns, and the deferred cleanupTestBucket fires -- BEFORE any
   parallel subtest body has executed. The bucket gets deleted out
   from under the parallel subtests, which then race the cleanup and
   either fail with NoSuchBucket or, depending on lazy-deletion
   behaviour on the server side, mask other regressions because
   chunks happen to still be readable for a brief window.

   The local matrix passing prior to this commit was a server-side
   coincidence; the t.Cleanup contract is the right one for parent
   tests with parallel children, and switching to it is a one-line
   change. t.Cleanup runs after the test AND all its (parallel)
   subtests complete, so the bucket survives until every leaf
   subtest is done.

2. MINOR: tighten the SSE-KMS probe-skip heuristic

   The previous broadening (058cbf27) treated `code == 400` as
   "KMS provider not configured", on the theory that some servers
   return 4xx for KMS misconfig. That is too aggressive: a real
   misconfiguration in the SSE-KMS test request itself (bad keyID
   format, missing header) ALSO surfaces as a 400, and would
   silently t.Skip the SSE-KMS subtree in CI -- which is exactly
   the integration coverage the new TestSSERangeReadIntegration is
   supposed to add. Drop the 400 branch (and the redundant 501
   match, since 501 >= 500 already covers it). Genuine
   "KMS.NotConfigured" / "NotImplemented" responses are still
   recognised via the string-match block immediately below,
   regardless of status code, so the friendly skip message survives
   for the cases where it actually applies.

Verified locally against `weed mini` with s3-config-template.json:

  - go test ./weed/s3api/                     PASS
  - TestSSERangeReadIntegration -v            113 PASS lines, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.    PASS
2026-04-26 16:31:42 -07:00
Jon E Nesvold
dc462a80d7 feat(credential/postgres): inline policies, mTLS and pgbouncer connection support (#9226)
* feat(credential/postgres): mTLS + pgbouncer support, InlinePolicyStore implementation, upsert SaveConfiguration

* fix(credential/postgres): add rows.Err() checks, inline policy tests, memory store LoadInlinePolicies

* fix(credential/postgres): cast JSONB params to string for pgbouncer simple protocol

* fix(credential/postgres): wrap tx.Commit errors with context

* fix(credential/postgres): use any type for JSONB params to preserve SQL NULL for nil fields
2026-04-26 14:54:53 -07:00
Parviz Miriyev
f407bdaa36 fix(admin): use TLS-aware HTTP client for /dir/status fetch (#9227)
fetchPublicUrlMap() in weed/admin/dash/cluster_topology.go uses a
dedicated &http.Client{} that doesn't honor security.toml client TLS
configuration, and hardcodes "http://" in the URL. When master is
configured HTTPS-only ([https.master] set), every cluster topology
cache refresh logs:

  NOTICE: http: TLS handshake error from <admin-ip>:<port>: client
  sent an HTTP request to an HTTPS server

The function falls through to glog.V(1).Infof and returns nil, so the
admin UI loses PublicUrl enrichment for data nodes. Cosmetic but noisy.

Switch to util_http.GetGlobalHttpClient() whose Do() calls
NormalizeHttpScheme(), which automatically rewrites http:// to https://
when [https.client] is enabled and presents the configured client cert.
Preserve the 5-second timeout via context.WithTimeout().

Same pattern as weed/admin/handlers/file_browser_handlers.go,
weed/server/master_server.go, weed/shell/command_volume_fsck.go.
2026-04-26 12:25:53 -07:00
Chris Lu
0716577ec8 fix(upload): rewind request body when retrying on connection reset (#9139) (#9222)
* fix(upload): rewind request body when retrying on connection reset (#9139)

When httpClient.Do() returned "connection reset by peer" or "use of
closed network connection", upload_content retried with the same
*http.Request. But the body is a *bytes.Reader the first attempt
already consumed, so the retry sent 0 bytes and Go's transport
surfaced "http: ContentLength=N with Body length 0".

http.NewRequestWithContext populates req.GetBody for *bytes.Reader
bodies; use it to attach a fresh body before retrying.

Reproduces the issue with a unit test (asserts both attempts see
the same payload bytes); the test fails without the fix.

* upload: skip inner retry when body cannot be rewound

Per review feedback: if req.GetBody is nil or returns an error, the
inner retry would call Do(req) with an already-consumed body and the
"connection reset" error would be replaced by the misleading
"ContentLength=N with Body length 0" — the very symptom this PR set
out to fix. Skip the inner retry on rewind failure and let the outer
retriedUploadData loop reissue with a fresh request, and log when
GetBody is unavailable for observability.

* upload: log the actual transport error in the inner retry log line

Per review feedback: the diagnostic glog at the top of the inner
retry branch was logging postErr — the request-construction error
from http.NewRequestWithContext, which is necessarily nil there
because the function returns early at line 423 if it isn't.
Operators were seeing "<nil>" instead of the transient transport
error that triggered the rewind. Reference post_err so the
connection-reset / closed-connection cause is actually visible.
2026-04-26 02:17:55 -07:00
Chris Lu
654292b57d fix(volume): cap leveldb OpenFilesCacheCapacity per index DB (#9139) (#9223)
* fix(volume): cap leveldb OpenFilesCacheCapacity per index DB (#9139)

The leveldb opt.Options for NeedleMapLevelDb / Medium / Large never
set OpenFilesCacheCapacity, so each leveldb instance defaulted to
goleveldb's 500. On servers with thousands of volumes, that ceiling
stacks across DBs and exhausts even high ulimits, starving WAL
rotation:

  failed to write leveldb: open .../000006.log: too many open files

CompactionTableSizeMultiplier=10 already keeps the SST count low,
so a small per-DB cache is sufficient. Cap at 16 / 32 / 64 for the
small / medium / large variants so per-DB FD usage is bounded.

* storage: hoist leveldb FD-cap values into named constants

Per review feedback: replace the inline 16/32/64 literals with
LevelDb{,Medium,Large}OpenFilesCacheCapacity, and move the rationale
(why 500 is too high per-DB on busy servers, what the tradeoff is)
into a package-level comment so future readers see the memory vs.
performance picture at the constant declaration instead of inline.
2026-04-26 02:15:15 -07:00
Chris Lu
525900dfe4 fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224)
* fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion

When a part of an SSE-S3 multipart upload lands with SseType=NONE on
its chunks (e.g. a transient failure to apply SSE-S3 setup in
PutObjectPart), the completed object inherits NONE-tagged chunks and
detectPrimarySSEType then misses the chunked SSE-S3 encryption. The
read path falls through to the unencrypted serve and GET returns
ciphertext, producing the SHA mismatch reported in #8908.

Recover at completion using the base IV and key data the upload
directory recorded at CreateMultipartUpload:

  - extractMultipartSSES3Info validates upload-entry metadata up
    front and hard-fails completion if the base IV or key data are
    malformed; serializing chunk metadata we then could not decrypt
    is worse than rejecting the upload.
  - completedMultipartChunk re-derives a per-chunk IV from baseIV +
    chunk.Offset (matching what putToFiler would have written) and
    serializes per-chunk SSE-S3 metadata when the chunk has no tag.
    Existing per-chunk metadata is left alone; we cannot recover an
    already-derived IV from the upload-entry alone.

The IV formula intentionally has no partNumber term: putToFiler
hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption
for every part, so each chunk's encryption IV is
calculateIVWithOffset(baseIV, chunk.Offset_part_local).
PartOffsetMultiplier is defined in s3_constants but is not consumed
by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier
+ chunk.Offset would produce IVs that fail to decrypt the bytes on
disk - a stronger failure mode than the bug being fixed. Tests pin
this:

  - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext
    runs the round trip across the encryption boundary: encrypt
    parts with CreateSSES3EncryptedReaderWithBaseIV (the call
    putToFiler uses), drop chunk metadata to reproduce #8908,
    backfill, decrypt with backfilled IV, assert plaintext intact.
  - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula
    constructs the IV the partNumber formula would produce and
    shows it does not decrypt the actual ciphertext.

This commit covers the chunk-level recovery only. The companion
fix for the object-level Extended attributes (SeaweedFSSSES3Key /
X-Amz-Server-Side-Encryption) follows separately.

* fix(s3api): backfill canonical SSE-S3 attributes onto multipart object

The previous commit ensures every chunk of an SSE-S3 multipart upload
carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct
read path can decrypt. The completed object's Extended map can still
miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal
look at:

  - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header
    detectPrimarySSEType reads on inline / small-object reads)
  - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by
    IsSSES3EncryptedInternal and by the read-path key lookup)

When a part of the upload was written by a path that did not set
those (the same #8908 race that produced the NONE chunks),
copySSEHeadersFromFirstPart finds nothing to copy and the final entry
ends up with only the multipart-init keys (SeaweedFSSSES3Encryption /
BaseIV / KeyData). The read path then mis-detects the object as
unencrypted.

applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair
from the multipart-init metadata in all three completion paths
(versioned, suspended, non-versioned), only when the keys are missing
so a healthy first part still wins. extractMultipartSSES3Info already
ran in prepareMultipartCompletionState, so the data is reused without
re-decoding.

Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill,
do-not-clobber, and nil-info no-op cases.

* fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt

decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV
(calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the
adjusted IV to CreateSSEKMSDecryptedReader, which itself runs
calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The
offset was being applied twice for any chunk with a non-zero
ChunkOffset, corrupting the keystream for range reads that cross
multipart chunk boundaries.

Pass the raw SSE-KMS key (with base IV and the original ChunkOffset
field) into CreateSSEKMSDecryptedReader so the offset is applied
exactly once, and remove the now-dead intra-block skip that was
compensating for the double adjustment.

Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment
that decrypts the same ciphertext with a deliberately double-adjusted
IV and asserts the output is corrupted, so any regression that
re-introduces the double application fails the unit test.

* test(s3): cover multipart SSE across chunk-spanning parts and ranges

Adds an integration subtest "Multipart Parts Larger Than Internal
Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that
exercises the end-to-end S3 path for the bugs fixed in this branch:

  - Two-part multipart upload with each part larger than the 8MB
    internal SeaweedFS chunk, so each part itself spans multiple
    underlying chunks.
  - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default
    SSE-S3 - the four paths multipart parts can take through the SSE
    pipeline.
  - Each subtest does a full GET (verifying every byte and the
    response Content-Length / SSE response headers) plus a 129-byte
    range read straddling the 8MB internal chunk boundary, which is
    the path that produced the SSE-KMS double-IV corruption (fix in
    the previous commit) and the SSE-S3 chunk-tag loss (fix in the
    earlier commits).

Factored the request shape behind multipartSSEOptions /
uploadAndVerifyMultipartSSEObject so all four SSE flavors share the
same upload+verify code; only the SSE-specific input/output
configuration differs per subtest.

* test(s3): abort orphan multipart uploads on test failure

Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The
helper used require.NoError after CreateMultipartUpload, UploadPart
and CompleteMultipartUpload, so a failure in any of those (or in the
later GET / range read on a still-incomplete upload) called t.Fatal
without aborting the in-flight MPU, leaving an orphan upload in the
bucket. Harmless in CI where the data dir is wiped on shutdown, but a
real annoyance when iterating locally and a textbook AWS S3 caveat in
production.

Register a t.Cleanup that calls AbortMultipartUpload unless a
"completed" flag was set right after a successful
CompleteMultipartUpload. Use context.Background for the abort call
since the parent ctx may already be cancelled at cleanup time, and
t.Logf the abort error rather than failing the test so the original
failure remains visible in the run output.
2026-04-25 23:06:37 -07:00
Chris Lu
5eead9409a fix(admin): S3 Tables CSRF token + non-empty 409 status (#9221)
* fix(admin): attach CSRF token to S3 Tables write requests

Several POST/PUT/DELETE calls in s3tables.js were sent without an
X-CSRF-Token header while the corresponding handlers in
weed/admin/dash/s3tables_management.go enforce CSRF via
requireSessionCSRFToken, so authenticated users hit "invalid CSRF token"
on actions like creating a table bucket (#9220), updating policies, and
managing tags.

Add an s3tWriteHeaders helper that pulls the token from the existing
csrf-token meta tag and use it on every write to /api/s3tables/buckets,
/bucket-policy, /tables, /table-policy, and /tags. The Iceberg-page
write paths already attached the token and are unchanged.

Fixes #9220

* fix(admin): map BucketNotEmpty/NamespaceNotEmpty to 409 for S3 Tables

DELETE on a non-empty table bucket or namespace returned HTTP 500
because s3TablesErrorStatus didn't list ErrCodeBucketNotEmpty or
ErrCodeNamespaceNotEmpty in its conflict case, even though the
backend handler emits them with 409 Conflict (matching AWS S3 Tables).
Add both codes to the existing conflict mapping.

* refactor(admin): route Iceberg S3 Tables writes through s3tWriteHeaders

Iceberg namespace/table create and Iceberg table delete were still
hand-rolling CSRF headers. Replace those blocks with the existing
s3tWriteHeaders() helper so every S3 Tables write uses the same code
path. Drop the now-unused csrfTokenInput.value population in
initIcebergNamespaces and initIcebergTables (the templ hidden inputs
have no server-rendered value, and nothing reads the input now that
the JS reads the token from the meta tag via getCSRFToken()).
2026-04-24 22:48:41 -07:00
Chris Lu
a14cbc176b debug(kafka): add restart flake diagnostics 2026-04-24 15:02:07 -07:00
Chris Lu
f1f720f5da fix(master): register EC shards per physical disk on full heartbeat sync (#9212) (#9219)
* refactor(types): add DiskId type for physical-disk identifiers

Names the uint32 physical-disk index that volume servers carry in
VolumeEcShardInformationMessage / VolumeInformationMessage, so EC shard
tracking that needs to distinguish disks within a DataNode can use a
dedicated type instead of an untyped uint32. No behaviour change.

* fix(master): register EC shards per physical disk on full heartbeat sync (#9212)

When a volume's EC shards are spread across multiple physical disks on the
same volume server (common after ec.balance / ec.rebuild on multi-disk
nodes), the volume server emits one VolumeEcShardInformationMessage per
(disk, volume) in its heartbeat. The master's DataNode.UpdateEcShards was
building a `map[VolumeId]*EcVolumeInfo` with last-write-wins, and
doUpdateEcShards then overwrote `disk.ecShards[vid]` once per message, so
all but the final disk's shards were silently dropped. Only the
topology-global ecShardMap (built via RegisterEcShards in a per-message
loop) stayed correct, which hid the problem from `topo.LookupEcShards`
but broke everything that reads the DataNode/Disk view — volume.list,
admin UI, ec.rebuild dry-run ("only 6 shards, skipping"), and
`DiskInfo.EcShardInfos` which the shell's ec.balance / ec.rebuild
planners group by `eci.DiskId`.

Change the shape of `Disk.ecShards` from
    map[VolumeId]*EcVolumeInfo
to
    map[VolumeId]map[types.DiskId]*EcVolumeInfo

so every physical disk keeps its own entry. UpdateEcShards aggregates
incoming messages by (vid, diskId) rather than vid alone; Add/Delete/
HasVolumesById and HasEcShards consult the nested map; doUpdateEcShards
rewrites the nested structure from the aggregated map. Per-physical-disk
attribution survives through DataNode.ToDataNodeInfo ->
DiskInfo.EcShardInfos, matching the wire format the volume server
produces and what downstream admin tooling expects.

Delta sync (AddOrUpdateEcShard / DeleteEcShard) already merged via
ShardsInfo.Add, so this only affects the full-sync path that runs on
heartbeat reconnect.

Adds data_node_ec_multi_disk_test.go with two regression tests that fail
on pre-fix master:
- TestEcShardsAcrossMultipleDisksOnSameNode: volume 15 spread over 3
  disks (matches the bug report's volume-2 row); asserts every shard
  visible via LookupEcShards, DataNode.GetEcShards, and ToDataNodeInfo's
  per-disk EcShardInfos entries.
- TestEcShardsAfterRestartHeartbeat: minimal 2-disk full sync case.

* fix(topology): tighten locking around EC shard map access

Addresses review comments on #9219:

* DataNode.UpdateEcShards now holds dn.Lock for the full read-diff-write
  cycle, matching UpdateVolumes' model, so concurrent heartbeats can no
  longer interleave their getOrCreateDisk / UpAdjustDiskUsageDelta
  updates with each other. Introduces a private getEcShardsLocked helper
  for reads under the held lock; renames doUpdateEcShards to
  doUpdateEcShardsLocked for the same reason.

* DataNode.HasEcShards now takes each disk's ecShardsLock while reading
  disk.ecShards, closing a pre-existing map race with concurrent
  Add/Delete/Update writers.

* doUpdateEcShardsLocked takes each disk's ecShardsLock around the
  reset-and-rewrite so readers (GetEcShards, HasEcShards) see a
  consistent map state rather than a partially-rebuilt one.

* Disk.GetEcShards' slice-capacity hint now accounts for the nested
  per-physical-disk entries (sum of inner lengths) instead of
  underestimating by the unique-volume count.
2026-04-24 14:01:09 -07:00
Chris Lu
d65c568cbb fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests (#9218)
* fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests

DeserializeSSES3Metadata does not require an IV, and a corrupted or
legacy chunk without one would have flowed into cipher.NewCTR and
panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes
before decryption, closing the current and any already-appended chunk
readers on error.

Factor the per-chunk decryption loop out of
createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader
so it can be driven with a mock chunk fetcher, and add tests covering:
the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order
chunks) to lock in the fix from #9211; missing-IV and short-IV metadata
rejection without panic; and reader cleanup when a later chunk fails.

* address review: sort chunks copy; close encryptedStream on error

- buildMultipartSSES3Reader now sorts a copy of the chunks slice so
  callers do not observe entry.Chunks reordered (other code paths,
  e.g. ETag computation, can rely on the original order).
- createMultipartSSES3DecryptedReaderDirect now closes encryptedStream
  on the error path from buildMultipartSSES3Reader. All current
  callers pass nil, but this keeps cleanup symmetric with the
  success path.
- Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the
  input slice is not mutated.

* address review: defer single close; extend chunk-copy + IV-guard pattern

- createMultipartSSES3DecryptedReaderDirect: collapse the duplicated
  encryptedStream.Close() calls into a single nil-guarded defer so the
  error and success paths share cleanup.
- createMultipartSSECDecryptedReaderDirect,
  createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks
  instead of mutating the caller's slice, matching the SSE-S3 helper.
- createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length
  before handing it to cipher.NewCTR; a base64-decoded empty or short
  IV from malformed/corrupt metadata would otherwise panic.
- SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls
  ValidateIV before cipher.NewCTR. Note recorded in the sort comment.

* address review: close appended readers on SSE-C/SSE-KMS error paths

createMultipartSSECDecryptedReaderDirect and
createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk
reader on error and leaked any chunk readers already appended to the
local readers slice, mirroring the leak previously fixed in the SSE-S3
helper. Add the same closeAppendedReaders() closure pattern to both
functions and invoke it on every error return inside the loop so failed
requests do not leak volume-server HTTP connections.

* address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment

- Move encryptedStream.Close() to a nil-guarded defer at the top of
  createMultipartSSECDecryptedReaderDirect and
  createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on
  every return path (including error returns from inside the per-chunk
  loop), mirroring the SSE-S3 helper.
- In buildMultipartSSES3Reader, iterate sortedChunks directly instead of
  reassigning chunks = sortedChunks.
2026-04-24 13:59:23 -07:00