- Add Index Residency and Failure Model section: filer-managed vs
colocated vs index-serving-tier options, and graceful-degradation
behavior when an index is missing or stale.
- Add Cost Model section spelling out when connectors should skip the
pushdown API (small tables, low selectivity, no predicate).
- Add Security section noting that zone maps, blooms, and inverted
indexes leak information beyond the underlying object's ACL, and
scoping v1 to inherit the file's ACL with row-level security
explicitly out of scope.
- Wording fixes: clarify CONTAINS is engine-side translation of
LIKE/MATCH/array_contains; clarify the embedding column is in
Parquet but the vector index is a side file; flag local filtering
on volume servers as new infrastructure introduced by this design.
- Phase 1 "row group stats cache" was misleading — those stats already
live in the Parquet footer. The Phase 1 win is a parsed-footer cache,
not new index data.
- The previous "/table/_seaweed_index/" path nested side indexes under
the table prefix, which Iceberg orphan-file removal and snapshot
expiration would surface or attempt to clean. Move side indexes to a
separate system bucket / filer mount keyed by file identity, with the
in-table-prefix layout listed only as a fallback.
The hybrid scalar+vector flow assumed the vector index could accept
arbitrary scalar pre-filters, which most off-the-shelf IVF/HNSW
implementations do not. Document the realistic strategies (partitioned
IVF for v1, post-filter with overscan as fallback, filtered HNSW
deferred) and the planner heuristic for choosing between them.
A predicate matching millions of rows would force the server to return
millions of row refs. Default the response to range-oriented data
(file/row-group/page refs), require RequestRowIds opt-in for per-row
output, and cap returned ids via MaxRowIds with a Truncated flag.
Replace opaque Predicate []byte with a tagged (PredicateKind, bytes)
pair, naming Substrait as the canonical v1 format and Iceberg
Expression JSON as a convenience. Replace the free-form Metric string
on VectorQuery with a VectorMetric enum.
Position deletes are file-scoped and precomputable as a per-data-file
bitmap. Equality deletes are predicate-scoped, can apply to many data
files including ones added later, and must be evaluated at query time.
Document both flows separately and remove the implication that
equality-delete state can be flattened into a static bitmap.
Drop modification time from the identity tuple — it is not stable across
replication and metadata-only operations and is unnecessary alongside
ETag or Iceberg manifest fields. Order remaining fields by strength,
with Iceberg manifest identity preferred when available.
The previous wording suggested SeaweedFS would split a Parquet file
into per-column-chunk needles, which contradicts the design's
"do not modify the file" goal. Clarify that the per-column-chunk view
is a planning abstraction over byte ranges in the unchanged object.
Draft design note for keeping Parquet files compatible with existing
engines (Spark, Trino, DuckDB, PyArrow, Iceberg) while building
auxiliary side indexes and a pushdown API for SeaweedFS-aware clients.
Covers physical layout, side index types (footer cache, zone map,
page index, bloom, bitmap, B-tree, inverted, vector), pushdown API
sketch, Iceberg delete handling, and a phased rollout plan.
* fix(admin): attach CSRF token to S3 Tables write requests
Several POST/PUT/DELETE calls in s3tables.js were sent without an
X-CSRF-Token header while the corresponding handlers in
weed/admin/dash/s3tables_management.go enforce CSRF via
requireSessionCSRFToken, so authenticated users hit "invalid CSRF token"
on actions like creating a table bucket (#9220), updating policies, and
managing tags.
Add an s3tWriteHeaders helper that pulls the token from the existing
csrf-token meta tag and use it on every write to /api/s3tables/buckets,
/bucket-policy, /tables, /table-policy, and /tags. The Iceberg-page
write paths already attached the token and are unchanged.
Fixes#9220
* fix(admin): map BucketNotEmpty/NamespaceNotEmpty to 409 for S3 Tables
DELETE on a non-empty table bucket or namespace returned HTTP 500
because s3TablesErrorStatus didn't list ErrCodeBucketNotEmpty or
ErrCodeNamespaceNotEmpty in its conflict case, even though the
backend handler emits them with 409 Conflict (matching AWS S3 Tables).
Add both codes to the existing conflict mapping.
* refactor(admin): route Iceberg S3 Tables writes through s3tWriteHeaders
Iceberg namespace/table create and Iceberg table delete were still
hand-rolling CSRF headers. Replace those blocks with the existing
s3tWriteHeaders() helper so every S3 Tables write uses the same code
path. Drop the now-unused csrfTokenInput.value population in
initIcebergNamespaces and initIcebergTables (the templ hidden inputs
have no server-rendered value, and nothing reads the input now that
the JS reads the token from the meta tag via getCSRFToken()).
* refactor(types): add DiskId type for physical-disk identifiers
Names the uint32 physical-disk index that volume servers carry in
VolumeEcShardInformationMessage / VolumeInformationMessage, so EC shard
tracking that needs to distinguish disks within a DataNode can use a
dedicated type instead of an untyped uint32. No behaviour change.
* fix(master): register EC shards per physical disk on full heartbeat sync (#9212)
When a volume's EC shards are spread across multiple physical disks on the
same volume server (common after ec.balance / ec.rebuild on multi-disk
nodes), the volume server emits one VolumeEcShardInformationMessage per
(disk, volume) in its heartbeat. The master's DataNode.UpdateEcShards was
building a `map[VolumeId]*EcVolumeInfo` with last-write-wins, and
doUpdateEcShards then overwrote `disk.ecShards[vid]` once per message, so
all but the final disk's shards were silently dropped. Only the
topology-global ecShardMap (built via RegisterEcShards in a per-message
loop) stayed correct, which hid the problem from `topo.LookupEcShards`
but broke everything that reads the DataNode/Disk view — volume.list,
admin UI, ec.rebuild dry-run ("only 6 shards, skipping"), and
`DiskInfo.EcShardInfos` which the shell's ec.balance / ec.rebuild
planners group by `eci.DiskId`.
Change the shape of `Disk.ecShards` from
map[VolumeId]*EcVolumeInfo
to
map[VolumeId]map[types.DiskId]*EcVolumeInfo
so every physical disk keeps its own entry. UpdateEcShards aggregates
incoming messages by (vid, diskId) rather than vid alone; Add/Delete/
HasVolumesById and HasEcShards consult the nested map; doUpdateEcShards
rewrites the nested structure from the aggregated map. Per-physical-disk
attribution survives through DataNode.ToDataNodeInfo ->
DiskInfo.EcShardInfos, matching the wire format the volume server
produces and what downstream admin tooling expects.
Delta sync (AddOrUpdateEcShard / DeleteEcShard) already merged via
ShardsInfo.Add, so this only affects the full-sync path that runs on
heartbeat reconnect.
Adds data_node_ec_multi_disk_test.go with two regression tests that fail
on pre-fix master:
- TestEcShardsAcrossMultipleDisksOnSameNode: volume 15 spread over 3
disks (matches the bug report's volume-2 row); asserts every shard
visible via LookupEcShards, DataNode.GetEcShards, and ToDataNodeInfo's
per-disk EcShardInfos entries.
- TestEcShardsAfterRestartHeartbeat: minimal 2-disk full sync case.
* fix(topology): tighten locking around EC shard map access
Addresses review comments on #9219:
* DataNode.UpdateEcShards now holds dn.Lock for the full read-diff-write
cycle, matching UpdateVolumes' model, so concurrent heartbeats can no
longer interleave their getOrCreateDisk / UpAdjustDiskUsageDelta
updates with each other. Introduces a private getEcShardsLocked helper
for reads under the held lock; renames doUpdateEcShards to
doUpdateEcShardsLocked for the same reason.
* DataNode.HasEcShards now takes each disk's ecShardsLock while reading
disk.ecShards, closing a pre-existing map race with concurrent
Add/Delete/Update writers.
* doUpdateEcShardsLocked takes each disk's ecShardsLock around the
reset-and-rewrite so readers (GetEcShards, HasEcShards) see a
consistent map state rather than a partially-rebuilt one.
* Disk.GetEcShards' slice-capacity hint now accounts for the nested
per-physical-disk entries (sum of inner lengths) instead of
underestimating by the unique-volume count.
* fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests
DeserializeSSES3Metadata does not require an IV, and a corrupted or
legacy chunk without one would have flowed into cipher.NewCTR and
panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes
before decryption, closing the current and any already-appended chunk
readers on error.
Factor the per-chunk decryption loop out of
createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader
so it can be driven with a mock chunk fetcher, and add tests covering:
the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order
chunks) to lock in the fix from #9211; missing-IV and short-IV metadata
rejection without panic; and reader cleanup when a later chunk fails.
* address review: sort chunks copy; close encryptedStream on error
- buildMultipartSSES3Reader now sorts a copy of the chunks slice so
callers do not observe entry.Chunks reordered (other code paths,
e.g. ETag computation, can rely on the original order).
- createMultipartSSES3DecryptedReaderDirect now closes encryptedStream
on the error path from buildMultipartSSES3Reader. All current
callers pass nil, but this keeps cleanup symmetric with the
success path.
- Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the
input slice is not mutated.
* address review: defer single close; extend chunk-copy + IV-guard pattern
- createMultipartSSES3DecryptedReaderDirect: collapse the duplicated
encryptedStream.Close() calls into a single nil-guarded defer so the
error and success paths share cleanup.
- createMultipartSSECDecryptedReaderDirect,
createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks
instead of mutating the caller's slice, matching the SSE-S3 helper.
- createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length
before handing it to cipher.NewCTR; a base64-decoded empty or short
IV from malformed/corrupt metadata would otherwise panic.
- SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls
ValidateIV before cipher.NewCTR. Note recorded in the sort comment.
* address review: close appended readers on SSE-C/SSE-KMS error paths
createMultipartSSECDecryptedReaderDirect and
createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk
reader on error and leaked any chunk readers already appended to the
local readers slice, mirroring the leak previously fixed in the SSE-S3
helper. Add the same closeAppendedReaders() closure pattern to both
functions and invoke it on every error return inside the loop so failed
requests do not leak volume-server HTTP connections.
* address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment
- Move encryptedStream.Close() to a nil-guarded defer at the top of
createMultipartSSECDecryptedReaderDirect and
createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on
every return path (including error returns from inside the per-chunk
loop), mirroring the SSE-S3 helper.
- In buildMultipartSSES3Reader, iterate sortedChunks directly instead of
reassigning chunks = sortedChunks.
* fix(iam): expand arbitrary jwt:/saml:/oidc: claim variables in policies
The policy engine gated variable substitution on a fixed allowlist
(jwt:sub, jwt:iss, jwt:aud, jwt:preferred_username), so patterns like
arn:aws:s3:::softs/${jwt:project_path}/* were passed through as literals
and never matched the requested resource. Dynamic claims from OIDC
providers (e.g. GitLab CI's project_path / namespace_path) could not be
used to scope policies.
Allow any jwt:/saml:/oidc: prefixed variable to be substituted when the
claim is present in RequestContext. These values originate from a
cryptographically verified identity token (the STS session JWT or
federated assertion), and the claim names are controlled by the trusted
identity provider, so the dynamic prefix is safe. Missing claims keep
the placeholder intact so the statement still fails to match.
Numeric JWT claims (JSON-decoded as float64) are now stringified so
patterns like ${jwt:project_id} work the same as string claims.
Fixes#9214
* fix(iam): cover all integer widths in claim stringification
Address PR review: stringifyClaimValue only handled int/int32/int64 on
the signed side and nothing on the unsigned side, so int8, int16, uint,
uint8, uint16, uint32, and uint64 claim values fell through to the
default branch and the placeholder was left unsubstituted.
JSON's generic decoder produces float64/json.Number for numbers, but
RequestContext can also be populated from typed sources (custom
providers or internal code), so cover all common integer widths -
signed and unsigned - explicitly. Extend TestStringifyClaimValue to
assert each supported type.
* fix(s3api): correct SSE-S3 decryption key handling in multipart uploads
* fix(s3api): preallocate readers and close on error in SSE-S3 direct path
Address review feedback on createMultipartSSES3DecryptedReaderDirect:
preallocate the readers slice with the known chunk count, and close any
already-appended chunk readers on error returns so failed requests do
not leak volume-server HTTP connections.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* Export gRPC `file_{read,write}_failures` metrics on volume servers.
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.
* Export REST `file_{read,write}_failures` metrics on volume servers.
* fix(weed/command) address unhandled errors
* fix(command): don't log graceful-shutdown sentinels; plug response-body leak
- s3: Serve on unix socket treated http.ErrServerClosed as fatal; now
excluded like the other Serve/ServeTLS paths in this file.
- mq_agent, mq_broker: filter grpc.ErrServerStopped so clean shutdown
doesn't log as an error.
- worker_runtime: the added decodeErr early-continue skipped
resp.Body.Close(); drop it since the existing check below already
surfaces the decode error.
- mount_std: the pre-mount Unmount commonly fails when nothing is
mounted; demote to V(1) Infof.
- fuse_std: tidy panic message to match sibling cases.
* fix(mq_broker): filter grpc.ErrServerStopped on localhost listener
The localhost listener goroutine logged any Serve error unconditionally,
which includes grpc.ErrServerStopped on graceful shutdown. Match the
main listener's check so clean stops don't surface as errors.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
AWS IAM treats a bare "*" in a statement's Resource as "any resource",
but the embedded IAM resource parser required a 6-segment S3 ARN and
silently skipped anything else. With a policy like
{Action: "s3:*", Resource: "*"}, every resource was dropped and the
statement produced no actions, so PutUserPolicy rejected the document
with "no valid actions found in policy document".
Short-circuit Resource == "*" to the same full-wildcard path that
"arn:aws:s3:::*" already takes.
* fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139)
A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash
"partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every
FUSE-initiated filer RPC fail with:
rpc error: code = Internal desc = grpc: error while marshaling:
string field contains invalid UTF-8
and then produced an avalanche of "connection is closing" errors on
unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the
volume-server QPS dips reported in #9139.
Root cause is twofold:
1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes
raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/
Symlink all forwarded those bytes directly into CreateEntryRequest.Name,
DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and
Entry.Name. saveDataAsChunk also copied the FullPath into
AssignVolumeRequest.Path unchecked.
2. When the marshal failed, shouldInvalidateConnection treated the
resulting codes.Internal as a connection problem and dropped the
shared cached ClientConn — canceling every other in-flight RPC on it.
Fix:
- Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching
util.FullPath.DirAndName) and make checkName return the sanitized name.
Apply at every FUSE entry point that passes a name to the filer RPC,
including Unlink/Rmdir (which did not previously call checkName) and
both oldName/newName in Rename. Add a backstop scrub for
AssignVolumeRequest.Path so async flush paths cannot reintroduce
invalid bytes from a pre-sanitization cached FullPath.
- In weed/pb.shouldInvalidateConnection, detect client-side marshal
errors via the gRPC library's "error while marshaling" prefix and
return false: the connection is healthy, only the request is bad.
Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231
* fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe)
Sanitized filenames flow downstream into HTTP URLs (volume-server uploads,
filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string
delimiter and would split the path the first time the name lands in one,
so swap every invalid-UTF-8 replacement to '_'. This covers the two
pre-existing sites in weed/util/fullpath.go as well, keeping all paths
sanitized the same way.
* refactor(pb): detect client-side marshal errors via errors.As, not substring
Replace the raw `strings.Contains(err.Error(), ...)` check with a
type-based carve-out: use errors.As against the `GRPCStatus() *Status`
interface to pull the original Status out of any fmt.Errorf("...: %w")
wrapping, then match the library-owned "grpc:" prefix on that Status's
Message.
Why not errors.Is against a proto-level sentinel: gRPC's encode()
collapses the inner proto error with "%v" (stringification) before
wrapping it in a Status, so the original error type does not survive
into the caller. The Status itself is the structural signal that does
survive.
Why not status.FromError: when the caller wraps the Status error with
fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message
with the full err.Error() of the outermost wrapper, which defeats a
prefix check on the library-owned message. errors.As gives us the
original Status whose Message is still verbatim from the gRPC library.
A new test asserts that a plain errors.New("grpc: error while marshaling: …")
— i.e. the same text attached to something that is NOT a gRPC status —
does not short-circuit invalidation, so we never silently keep a cached
connection alive based on a coincidental substring match.
* refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized
Addresses review feedback on PR #9207.
Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName,
Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes
through a single util.SanitizeUTF8Name helper, so the replacement char
('_' — URL-safe) is chosen in one place.
Outside-diff: three proto fields took raw FullPath strings that could
break marshaling if an entry ever carried invalid UTF-8
(CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in
Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The
reviewer's suggested fix — using DirAndName() — would have silently
changed Directory from parent to grandparent, because DirAndName
sanitizes only the trailing component. Added FullPath.Sanitized(), which
scrubs every component, and applied it at the three sites. Exposure is
narrow in practice (FUSE-boundary sanitization and the gRPC-side
isClientSideMarshalError carve-out already cover the #9139 cascade),
but the defense-in-depth is cheap and consistent with the existing
AssignVolume backstop.
New tests in weed/util/fullpath_test.go document:
- SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes
become '_' (not '?', which is URL-special).
- FullPath.Sanitized: scrubs bytes in any component, not just the last.
- FullPath.DirAndName: dir remains raw on purpose — callers needing a
clean full path must use Sanitized(). The test pins this behavior so
it is not accidentally "fixed" in a way that changes the (dir, name)
semantics callers depend on.
TestPosixFileLocking/ConcurrentLockContention failed in CI (run
24857323067) with ENOENT when re-opening the file after all 8 workers
had successfully written and closed. The 20s openWithRetry budget was
exhausted, pointing at a real but unproven metaCache/parent-cache
coherence issue in the mount under bursts of concurrent Release.
Test: hold the initial fd open for the whole subtest; use it for the
post-workers Sync() and the verification read. Workers still exercise
the concurrent-flock invariant and per-record write correctness; the
re-open path is no longer load-bearing. On Eventually failure, dump
ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future
recurrence has state to debug from. Drop the darwin-only ENOENT
t.Skip branches that hid this same flake.
Mount: in weedfs.lookupEntry, when returning ENOENT from the
"parent cached but child missing" branch, log at Warningf instead of
V(4) when the kernel is still tracking this path's inode. That
combination is the smoking-gun signal for cache drift and is rare
enough in normal use not to spam the log.
* fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202)
With `global.seaweedfs.enableSecurity=true` and the default `s3.httpsPort=0`,
the chart was unconditionally passing `-cert.file` / `-key.file` to the S3
frontend. In `weed/command/s3.go`, when `tlsPrivateKey != ""` and
`portHttps == 0`, the server promotes its main `-port` (8333 by default) into
an HTTPS listener. The pod's readiness / liveness probes still use
`scheme: HTTP`, so every kubelet probe produces
http: TLS handshake error from <node-ip>:<port>: client sent an HTTP
request to an HTTPS server
in the pod log, as reported in #9202. `enableSecurity=true` is supposed to
activate security.toml / gRPC mTLS, not silently flip the S3 HTTP port to
HTTPS.
Move the `seaweedfs.s3.tlsArgs` include inside the `if httpsPort` guard in
all three templates that wire up an S3 frontend (standalone S3 deployment,
filer with S3 sub-server, all-in-one deployment). The TLS cert args are now
emitted only when the user explicitly opts into an HTTPS port; the main
`-port` stays HTTP so probes work.
Also add a regression test to `.github/workflows/helm_ci.yml` that renders
all three templates with and without `httpsPort` and asserts the cert/key/
`-port.https` args are emitted together or not at all.
* test(helm): add bash -n parse check to the S3 TLS-gating regression test
Addresses gemini-code-assist review comment on #9206 flagging a potential
"dangling backslash" shell-syntax risk in the rendered all-in-one command
script when httpsPort is set but most S3/SFTP args are defaulted off. In
practice bash -n accepts a trailing `\<newline><EOF>` (it's line-continuation
to an empty line), so no current rendering is broken. Locking that contract
down in CI so a future helper change that leaves a dangling backslash — or
any other shell-syntax regression in the rendered command — fails loudly
instead of silently shipping broken pods.
* fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199)
The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and
does not register with portmap. Linux mount.nfs queries portmap on port 111
first, so the plain `mount -t nfs host:/export /mnt` form failed with
"portmap query failed" / "requested NFS version or transport protocol is
not supported" against a default `weed nfs` deployment.
- Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with
TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and
proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses.
Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port.
- New CLI flag `-portmap.bind` (empty, disabled by default) to opt into
the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE
and must not collide with a system rpcbind.
- Extended `weed nfs -h` help with the two supported ways to mount from
Linux (client-side portmap bypass, or server-side `-portmap.bind`).
- Startup log now prints a copy-pasteable mount command tailored to
whether portmap is enabled.
Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP
round-trip against the real listener.
Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0`
the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock
host:/export /mnt`) now succeeds and both read and write work.
* fix(nfs): harden portmap responder per review feedback (#9201)
Addresses three review findings on the portmap responder:
- parseRPCCall: validate opaque_auth length against the record limit
before applying the XDR 4-byte padding, so a near-uint32-max authLen
can no longer overflow (authLen + 3) and bypass the bounds check.
(gemini-code-assist)
- serveTCP/Close: track live TCP connections and evict them on Close()
so shutdown does not block on idle clients waiting for the read
deadline to trip. serveTCP also no longer tears the listener down on
a non-fatal Accept error (e.g. EMFILE); it logs and retries after a
small back-off. Replaces the atomic.Bool closed flag with a
mutex-guarded one so closed, conns, and the shutdown transition stay
consistent. (coderabbit, minor)
- handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s
in-flight) so a peer that opens the privileged port 111 and stalls
cannot pin a goroutine indefinitely. (coderabbit, major)
Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP
connection idle and asserts Close() returns within 2s (well under the
30s idle deadline) and that the client sees the eviction.
All existing tests still pass, including under -race.
* fix(nfs): keep portmap UDP responder alive on transient read errors (#9201)
- serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and
continue instead of returning. Matches how serveTCP now treats
non-fatal Accept errors so a transient network blip doesn't take
UDP portmap down until restart. (coderabbit)
- Rename portmapAcceptBackoff -> portmapRetryBackoff now that both
paths use it.
- pmapProcDump: fix the pre-allocation capacity to match the actual
encoding (20 bytes per entry + 4-byte terminator), replacing the
old over-estimate of 24 per entry. No behavior change; just
documents intent. (coderabbit nit)
* docs(nfs): clarify encodeAcceptedReply body semantics (#9201)
The prior comment said body is "nil when the accept_stat is itself an
error", which was misleading: the PROG_MISMATCH branch already passes
an 8-byte mismatch_info body. Rewrite to enumerate which error
accept_stat values omit the body and call out PROG_MISMATCH as the
exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit)
* fix(nfs): make portmap retry backoff interruptible by Close() (#9201)
serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a
non-fatal listener error. If Close() races in during that sleep, the
goroutine can't be interrupted, so Close() has to wait out the
remaining backoff before wg.Wait() returns.
Add a done channel that Close() closes once, and replace both
time.Sleep calls with a select on ps.done + time.After. The window
was tiny in practice but the select makes shutdown strictly bounded
by Close()'s own work. (coderabbit nit)
* fix(storage): use ceil division for EC shard slots in maxVolumeCount
* fix(topology): use ceil division for EC shard slots consistently
Applies the same ceiling-division formula used in store.go to the
four remaining master-side sites that computed volume-slots consumed
by EC shards with off-by-one approximations:
- disk.go ToDiskInfo / Disk.ToDiskInfo used (n+1)/d, which under-counts
slots for non-multiples of DataShardsCount, over-reporting
FreeVolumeCount.
- DiskUsageCounts.FreeSpace and NodeImpl.AvailableSpaceFor subtracted
n/d + 1, which over-counts slots at multiples of DataShardsCount,
under-reporting free space (and suppressing volume growth on nodes
that still had room).
All four now use (n + DataShardsCount - 1) / DataShardsCount, matching
store.go:393, store.go:810, and command_ec_decode.go:422.
* refactor(topology): extract ecShardSlots helper
Deduplicates the (n + DataShardsCount - 1) / DataShardsCount ceiling
expression now used by ToDiskInfo, DiskUsageCounts.FreeSpace,
Disk.ToDiskInfo, and AvailableSpaceFor. Addresses PR review feedback.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* fix(admin): use basePath for API fetches when urlPrefix is set
* fix(admin): drop duplicate iam-utils script on Groups page
* fix(admin): route topics page fetches through basePath
The Topics page missed two fetch() calls that still used root-relative
URLs, so create-topic and view-details still broke when -urlPrefix was
set.
---------
Co-authored-by: Maksim Babkou <maksim.babkou@innovatrics.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
* fix(filer.meta.tail): include extended metadata in Elasticsearch docs
The -es sink flattened only the FUSE attributes, so xattrs (including S3
user metadata like X-Amz-Meta-*) never reached Elasticsearch. Add an
Extended field and convert map[string][]byte to map[string]string so the
values index as text; non-UTF-8 values fall back to base64.
Addresses #9190 follow-up.
* fix(filer.meta.tail): prefix base64-encoded extended values with "base64:"
Addresses review feedback: a plain UTF-8 xattr and a base64 fallback are otherwise indistinguishable to a consumer reading the ES doc.
The memory credential store backs LoadConfiguration with a map, so the
identity order is not stable across a save/load round trip. Indexing
Identities[1] intermittently pointed at the owner identity and produced
a spurious credential leak.
Marp-based markdown deck walking through the three-layer production
topology (masters, volumes, filers + DB, S3 gateways, admin + workers)
plus an erasure-coding note. Uses the object-store-layout diagram on the
overview slide. Makefile renders PDF/HTML/PPTX via marp-cli.
* fix(filer/remote): keep re-cache work alive past caller cancellation (#9174)
For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot
finish before the S3 gateway's initial cache wait elapses. When it
does, the gRPC ctx cancellation cascades into the filer's chunk
downloads, the error path calls DeleteUncommittedChunks on every chunk
already written, and the next retry starts over. boto3 splitting the
GET into concurrent ranges (or any client tear-down on first failure)
shortens the window between retries, so the loop never converges.
Detach the caller's ctx with context.WithoutCancel before invoking
the singleflight work so the download runs to completion regardless
of client cancellations. Subsequent waiters — via the in-flight
singleflight, or a fresh retry landing after completion — observe the
cached entry and stream normally.
Same detach pattern is used in filer_server_handlers_write.go:53 and
volume_server_handlers_write.go:51.
* simplify rationale comment
* switch to DoChan so handler can return on caller cancel
Do keeps the handler goroutine blocked for the full detached download
even after the client is gone. DoChan lets the handler select on
ctx.Done() and exit immediately; the singleflight goroutine continues
on bgCtx and the next request either joins it or finds the entry
cached.
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard
ReceiveFile for an EC shard calls os.Create(filePath) which opens the
path with O_TRUNC. When the shard is already mounted, the in-memory
EcVolume holds a file descriptor against the same inode, so a second
ReceiveFile call for the same (volume, shard) truncates the live shard
file beneath the reader.
Reproducer: generate and mount shard 0 for a populated volume, capture
the on-disk size, then send a smaller payload for the same shard via
ReceiveFile. The current handler accepts the overwrite and leaves the
shard truncated in place; this test pins that behavior. When the fix
lands the server should reject (or rename-then-swap) and this test
must be inverted.
* fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard
ReceiveFile used os.Create on EC shard paths, which opens with
O_TRUNC and truncates in place. When an EC shard is already
mounted, the in-memory EcVolume holds file descriptors against the
same inodes, so the truncation corrupts the live shard beneath any
ongoing read. On retries of an EC task this produced the "missing
parts" class of errors in #9184.
The fix rejects any ReceiveFile for an EC volume that currently
has mounted shards. The caller must unmount before retrying —
silent truncation is never an acceptable outcome. Non-EC writes and
ReceiveFile for volumes that have never been mounted on this server
continue to work as before.
Tests:
- TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard,
attempts an overwrite, asserts the error response and that the
on-disk file and live reads are undisturbed.
- TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case
contract that a first write to a target still succeeds.
* fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard
Mirror the Go-side change: reject receive_file for any EC volume that
currently has mounted shards on this server. std::fs::File::create
truncates in place and the in-memory EcVolume holds fds on the same
inodes, so an overwrite would corrupt live readers.
* test(erasure_coding): reproduce #9184 deleteOriginalVolume swallowing errors
ErasureCodingTask.deleteOriginalVolume logs a warning when any replica
VolumeDelete fails and then returns nil, so the EC task reports
success to the admin even when a source replica survives. That stale
replica lets a later detection scan re-propose the same volume and,
once retried, drives the mounted-shard-truncation corruption that
issue 9184 also describes.
Reproducer: wire one reachable replica (succeeds) and one unreachable
replica (fails) and assert the function currently returns nil. After
the fix the function must surface the replica failure so the task is
retried rather than marked done, and this test needs to be inverted.
* fix(erasure_coding): surface replica delete failures from EC task
ErasureCodingTask.deleteOriginalVolume previously logged a warning
and returned nil when any VolumeDelete against a source replica
failed. The EC task therefore reported overall success to the admin
even when a source replica stayed on disk, which let a later
detection scan propose a duplicate EC encoding of the same volume.
The retry then walked the ReceiveFile path against servers that
already had mounted EC shards for the volume, truncating the live
shard files in place (the other half of #9184).
This change returns an error describing the per-replica failures
after the best-effort delete pass, so the task is marked failed
instead of silently moving on. Successful deletes are still applied
(per-replica progress is preserved); only the final return changes.
When combined with the ReceiveFile mount-safety check, a stuck
original replica now produces loud, actionable failures instead of
silent corruption.
Tests:
- TestDeleteOriginalVolumeSurfacesReplicaFailures: asserts an error
is returned and names the unreachable replica, while the reachable
replica still gets deleted.
- TestDeleteOriginalVolumeSucceedsWhenAllReplicasReachable: pins the
happy path.
* feat(iam): support caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey
Both IAM implementations (standalone and embedded) now check for
caller-supplied AccessKeyId and SecretAccessKey form parameters before
generating random credentials. If provided, the caller-supplied values
are used. If empty, random keys are generated as before.
This enables programmatic identity provisioning where the caller needs
to control the S3 credentials.
Backward-compatible: no behavior change for callers that omit these
parameters.
* refactor(iam): extract shared caller-supplied credential validation
Move the AccessKeyId/SecretAccessKey format checks and the in-memory
collision scan into weed/iam so the standalone IAM API, the embedded
IAM in s3api, and the admin dashboard all enforce the same rules.
- ValidateCallerSuppliedAccessKeyId: 4-128 alphanumeric (rejects
SigV4-breaking characters like '/' and '=').
- ValidateCallerSuppliedSecretAccessKey: 8-128 chars.
- FindAccessKeyOwner: scans identities and service accounts and
returns the owning entity type + name for debug logging, without
exposing the owner in caller-facing error messages.
The admin dashboard previously only length-checked caller-supplied
keys; it now enforces the same alphanumeric rule, which matches what
SigV4 actually accepts anyway.
* fix(iam): reject partial caller-supplied AccessKeyId/SecretAccessKey
Previously, if a caller supplied only one of AccessKeyId or
SecretAccessKey, CreateAccessKey logged a warning and auto-generated
the missing half. That silently returns a credential the caller did
not fully choose, which is surprising and easy to miss in a response
they expected to echo back their input.
Return ErrCodeInvalidInputException instead: either both are supplied
or neither is. Updates the mixed-supply tests in weed/iamapi and
weed/s3api to assert the rejection.
* chore(iam): centralize and broaden sensitive form redaction
DoActions and ExecuteAction both had an inline loop that redacted
SecretAccessKey from their debug-level request log. Replace the two
copies with iam.RedactSensitiveFormValues, backed by an explicit
sensitive-keys set.
The set now also covers Password, OldPassword, NewPassword,
PrivateKey, and SessionToken. None of those parameters are used by
today's IAM actions, but naming them here makes the log-safety
guarantee survive future additions such as LoginProfile / STS.
* test(iam): cover the upper length bound for CreateAccessKey
TestCreateAccessKeyBoundary / TestEmbeddedIamCreateAccessKeyBoundary
only exercised the 3/4-char lower edge. Add cases for 128 (accepted)
and 129 (rejected) for AccessKeyId, plus 7 / 128 / 129-char cases
for SecretAccessKey, so both ends of the validator are locked in at
the handler level (the pure validators in weed/iam already cover
this).
* fix(s3api/iam): verify user existence before RNG and collision scan
In the embedded IAM CreateAccessKey, the user lookup ran last: a
request for a non-existent user still walked the whole identity /
service-account list for collisions and, if no caller-supplied keys
were present, generated fresh random credentials with crypto/rand
before the NoSuchEntity error finally surfaced.
Reorder: validate inputs, then find the target identity, then do the
collision scan, then generate keys. A missing user now fails fast and
consumes no entropy, and the handler returns NoSuchEntity instead of
a misleading EntityAlreadyExists when both the user is missing and
the supplied AccessKeyId happens to collide with another identity's
key.
Add TestEmbeddedIamCreateAccessKeyRejectsMissingUser to lock in the
"no mutation on unknown user" guarantee.
The standalone iamapi CreateAccessKey intentionally keeps its
pre-existing "create-or-attach" semantics where a missing user is
implicitly provisioned — that is a behavior change beyond the scope
of this PR.
* test(iam): tighten collision leak assertion and cover 8-char secret
- Rename the collision-owner identity in TestCreateAccessKeyRejectsCollision
(both iamapi and the embedded s3api test) from "existing" / "ExistingUser"
to "ownerAlpha". The old assert.NotContains check was effectively a no-op
because the error message never contained those substrings; a distinctive
name shared with no part of the expected error body makes the leak guard
actually meaningful if the wording ever drifts. The embedded test also
adds a NotContains assertion that was previously missing entirely.
- Add an explicit 8-char SecretAccessKey pass case to both boundary tests
so the lower edge of the validator is locked in at the handler level
alongside the 7 / 128 / 129-char cases.
* fix(iamapi): enforce both-or-none before the collision lookup
In the standalone IAM CreateAccessKey, FindAccessKeyOwner ran before
the partial-credential check. If a caller supplied only AccessKeyId
and it happened to collide with an existing key, the response was
EntityAlreadyExists instead of the more fundamental InvalidInput for
omitting SecretAccessKey — wrong error class, and leaked the fact
that the probed key is already in use.
Swap the order: validate both-or-none first, then do the collision
scan. Matches the embedded IAM path and AWS behavior.
Add a case to TestCreateAccessKeyRejectsPartialSupply that combines
partial supply with a collision to lock in the ordering.
* fix(admin): reject partial caller-supplied AccessKey/SecretKey
The admin dashboard path silently generated the missing half when a
caller supplied only one of AccessKey or SecretKey, while the IAM
API and embedded IAM paths now reject this. Align the three: if
exactly one is provided, return ErrInvalidInput.
Also simplifies the generator block — either both are provided or
neither is, so there is no mixed path to handle.
* test(s3api/iam): guard dereferences in caller-supplied-keys test
TestEmbeddedIamCreateAccessKeyWithCallerSuppliedKeys dereferenced
*AccessKeyId/*SecretAccessKey/*UserName and indexed
Identities[0].Credentials[0] without first verifying shape, so any
future regression that returns a partial response or skips the
config mutation would panic mid-assertion instead of failing with
a clear message.
Add require.NotNil on the response pointers and require.Len on
the identities/credentials slices before the asserts.
* test(iamapi): exercise the service-account branch of the collision check
FindAccessKeyOwner scans both Identities[*].Credentials and
ServiceAccounts[*].Credential, but TestCreateAccessKeyRejectsCollision
only covered the identity branch. Split the test into two subtests —
one per branch — so a future refactor that drops the service-account
scan (or mutates the existing credential) trips a failure.
Also asserts the existing service-account credential is not mutated
and no credential is attached to the target identity on rejection.
* test(iam): isolate 129-char secret subcase from prior credential
In both TestCreateAccessKeyBoundary (iamapi) and
TestEmbeddedIamCreateAccessKeyBoundary (s3api), the 129-char
SecretAccessKey subcase reused the "validkey" AccessKeyId that the
preceding 8-char subcase had just persisted into the config. The
test still asserted the right outcome because the handler validates
secret length before running the collision scan — but if the two
checks ever swap, the subcase would pass (or fail) for the wrong
reason.
Reset the in-memory credentials before the 129-char subcase, matching
the pattern already used by the 3/128/129-char AccessKeyId and
7-char secret subcases. No behavior change; purely test isolation.
---------
Co-authored-by: Chris Lu <chris.lu@gmail.com>
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.
Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch>
TestPosixFileLocking/FcntlReleaseOnClose was flaky because the
subprocess spawned by startLockHolder occasionally saw ENOENT when
opening a file the parent had just created on the FUSE mount. Retry on
ENOENT (matching the existing openWithRetry pattern used in
testConcurrentLockContention) so the subprocess waits for the mount's
dentry state to propagate before reporting the lock acquire as failed.
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug
The plugin-worker EC task sends shards via ReceiveFile, which picks
Locations[0] as the target directory regardless of the admin planner's
TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there
is no wire channel to honor the plan.
Adds StartSingleVolumeClusterWithDataDirs to the integration framework
so tests can launch a volume server with N data directories. The new
repro asserts the current (buggy) behavior: sending three distinct EC
shards via ReceiveFile leaves all three files in dir[0] and the other
dirs empty. When the fix adds disk_id to ReceiveFileInfo, this
assertion must flip to verify the planned placement is respected.
* fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement
Before this change, VolumeServer.ReceiveFile for EC shards always
selected the first HDD location (Locations[0]). The plugin-worker EC
task had no way to pass the admin planner's per-shard disk
assignment — ReceiveFileInfo carried no disk_id field — so every
received EC shard piled onto a single disk per destination server.
On multi-disk servers this caused uneven load (one disk absorbing all
EC shard I/O), frequent ENOSPC retries, and a growing EC backlog
under sustained ingest (see issue #9184).
Changes:
- proto: add disk_id to ReceiveFileInfo, mirroring
VolumeEcShardsCopyRequest.disk_id.
- worker: DistributeEcShards tracks the planner-assigned disk per
shard; sendShardFileToDestination forwards that disk id. Metadata
files (ecx/ecj/vif) inherit the disk of the first data shard
targeting the same node so they land next to the shards.
- server: ReceiveFile honors disk_id when > 0 with bounds
validation; disk_id=0 (unset) falls back to the same
auto-selection pattern as VolumeEcShardsCopy (prefer disk that
already has shards for this volume, then any HDD with free space,
then any location with free space).
Tests updated:
- TestReceiveFileEcShardHonorsDiskID asserts three shards sent with
disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively.
- TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range
disk_id rejection path.
* fix(volume-rust): honor disk_id in ReceiveFile for EC shards
Mirror the Go-side change: when disk_id > 0 place the EC shard on the
requested disk; when unset, auto-select with the same preference order
as volume_ec_shards_copy (disk already holding shards, then any HDD,
then any disk).
* fix(volume): compare disk_id as uint32 to avoid 32-bit overflow
On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a
high-bit uint32 to a negative int, bypassing the bounds check before the
index operation. Compare in the uint32 domain instead.
* test(ec): fail invalid-disk_id test on transport error
Previously a transport-level error from CloseAndRecv silently passed the
test by returning early, masking any real gRPC failure. Fail loudly so
only the structured ReceiveFileResponse rejection path counts as a pass.
* docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test
Documents the load-bearing assumption that shards are never mounted in
this test, so loc.FindEcVolume always returns false and auto-select
falls through to the first HDD. Saves future readers from re-deriving
the expected directory for the DiskId=0 case.
* fix(test): preserve baseDir/volume path for single-dir clusters
StartSingleVolumeClusterWithDataDirs started naming the data directory
volume0 even in the dataDirCount=1 case, which broke Scrub tests that
reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile /
CorruptEcxFile. Keep the legacy name for single-dir clusters; only use
the indexed "volumeN" layout when multiple disks are requested.
fix(filer.meta.tail): error instead of silently dropping events when -es is used without elastic build tag
The default chrislusf/seaweedfs image builds without the `elastic` build
tag, so sendToElasticSearchFunc was a no-op that returned a function
discarding every event. Users passing -es saw the subscription wire up
in filer logs but nothing ever reached Elasticsearch.
Return an error explaining the binary wasn't built with ES support and
pointing at the build flag. The caller already prints the error and
exits, so users now get an immediate, actionable message.
Fixes#9190
AllocateMiniPorts(1) reserved masterPort and masterPort+GrpcPortOffset
by holding listeners open, but closed them on return. The subsequent
AllocatePorts call bound 127.0.0.1:0, so the OS could immediately reuse
the just-released mini gRPC port as a volume port — causing the volume
server to fail at bind time with "address already in use".
Introduce AllocatePortSet(miniCount, regularCount) that holds every
listener open until the full set is chosen, and route the five volume
test cluster builders through it.
createTableBucket ran `weed shell` via exec.Command with no deadline.
When the shell's first command retries on a transient master connection
blip, the trailing `exit` on stdin never gets processed and the
subprocess blocks until the outer 20m `go test` timeout fires — the
surfacing symptom is a flaky 20m panic with no diagnostic output.
Wrap the invocation in exec.CommandContext with a 30s timeout, matching
the existing pattern in test/s3tables/catalog_risingwave/setup_test.go.
The filer-side mount peer registry (tier 1 of peer chunk sharing) was
gated behind -mount.p2p (default true). Idle cost is negligible — a
tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the
surface area.
Removes the flag from weed filer, weed server (-filer.mount.p2p), and
weed mini, and always constructs the registry in NewFilerServer. Also
drops the now-dead nil guards in MountRegister/MountList/sweeper and
the TestMountRegister_DisabledIsNoOp case.
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin
S3 and filer already use a refreshing pemfile provider for their HTTPS
cert, so rotated certificates (e.g. from k8s cert-manager) are picked up
without a restart. Master, volume, webdav, and admin, however, passed
cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at
startup — rotating those certs required a pod restart.
Add a small helper NewReloadingServerCertificate in weed/security that
wraps pemfile.Provider and returns a tls.Config.GetCertificate closure,
then wire it into the four remaining HTTPS entry points. httpdown now
also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates
but CertFile/KeyFile are empty, so volume server can pre-populate
TLSConfig.
A unit test exercises the rotation path (write cert, rotate on disk,
assert the callback returns the new cert) with a short refresh window.
* refactor(security): route filer/s3 HTTPS through the shared cert reloader
Before: filer.go and s3.go each kept a *certprovider.Provider on the
options struct plus a duplicated GetCertificateWithUpdate method. Both
were loading pemfile themselves. Behaviorally they already reloaded, but
the logic was duplicated two ways and neither path was shared with the
newly-added master/volume/webdav/admin wiring.
After: both use security.NewReloadingServerCertificate like the other
servers. The per-struct certProvider field and GetCertificateWithUpdate
method are removed, along with the now-unused certprovider and pemfile
imports. Net: -32 lines, one code path for all HTTPS cert reloading.
No behavior change — the refresh window, cache, and handshake contract
are identical (the helper wraps the same pemfile.NewProvider).
* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc
The HTTP client in weed/util/http/client loaded the mTLS client cert
once at startup via tls.LoadX509KeyPair. That left every long-lived
HTTPS client process (weed mount, backup, filer.copy, filer→volume,
s3→filer/volume) unable to pick up a rotated client cert without a
restart — even though the same cert-manager setup was already rotating
the server side fine.
Swap the client cert loader for a tls.Config.GetClientCertificate
callback backed by the same refreshing pemfile provider. New TLS
handshakes pick up the rotated cert; in-flight pooled connections keep
their old cert and drop as normal transport churn happens.
To keep this reusable from both server and client TLS code without an
import cycle (weed/security already imports weed/util/http/client for
LoadHTTPClientFromFile), extract the pemfile wrapper into a new
weed/security/certreload subpackage. weed/security keeps its thin
NewReloadingServerCertificate wrapper. The existing unit test moves
with the implementation.
gRPC mTLS was already handled by security.LoadServerTLS /
LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ
agent, Kafka gateway, and FUSE mount control plane are gRPC-only and
therefore already rotate.
CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted
as a known limitation in the wiki.
* fix(security): address PR review feedback on cert reloader
Bots (gemini-code-assist + coderabbit) flagged three real issues and a
couple of nits. Addressing them here:
1. KeyMaterial used context.Background(). The grpc pemfile provider's
KeyMaterial blocks until material arrives or the context deadline
expires; with Background() a slow disk could hang the TLS handshake
indefinitely. Switched both the server and client callbacks to use
hello.Context() / cri.Context() so a stuck read is bounded by the
handshake timeout.
2. Admin server loaded TLS inside the serve goroutine. If the cert was
bad, the goroutine returned but startAdminServer kept blocking on
<-ctx.Done() with no listener, making the process look healthy with
nothing bound. Moved TLS setup to run before the goroutine starts
and propagate errors via fmt.Errorf; also captures the provider and
defers Close().
3. HTTP client discarded the certprovider.Provider from
NewClientGetCertificate. That leaked the refresh goroutine, and
NewHttpClientWithTLS had a worse case where a CA-file failure after
provider creation orphaned the provider entirely. Added a
certProvider field and a Close() method on HTTPClient, and made
the constructors close the provider on subsequent error paths.
4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain
the provider. filer and webdav run ServeTLS synchronously, so a
plain defer works. master/volume/s3 dispatch goroutines and return
while the server keeps running, so they hook Close() into
grace.OnInterrupt.
5. Test: certreload_test now tolerates transient read/parse errors
during file rotation (writeSelfSigned rewrites cert before key) and
reports the last error only if the deadline expires.
No user-visible behavior change for the happy path.
* test(tls): add end-to-end HTTPS cert rotation integration test
Boots a real `weed master` with HTTPS enabled, captures the leaf cert
served at TLS handshake time, atomically rewrites the cert/key files
on disk (the same rename-in-place pattern kubelet does when it swaps
a cert-manager Secret), and asserts that a subsequent TLS handshake
observes the rotated leaf — with no process restart, no SIGHUP, no
reloader sidecar. Verifies the full path: on-disk change → pemfile
refresh tick → provider.KeyMaterial → tls.Config.GetCertificate →
server TLS handshake.
Runtime is ~1s by exposing the reloader's refresh window as an env
var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the
test. The same env var is user-facing — documented in the wiki — so
operators running short-lived certs (Vault, cert-manager with
duration: 24h, etc.) can tighten the rotation-pickup window without a
rebuild. Defaults to 5h to preserve prior behavior.
security.CredRefreshingInterval is kept for API compatibility but now
aliases certreload.DefaultRefreshInterval so the same env controls
both gRPC mTLS and HTTPS reload.
* ci(tls): wire the TLS rotation integration test into GitHub Actions
Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner,
Go 1.25, build weed, run `go test` in test/tls_rotation, upload master
logs on failure. 10-minute job timeout; the test itself finishes in
about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms
inside the test.
Runs on every push to master and on every PR to master.
* fix(tls): address follow-up PR review comments
Three new comments on the integration test + volume shutdown path:
1. Test: peekServerCert was swallowing every dial/handshake error,
which meant waitForCert's "last err: <nil>" fatal message lost all
diagnostic value. Thread errors back through: peekServerCert now
returns (*x509.Certificate, error), and waitForCert records the
latest error so a CI flake points at the actual cause (master
didn't come up, handshake rejected, CA pool mismatch, etc.).
2. Test: set HOME=<tempdir> on the master subprocess. Viper today
registers the literal path "$HOME/.seaweedfs" without env
expansion, so a developer's ~/.seaweedfs/security.toml is
accidentally invisible — the test was relying on that. Pinning
HOME is belt-and-braces against a future viper upgrade that does
expand env vars.
3. volume.go: startClusterHttpService's provider close was registered
via grace.OnInterrupt, which fires on SIGTERM but NOT on the
v.shutdownCtx.Done() path used by mini / integration tests. The
pemfile refresh goroutine leaked in that shutdown path. Now the
helper returns a close func and the caller invokes it on BOTH
shutdown paths for parity.
Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the
ast-grep static-analysis nit — zero-risk since the pool only trusts
our in-memory CA.
Test runs clean 3/3.
* fix(s3/shell): include EC volumes in bucket size metrics and collection.list
S3 bucket size metrics exported to Prometheus (and fed through
stats.UpdateBucketSizeMetrics) are computed by
collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos.
As soon as a volume was encoded to EC it dropped out of every aggregate,
so Grafana showed bucket sizes shrinking while physical disk usage kept
climbing. The shell helper collectCollectionInfo — used by collection.list
and s3.bucket.quota.enforce — had the same gap, with the EC branch left as
a commented-out TODO.
Fold EC shards into both paths using the same approach the admin dashboard
already uses (PR #9093):
- PhysicalSize / Size sum across shard holders: EC shards are node-local
(not replicas), so per-node TotalSize() and MinusParityShards().TotalSize()
sum to the whole-volume physical and logical sizes respectively.
- FileCount is deduped via max across reporters (every shard holder reports
the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and
must not pin the aggregate).
- DeleteCount is summed (each delete tombstones exactly one node's .ecj).
- VolumeCount increments once per unique EC volume id.
Adds regression tests covering pure-EC, mixed regular+EC, and the
slow-reporter FileCount dedupe case.
Refs #9086
* Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe
- Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding
package that walk the shard bitmap directly instead of materializing a
ShardsInfo and copying it via MinusParityShards(). Keeps the
DataShardsCount dependency encapsulated in one place and avoids the
per-shard allocation/copy overhead in the metrics hot path.
- Switch shell collectCollectionInfo ecVolumes map to a composite
{collection, volumeId} key, matching the bucket_size_metrics collector
and defending against any cross-collection volume id aliasing.
- Dedupe VolumeCount in shell addToCollection by volume id so regular
volumes aren't counted once per replica presence. Aligns the shell's
collection.list output with the S3 metrics collector and the EC branch,
all of which now report logical volume counts.
- Add unit tests for the new helpers and for the regular-volume
VolumeCount dedupe.
* Parameterize EcShardsDataSize with dataShards for custom EC ratios
Add a dataShards parameter to EcShardsDataSize so forks with per-volume
ratio metadata (e.g. the enterprise data_shards field carried on an
extended VolumeEcShardInformationMessage) can pass the configured value
and get accurate logical sizes under custom EC policies like 6+3 or 16+6.
Passing 0 or a negative value falls back to the upstream DataShardsCount
default, which is correct for the fixed 10+4 layout — so OSS callers in
s3api and shell pass 0 and keep their current behavior.
Added table cases covering the custom 6+3 and 16+6 paths so the
parameterization is pinned by tests.