mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-22 17:51:30 +00:00
ca4c8d977dc76faa911ba7441d65c2bdddb0dde8
13604 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ca4c8d977d |
docs: add SeaweedFS-aware Parquet pushdown design draft
Draft design note for keeping Parquet files compatible with existing engines (Spark, Trino, DuckDB, PyArrow, Iceberg) while building auxiliary side indexes and a pushdown API for SeaweedFS-aware clients. Covers physical layout, side index types (footer cache, zone map, page index, bloom, bitmap, B-tree, inverted, vector), pushdown API sketch, Iceberg delete handling, and a phased rollout plan. |
||
|
|
5eead9409a |
fix(admin): S3 Tables CSRF token + non-empty 409 status (#9221)
* fix(admin): attach CSRF token to S3 Tables write requests Several POST/PUT/DELETE calls in s3tables.js were sent without an X-CSRF-Token header while the corresponding handlers in weed/admin/dash/s3tables_management.go enforce CSRF via requireSessionCSRFToken, so authenticated users hit "invalid CSRF token" on actions like creating a table bucket (#9220), updating policies, and managing tags. Add an s3tWriteHeaders helper that pulls the token from the existing csrf-token meta tag and use it on every write to /api/s3tables/buckets, /bucket-policy, /tables, /table-policy, and /tags. The Iceberg-page write paths already attached the token and are unchanged. Fixes #9220 * fix(admin): map BucketNotEmpty/NamespaceNotEmpty to 409 for S3 Tables DELETE on a non-empty table bucket or namespace returned HTTP 500 because s3TablesErrorStatus didn't list ErrCodeBucketNotEmpty or ErrCodeNamespaceNotEmpty in its conflict case, even though the backend handler emits them with 409 Conflict (matching AWS S3 Tables). Add both codes to the existing conflict mapping. * refactor(admin): route Iceberg S3 Tables writes through s3tWriteHeaders Iceberg namespace/table create and Iceberg table delete were still hand-rolling CSRF headers. Replace those blocks with the existing s3tWriteHeaders() helper so every S3 Tables write uses the same code path. Drop the now-unused csrfTokenInput.value population in initIcebergNamespaces and initIcebergTables (the templ hidden inputs have no server-rendered value, and nothing reads the input now that the JS reads the token from the meta tag via getCSRFToken()). |
||
|
|
a14cbc176b | debug(kafka): add restart flake diagnostics | ||
|
|
f1f720f5da |
fix(master): register EC shards per physical disk on full heartbeat sync (#9212) (#9219)
* refactor(types): add DiskId type for physical-disk identifiers Names the uint32 physical-disk index that volume servers carry in VolumeEcShardInformationMessage / VolumeInformationMessage, so EC shard tracking that needs to distinguish disks within a DataNode can use a dedicated type instead of an untyped uint32. No behaviour change. * fix(master): register EC shards per physical disk on full heartbeat sync (#9212) When a volume's EC shards are spread across multiple physical disks on the same volume server (common after ec.balance / ec.rebuild on multi-disk nodes), the volume server emits one VolumeEcShardInformationMessage per (disk, volume) in its heartbeat. The master's DataNode.UpdateEcShards was building a `map[VolumeId]*EcVolumeInfo` with last-write-wins, and doUpdateEcShards then overwrote `disk.ecShards[vid]` once per message, so all but the final disk's shards were silently dropped. Only the topology-global ecShardMap (built via RegisterEcShards in a per-message loop) stayed correct, which hid the problem from `topo.LookupEcShards` but broke everything that reads the DataNode/Disk view — volume.list, admin UI, ec.rebuild dry-run ("only 6 shards, skipping"), and `DiskInfo.EcShardInfos` which the shell's ec.balance / ec.rebuild planners group by `eci.DiskId`. Change the shape of `Disk.ecShards` from map[VolumeId]*EcVolumeInfo to map[VolumeId]map[types.DiskId]*EcVolumeInfo so every physical disk keeps its own entry. UpdateEcShards aggregates incoming messages by (vid, diskId) rather than vid alone; Add/Delete/ HasVolumesById and HasEcShards consult the nested map; doUpdateEcShards rewrites the nested structure from the aggregated map. Per-physical-disk attribution survives through DataNode.ToDataNodeInfo -> DiskInfo.EcShardInfos, matching the wire format the volume server produces and what downstream admin tooling expects. Delta sync (AddOrUpdateEcShard / DeleteEcShard) already merged via ShardsInfo.Add, so this only affects the full-sync path that runs on heartbeat reconnect. Adds data_node_ec_multi_disk_test.go with two regression tests that fail on pre-fix master: - TestEcShardsAcrossMultipleDisksOnSameNode: volume 15 spread over 3 disks (matches the bug report's volume-2 row); asserts every shard visible via LookupEcShards, DataNode.GetEcShards, and ToDataNodeInfo's per-disk EcShardInfos entries. - TestEcShardsAfterRestartHeartbeat: minimal 2-disk full sync case. * fix(topology): tighten locking around EC shard map access Addresses review comments on #9219: * DataNode.UpdateEcShards now holds dn.Lock for the full read-diff-write cycle, matching UpdateVolumes' model, so concurrent heartbeats can no longer interleave their getOrCreateDisk / UpAdjustDiskUsageDelta updates with each other. Introduces a private getEcShardsLocked helper for reads under the held lock; renames doUpdateEcShards to doUpdateEcShardsLocked for the same reason. * DataNode.HasEcShards now takes each disk's ecShardsLock while reading disk.ecShards, closing a pre-existing map race with concurrent Add/Delete/Update writers. * doUpdateEcShardsLocked takes each disk's ecShardsLock around the reset-and-rewrite so readers (GetEcShards, HasEcShards) see a consistent map state rather than a partially-rebuilt one. * Disk.GetEcShards' slice-capacity hint now accounts for the nested per-physical-disk entries (sum of inner lengths) instead of underestimating by the unique-volume count. |
||
|
|
d65c568cbb |
fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests (#9218)
* fix(s3api): validate SSE-S3 chunk IV length; add multipart direct reader tests DeserializeSSES3Metadata does not require an IV, and a corrupted or legacy chunk without one would have flowed into cipher.NewCTR and panicked. Validate that each per-chunk IV is exactly AESBlockSize bytes before decryption, closing the current and any already-appended chunk readers on error. Factor the per-chunk decryption loop out of createMultipartSSES3DecryptedReaderDirect into buildMultipartSSES3Reader so it can be driven with a mock chunk fetcher, and add tests covering: the happy path with two parts (distinct per-chunk DEKs/IVs, out-of-order chunks) to lock in the fix from #9211; missing-IV and short-IV metadata rejection without panic; and reader cleanup when a later chunk fails. * address review: sort chunks copy; close encryptedStream on error - buildMultipartSSES3Reader now sorts a copy of the chunks slice so callers do not observe entry.Chunks reordered (other code paths, e.g. ETag computation, can rely on the original order). - createMultipartSSES3DecryptedReaderDirect now closes encryptedStream on the error path from buildMultipartSSES3Reader. All current callers pass nil, but this keeps cleanup symmetric with the success path. - Extend TestBuildMultipartSSES3Reader_PerChunkKeys to assert the input slice is not mutated. * address review: defer single close; extend chunk-copy + IV-guard pattern - createMultipartSSES3DecryptedReaderDirect: collapse the duplicated encryptedStream.Close() calls into a single nil-guarded defer so the error and success paths share cleanup. - createMultipartSSECDecryptedReaderDirect, createMultipartSSEKMSDecryptedReaderDirect: sort a copy of entry.Chunks instead of mutating the caller's slice, matching the SSE-S3 helper. - createMultipartSSECDecryptedReaderDirect: validate per-chunk IV length before handing it to cipher.NewCTR; a base64-decoded empty or short IV from malformed/corrupt metadata would otherwise panic. - SSE-KMS needs no IV guard: CreateSSEKMSDecryptedReader already calls ValidateIV before cipher.NewCTR. Note recorded in the sort comment. * address review: close appended readers on SSE-C/SSE-KMS error paths createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect only closed the current chunk reader on error and leaked any chunk readers already appended to the local readers slice, mirroring the leak previously fixed in the SSE-S3 helper. Add the same closeAppendedReaders() closure pattern to both functions and invoke it on every error return inside the loop so failed requests do not leak volume-server HTTP connections. * address review: defer encryptedStream close in SSE-C/SSE-KMS; drop chunks reassignment - Move encryptedStream.Close() to a nil-guarded defer at the top of createMultipartSSECDecryptedReaderDirect and createMultipartSSEKMSDecryptedReaderDirect so the stream is closed on every return path (including error returns from inside the per-chunk loop), mirroring the SSE-S3 helper. - In buildMultipartSSES3Reader, iterate sortedChunks directly instead of reassigning chunks = sortedChunks. |
||
|
|
fe1d7a404d |
fix(iam): substitute dynamic jwt:/saml:/oidc: claim variables in policies (#9217)
* fix(iam): expand arbitrary jwt:/saml:/oidc: claim variables in policies
The policy engine gated variable substitution on a fixed allowlist
(jwt:sub, jwt:iss, jwt:aud, jwt:preferred_username), so patterns like
arn:aws:s3:::softs/${jwt:project_path}/* were passed through as literals
and never matched the requested resource. Dynamic claims from OIDC
providers (e.g. GitLab CI's project_path / namespace_path) could not be
used to scope policies.
Allow any jwt:/saml:/oidc: prefixed variable to be substituted when the
claim is present in RequestContext. These values originate from a
cryptographically verified identity token (the STS session JWT or
federated assertion), and the claim names are controlled by the trusted
identity provider, so the dynamic prefix is safe. Missing claims keep
the placeholder intact so the statement still fails to match.
Numeric JWT claims (JSON-decoded as float64) are now stringified so
patterns like ${jwt:project_id} work the same as string claims.
Fixes #9214
* fix(iam): cover all integer widths in claim stringification
Address PR review: stringifyClaimValue only handled int/int32/int64 on
the signed side and nothing on the unsigned side, so int8, int16, uint,
uint8, uint16, uint32, and uint64 claim values fell through to the
default branch and the placeholder was left unsubstituted.
JSON's generic decoder produces float64/json.Number for numbers, but
RequestContext can also be populated from typed sources (custom
providers or internal code), so cover all common integer widths -
signed and unsigned - explicitly. Extend TestStringifyClaimValue to
assert each supported type.
|
||
|
|
8815844278 |
fix(s3api): correct SSE-S3 decryption key handling in multipart uploads (#9211)
* fix(s3api): correct SSE-S3 decryption key handling in multipart uploads * fix(s3api): preallocate readers and close on error in SSE-S3 direct path Address review feedback on createMultipartSSES3DecryptedReaderDirect: preallocate the readers slice with the known chunk count, and close any already-appended chunk readers on error returns so failed requests do not leak volume-server HTTP connections. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
93247d6de4 |
Export REST file_{read,write}_failures metrics on volume servers (#9215)
* Export gRPC `file_{read,write}_failures` metrics on volume servers.
Allows to track overall R/W errors in real time through Prometheus.
Will follow up with a PR for Seaweed's REST API.
* Export REST `file_{read,write}_failures` metrics on volume servers.
|
||
|
|
352ffdffe1 |
build(deps): bump rustls-webpki from 0.103.10 to 0.103.13 in /seaweed-volume (#9216)
build(deps): bump rustls-webpki in /seaweed-volume Bumps [rustls-webpki](https://github.com/rustls/webpki) from 0.103.10 to 0.103.13. - [Release notes](https://github.com/rustls/webpki/releases) - [Commits](https://github.com/rustls/webpki/compare/v/0.103.10...v/0.103.13) --- updated-dependencies: - dependency-name: rustls-webpki dependency-version: 0.103.13 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
29e14f89f1 |
fix(weed/command) address unhandled errors (#9208)
* fix(weed/command) address unhandled errors * fix(command): don't log graceful-shutdown sentinels; plug response-body leak - s3: Serve on unix socket treated http.ErrServerClosed as fatal; now excluded like the other Serve/ServeTLS paths in this file. - mq_agent, mq_broker: filter grpc.ErrServerStopped so clean shutdown doesn't log as an error. - worker_runtime: the added decodeErr early-continue skipped resp.Body.Close(); drop it since the existing check below already surfaces the decode error. - mount_std: the pre-mount Unmount commonly fails when nothing is mounted; demote to V(1) Infof. - fuse_std: tidy panic message to match sibling cases. * fix(mq_broker): filter grpc.ErrServerStopped on localhost listener The localhost listener goroutine logged any Serve error unconditionally, which includes grpc.ErrServerStopped on graceful shutdown. Match the main listener's check so clean stops don't surface as errors. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
88c2f3c34d |
fix(iam): accept bare "*" resource in PutUserPolicy (#9209) (#9210)
AWS IAM treats a bare "*" in a statement's Resource as "any resource",
but the embedded IAM resource parser required a 6-segment S3 ARN and
silently skipped anything else. With a policy like
{Action: "s3:*", Resource: "*"}, every resource was dropped and the
statement produced no actions, so PutUserPolicy rejected the document
with "no valid actions found in policy document".
Short-circuit Resource == "*" to the same full-wildcard path that
"arn:aws:s3:::*" already takes.
|
||
|
|
da2e90aefd |
fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9207)
* fix(mount): sanitize non-UTF-8 filenames; keep marshal errors per-request (#9139) A single file with invalid-UTF-8 bytes in its name (e.g. a GNOME Trash "partial" like \x10\x98=\\\x8a\x7f.trashinfo.9a51454f.partial) made every FUSE-initiated filer RPC fail with: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 and then produced an avalanche of "connection is closing" errors on unrelated LookupEntry / ReadDirAll / UpdateEntry calls, causing the volume-server QPS dips reported in #9139. Root cause is twofold: 1. Proto3 `string` fields require valid UTF-8, but the FUSE kernel passes raw name bytes. Create/Mknod/Mkdir/Unlink/Rmdir/Rename/Lookup/Link/ Symlink all forwarded those bytes directly into CreateEntryRequest.Name, DeleteEntryRequest.Name, StreamRenameEntryRequest.{Old,New}Name and Entry.Name. saveDataAsChunk also copied the FullPath into AssignVolumeRequest.Path unchecked. 2. When the marshal failed, shouldInvalidateConnection treated the resulting codes.Internal as a connection problem and dropped the shared cached ClientConn — canceling every other in-flight RPC on it. Fix: - Add sanitizeFuseName (strings.ToValidUTF8 with '?' replacement, matching util.FullPath.DirAndName) and make checkName return the sanitized name. Apply at every FUSE entry point that passes a name to the filer RPC, including Unlink/Rmdir (which did not previously call checkName) and both oldName/newName in Rename. Add a backstop scrub for AssignVolumeRequest.Path so async flush paths cannot reintroduce invalid bytes from a pre-sanitization cached FullPath. - In weed/pb.shouldInvalidateConnection, detect client-side marshal errors via the gRPC library's "error while marshaling" prefix and return false: the connection is healthy, only the request is bad. Refs: https://github.com/seaweedfs/seaweedfs/issues/9139#issuecomment-4301184231 * fix(mount,util): use '_' for invalid-UTF-8 replacement (URL-safe) Sanitized filenames flow downstream into HTTP URLs (volume-server uploads, filer HTTP API, S3/WebDAV gateways). '?' is the URL query-string delimiter and would split the path the first time the name lands in one, so swap every invalid-UTF-8 replacement to '_'. This covers the two pre-existing sites in weed/util/fullpath.go as well, keeping all paths sanitized the same way. * refactor(pb): detect client-side marshal errors via errors.As, not substring Replace the raw `strings.Contains(err.Error(), ...)` check with a type-based carve-out: use errors.As against the `GRPCStatus() *Status` interface to pull the original Status out of any fmt.Errorf("...: %w") wrapping, then match the library-owned "grpc:" prefix on that Status's Message. Why not errors.Is against a proto-level sentinel: gRPC's encode() collapses the inner proto error with "%v" (stringification) before wrapping it in a Status, so the original error type does not survive into the caller. The Status itself is the structural signal that does survive. Why not status.FromError: when the caller wraps the Status error with fmt.Errorf("...: %w", ...), status.FromError rewrites Status.Message with the full err.Error() of the outermost wrapper, which defeats a prefix check on the library-owned message. errors.As gives us the original Status whose Message is still verbatim from the gRPC library. A new test asserts that a plain errors.New("grpc: error while marshaling: …") — i.e. the same text attached to something that is NOT a gRPC status — does not short-circuit invalidation, so we never silently keep a cached connection alive based on a coincidental substring match. * refactor(util): centralize UTF-8 sanitization; add FullPath.Sanitized Addresses review feedback on PR #9207. Nitpick: every invalid-UTF-8 replacement across the codebase (DirAndName, Name, mount.sanitizeFuseName, the weedfs_write.go backstop) now goes through a single util.SanitizeUTF8Name helper, so the replacement char ('_' — URL-safe) is chosen in one place. Outside-diff: three proto fields took raw FullPath strings that could break marshaling if an entry ever carried invalid UTF-8 (CreateEntryRequest.Directory in Mkdir, DeleteEntryRequest.Directory in Unlink, AssignVolumeRequest.Path in command_fs_merge_volumes). The reviewer's suggested fix — using DirAndName() — would have silently changed Directory from parent to grandparent, because DirAndName sanitizes only the trailing component. Added FullPath.Sanitized(), which scrubs every component, and applied it at the three sites. Exposure is narrow in practice (FUSE-boundary sanitization and the gRPC-side isClientSideMarshalError carve-out already cover the #9139 cascade), but the defense-in-depth is cheap and consistent with the existing AssignVolume backstop. New tests in weed/util/fullpath_test.go document: - SanitizeUTF8Name: valid UTF-8 passes through unchanged; invalid bytes become '_' (not '?', which is URL-special). - FullPath.Sanitized: scrubs bytes in any component, not just the last. - FullPath.DirAndName: dir remains raw on purpose — callers needing a clean full path must use Sanitized(). The test pins this behavior so it is not accidentally "fixed" in a way that changes the (dir, name) semantics callers depend on. |
||
|
|
a0be40e070 | Merge branch 'master' of https://github.com/seaweedfs/seaweedfs | ||
|
|
b94ad82472 |
fix(test): stabilize ConcurrentLockContention; warn on coherence drift
TestPosixFileLocking/ConcurrentLockContention failed in CI (run 24857323067) with ENOENT when re-opening the file after all 8 workers had successfully written and closed. The 20s openWithRetry budget was exhausted, pointing at a real but unproven metaCache/parent-cache coherence issue in the mount under bursts of concurrent Release. Test: hold the initial fd open for the whole subtest; use it for the post-workers Sync() and the verification read. Workers still exercise the concurrent-flock invariant and per-record write correctness; the re-open path is no longer load-bearing. On Eventually failure, dump ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future recurrence has state to debug from. Drop the darwin-only ENOENT t.Skip branches that hid this same flake. Mount: in weedfs.lookupEntry, when returning ENOENT from the "parent cached but child missing" branch, log at Warningf instead of V(4) when the kernel is still tracking this path's inode. That combination is the smoking-gun signal for cache drift and is rare enough in normal use not to spam the log. |
||
|
|
cd5004cfbd |
build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 in /test/kafka (#9204)
build(deps): bump github.com/Azure/go-ntlmssp in /test/kafka Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
5cbcfd311c |
build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 (#9205)
Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
76f361fa77 |
fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202) (#9206)
* fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202) With `global.seaweedfs.enableSecurity=true` and the default `s3.httpsPort=0`, the chart was unconditionally passing `-cert.file` / `-key.file` to the S3 frontend. In `weed/command/s3.go`, when `tlsPrivateKey != ""` and `portHttps == 0`, the server promotes its main `-port` (8333 by default) into an HTTPS listener. The pod's readiness / liveness probes still use `scheme: HTTP`, so every kubelet probe produces http: TLS handshake error from <node-ip>:<port>: client sent an HTTP request to an HTTPS server in the pod log, as reported in #9202. `enableSecurity=true` is supposed to activate security.toml / gRPC mTLS, not silently flip the S3 HTTP port to HTTPS. Move the `seaweedfs.s3.tlsArgs` include inside the `if httpsPort` guard in all three templates that wire up an S3 frontend (standalone S3 deployment, filer with S3 sub-server, all-in-one deployment). The TLS cert args are now emitted only when the user explicitly opts into an HTTPS port; the main `-port` stays HTTP so probes work. Also add a regression test to `.github/workflows/helm_ci.yml` that renders all three templates with and without `httpsPort` and asserts the cert/key/ `-port.https` args are emitted together or not at all. * test(helm): add bash -n parse check to the S3 TLS-gating regression test Addresses gemini-code-assist review comment on #9206 flagging a potential "dangling backslash" shell-syntax risk in the rendered all-in-one command script when httpsPort is set but most S3/SFTP args are defaulted off. In practice bash -n accepts a trailing `\<newline><EOF>` (it's line-continuation to an empty line), so no current rendering is broken. Locking that contract down in CI so a future helper change that leaves a dangling backslash — or any other shell-syntax regression in the rendered command — fails loudly instead of silently shipping broken pods. |
||
|
|
3d39324bc1 |
fix(nfs): make Linux mount -t nfs work without client workaround (#9199) (#9201)
* fix(nfs): make Linux `mount -t nfs` work without client-side workaround (#9199) The upstream go-nfs library serves NFSv3 + MOUNT on a single TCP port and does not register with portmap. Linux mount.nfs queries portmap on port 111 first, so the plain `mount -t nfs host:/export /mnt` form failed with "portmap query failed" / "requested NFS version or transport protocol is not supported" against a default `weed nfs` deployment. - Add a minimal PORTMAP v2 responder (weed/server/nfs/portmap.go) with TCP+UDP listeners implementing PMAP_NULL, PMAP_GETPORT, PMAP_DUMP, and proper PROG_MISMATCH / PROG_UNAVAIL / PROC_UNAVAIL responses. Advertises NFS v3 TCP and MOUNT v3 TCP at the configured NFS port. - New CLI flag `-portmap.bind` (empty, disabled by default) to opt into the responder. Binding port 111 requires root or CAP_NET_BIND_SERVICE and must not collide with a system rpcbind. - Extended `weed nfs -h` help with the two supported ways to mount from Linux (client-side portmap bypass, or server-side `-portmap.bind`). - Startup log now prints a copy-pasteable mount command tailored to whether portmap is enabled. Unit tests cover RPC/XDR parsing, accept-stat paths, and a TCP+UDP round-trip against the real listener. Verified in a privileged Debian 12 container: with `-portmap.bind=0.0.0.0` the exact command from #9199 (`mount -t nfs -o nfsvers=3,nolock host:/export /mnt`) now succeeds and both read and write work. * fix(nfs): harden portmap responder per review feedback (#9201) Addresses three review findings on the portmap responder: - parseRPCCall: validate opaque_auth length against the record limit before applying the XDR 4-byte padding, so a near-uint32-max authLen can no longer overflow (authLen + 3) and bypass the bounds check. (gemini-code-assist) - serveTCP/Close: track live TCP connections and evict them on Close() so shutdown does not block on idle clients waiting for the read deadline to trip. serveTCP also no longer tears the listener down on a non-fatal Accept error (e.g. EMFILE); it logs and retries after a small back-off. Replaces the atomic.Bool closed flag with a mutex-guarded one so closed, conns, and the shutdown transition stay consistent. (coderabbit, minor) - handleTCPConn: apply per-IO read/write deadlines (30s idle, 10s in-flight) so a peer that opens the privileged port 111 and stalls cannot pin a goroutine indefinitely. (coderabbit, major) Adds TestPortmapServer_CloseEvictsIdleTCPConn, which holds a TCP connection idle and asserts Close() returns within 2s (well under the 30s idle deadline) and that the client sees the eviction. All existing tests still pass, including under -race. * fix(nfs): keep portmap UDP responder alive on transient read errors (#9201) - serveUDP: on a non-shutdown ReadFromUDP error, log, back off, and continue instead of returning. Matches how serveTCP now treats non-fatal Accept errors so a transient network blip doesn't take UDP portmap down until restart. (coderabbit) - Rename portmapAcceptBackoff -> portmapRetryBackoff now that both paths use it. - pmapProcDump: fix the pre-allocation capacity to match the actual encoding (20 bytes per entry + 4-byte terminator), replacing the old over-estimate of 24 per entry. No behavior change; just documents intent. (coderabbit nit) * docs(nfs): clarify encodeAcceptedReply body semantics (#9201) The prior comment said body is "nil when the accept_stat is itself an error", which was misleading: the PROG_MISMATCH branch already passes an 8-byte mismatch_info body. Rewrite to enumerate which error accept_stat values omit the body and call out PROG_MISMATCH as the exception, referencing RFC 5531 §9. Comment-only. (coderabbit nit) * fix(nfs): make portmap retry backoff interruptible by Close() (#9201) serveTCP and serveUDP both sleep portmapRetryBackoff (50ms) after a non-fatal listener error. If Close() races in during that sleep, the goroutine can't be interrupted, so Close() has to wait out the remaining backoff before wg.Wait() returns. Add a done channel that Close() closes once, and replace both time.Sleep calls with a select on ps.done + time.After. The window was tiny in practice but the select makes shutdown strictly bounded by Close()'s own work. (coderabbit nit) |
||
|
|
20f4fd9985 |
fix(storage): use ceil division for EC shard slots in maxVolumeCount (#9196)
* fix(storage): use ceil division for EC shard slots in maxVolumeCount * fix(topology): use ceil division for EC shard slots consistently Applies the same ceiling-division formula used in store.go to the four remaining master-side sites that computed volume-slots consumed by EC shards with off-by-one approximations: - disk.go ToDiskInfo / Disk.ToDiskInfo used (n+1)/d, which under-counts slots for non-multiples of DataShardsCount, over-reporting FreeVolumeCount. - DiskUsageCounts.FreeSpace and NodeImpl.AvailableSpaceFor subtracted n/d + 1, which over-counts slots at multiples of DataShardsCount, under-reporting free space (and suppressing volume growth on nodes that still had room). All four now use (n + DataShardsCount - 1) / DataShardsCount, matching store.go:393, store.go:810, and command_ec_decode.go:422. * refactor(topology): extract ecShardSlots helper Deduplicates the (n + DataShardsCount - 1) / DataShardsCount ceiling expression now used by ToDiskInfo, DiskUsageCounts.FreeSpace, Disk.ToDiskInfo, and AvailableSpaceFor. Addresses PR review feedback. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
0fcd5173be |
fix(admin): use basePath for API fetches when urlPrefix is set (#9197)
* fix(admin): use basePath for API fetches when urlPrefix is set * fix(admin): drop duplicate iam-utils script on Groups page * fix(admin): route topics page fetches through basePath The Topics page missed two fetch() calls that still used root-relative URLs, so create-topic and view-details still broke when -urlPrefix was set. --------- Co-authored-by: Maksim Babkou <maksim.babkou@innovatrics.com> Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
749430dceb |
fix(filer.meta.tail): include extended metadata in Elasticsearch docs (#9200)
* fix(filer.meta.tail): include extended metadata in Elasticsearch docs The -es sink flattened only the FUSE attributes, so xattrs (including S3 user metadata like X-Amz-Meta-*) never reached Elasticsearch. Add an Extended field and convert map[string][]byte to map[string]string so the values index as text; non-UTF-8 values fall back to base64. Addresses #9190 follow-up. * fix(filer.meta.tail): prefix base64-encoded extended values with "base64:" Addresses review feedback: a plain UTF-8 xattr and a base64 fallback are otherwise indistinguishable to a consumer reading the ES doc. |
||
|
|
036191c78a | Merge branch 'master' of https://github.com/seaweedfs/seaweedfs | ||
|
|
34b236acfa |
test(s3api): look up NewUser by name in CreateAccessKey collision test
The memory credential store backs LoadConfiguration with a map, so the identity order is not stable across a save/load round trip. Indexing Identities[1] intermittently pointed at the owner identity and produced a spurious credential leak. |
||
|
|
1a7ab2ea82 |
fix(upload): keep Content-MD5 on 204 unchanged writes (#9198)
Return Content-MD5 in the volume unchanged-write response and read it in the uploader 204 path so multipart chunk ETag metadata is preserved. |
||
|
|
ae93f87a46 | adjust logo | ||
|
|
6e950e0e7e |
docs(note): add production-setup slide deck
Marp-based markdown deck walking through the three-layer production topology (masters, volumes, filers + DB, S3 gateways, admin + workers) plus an erasure-coding note. Uses the object-store-layout diagram on the overview slide. Makefile renders PDF/HTML/PPTX via marp-cli. |
||
|
|
ede766645a |
build(deps): bump github.com/jackc/pgx/v5 from 5.9.0 to 5.9.2 (#9194)
Bumps [github.com/jackc/pgx/v5](https://github.com/jackc/pgx) from 5.9.0 to 5.9.2. - [Changelog](https://github.com/jackc/pgx/blob/master/CHANGELOG.md) - [Commits](https://github.com/jackc/pgx/compare/v5.9.0...v5.9.2) --- updated-dependencies: - dependency-name: github.com/jackc/pgx/v5 dependency-version: 5.9.2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
592d6d6021 |
fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) (#9193)
* fix(filer/remote): keep re-cache work alive past caller cancellation (#9174) For multi-GB remote blobs, doCacheRemoteObjectToLocalCluster cannot finish before the S3 gateway's initial cache wait elapses. When it does, the gRPC ctx cancellation cascades into the filer's chunk downloads, the error path calls DeleteUncommittedChunks on every chunk already written, and the next retry starts over. boto3 splitting the GET into concurrent ranges (or any client tear-down on first failure) shortens the window between retries, so the loop never converges. Detach the caller's ctx with context.WithoutCancel before invoking the singleflight work so the download runs to completion regardless of client cancellations. Subsequent waiters — via the in-flight singleflight, or a fresh retry landing after completion — observe the cached entry and stream normally. Same detach pattern is used in filer_server_handlers_write.go:53 and volume_server_handlers_write.go:51. * simplify rationale comment * switch to DoChan so handler can return on caller cancel Do keeps the handler goroutine blocked for the full detached download even after the client is gone. DoChan lets the handler select on ctx.Done() and exit immediately; the singleflight goroutine continues on bgCtx and the next request either joins it or finds the entry cached. |
||
|
|
f438cc3544 |
fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard ReceiveFile for an EC shard calls os.Create(filePath) which opens the path with O_TRUNC. When the shard is already mounted, the in-memory EcVolume holds a file descriptor against the same inode, so a second ReceiveFile call for the same (volume, shard) truncates the live shard file beneath the reader. Reproducer: generate and mount shard 0 for a populated volume, capture the on-disk size, then send a smaller payload for the same shard via ReceiveFile. The current handler accepts the overwrite and leaves the shard truncated in place; this test pins that behavior. When the fix lands the server should reject (or rename-then-swap) and this test must be inverted. * fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard ReceiveFile used os.Create on EC shard paths, which opens with O_TRUNC and truncates in place. When an EC shard is already mounted, the in-memory EcVolume holds file descriptors against the same inodes, so the truncation corrupts the live shard beneath any ongoing read. On retries of an EC task this produced the "missing parts" class of errors in #9184. The fix rejects any ReceiveFile for an EC volume that currently has mounted shards. The caller must unmount before retrying — silent truncation is never an acceptable outcome. Non-EC writes and ReceiveFile for volumes that have never been mounted on this server continue to work as before. Tests: - TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard, attempts an overwrite, asserts the error response and that the on-disk file and live reads are undisturbed. - TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case contract that a first write to a target still succeeds. * fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard Mirror the Go-side change: reject receive_file for any EC volume that currently has mounted shards on this server. std::fs::File::create truncates in place and the in-memory EcVolume holds fds on the same inodes, so an overwrite would corrupt live readers. |
||
|
|
628363c4a6 |
fix(erasure_coding): surface replica delete failures from EC task (#9184) (#9187)
* test(erasure_coding): reproduce #9184 deleteOriginalVolume swallowing errors ErasureCodingTask.deleteOriginalVolume logs a warning when any replica VolumeDelete fails and then returns nil, so the EC task reports success to the admin even when a source replica survives. That stale replica lets a later detection scan re-propose the same volume and, once retried, drives the mounted-shard-truncation corruption that issue 9184 also describes. Reproducer: wire one reachable replica (succeeds) and one unreachable replica (fails) and assert the function currently returns nil. After the fix the function must surface the replica failure so the task is retried rather than marked done, and this test needs to be inverted. * fix(erasure_coding): surface replica delete failures from EC task ErasureCodingTask.deleteOriginalVolume previously logged a warning and returned nil when any VolumeDelete against a source replica failed. The EC task therefore reported overall success to the admin even when a source replica stayed on disk, which let a later detection scan propose a duplicate EC encoding of the same volume. The retry then walked the ReceiveFile path against servers that already had mounted EC shards for the volume, truncating the live shard files in place (the other half of #9184). This change returns an error describing the per-replica failures after the best-effort delete pass, so the task is marked failed instead of silently moving on. Successful deletes are still applied (per-replica progress is preserved); only the final return changes. When combined with the ReceiveFile mount-safety check, a stuck original replica now produces loud, actionable failures instead of silent corruption. Tests: - TestDeleteOriginalVolumeSurfacesReplicaFailures: asserts an error is returned and names the unreachable replica, while the reachable replica still gets deleted. - TestDeleteOriginalVolumeSucceedsWhenAllReplicasReachable: pins the happy path. |
||
|
|
8ae07e2a3f | chore(weed/filer/redis3): prune unused test functions (#9192) | ||
|
|
c6302fcb54 |
feat(iam): allow caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey (#9172)
* feat(iam): support caller-supplied AccessKeyId and SecretAccessKey in CreateAccessKey Both IAM implementations (standalone and embedded) now check for caller-supplied AccessKeyId and SecretAccessKey form parameters before generating random credentials. If provided, the caller-supplied values are used. If empty, random keys are generated as before. This enables programmatic identity provisioning where the caller needs to control the S3 credentials. Backward-compatible: no behavior change for callers that omit these parameters. * refactor(iam): extract shared caller-supplied credential validation Move the AccessKeyId/SecretAccessKey format checks and the in-memory collision scan into weed/iam so the standalone IAM API, the embedded IAM in s3api, and the admin dashboard all enforce the same rules. - ValidateCallerSuppliedAccessKeyId: 4-128 alphanumeric (rejects SigV4-breaking characters like '/' and '='). - ValidateCallerSuppliedSecretAccessKey: 8-128 chars. - FindAccessKeyOwner: scans identities and service accounts and returns the owning entity type + name for debug logging, without exposing the owner in caller-facing error messages. The admin dashboard previously only length-checked caller-supplied keys; it now enforces the same alphanumeric rule, which matches what SigV4 actually accepts anyway. * fix(iam): reject partial caller-supplied AccessKeyId/SecretAccessKey Previously, if a caller supplied only one of AccessKeyId or SecretAccessKey, CreateAccessKey logged a warning and auto-generated the missing half. That silently returns a credential the caller did not fully choose, which is surprising and easy to miss in a response they expected to echo back their input. Return ErrCodeInvalidInputException instead: either both are supplied or neither is. Updates the mixed-supply tests in weed/iamapi and weed/s3api to assert the rejection. * chore(iam): centralize and broaden sensitive form redaction DoActions and ExecuteAction both had an inline loop that redacted SecretAccessKey from their debug-level request log. Replace the two copies with iam.RedactSensitiveFormValues, backed by an explicit sensitive-keys set. The set now also covers Password, OldPassword, NewPassword, PrivateKey, and SessionToken. None of those parameters are used by today's IAM actions, but naming them here makes the log-safety guarantee survive future additions such as LoginProfile / STS. * test(iam): cover the upper length bound for CreateAccessKey TestCreateAccessKeyBoundary / TestEmbeddedIamCreateAccessKeyBoundary only exercised the 3/4-char lower edge. Add cases for 128 (accepted) and 129 (rejected) for AccessKeyId, plus 7 / 128 / 129-char cases for SecretAccessKey, so both ends of the validator are locked in at the handler level (the pure validators in weed/iam already cover this). * fix(s3api/iam): verify user existence before RNG and collision scan In the embedded IAM CreateAccessKey, the user lookup ran last: a request for a non-existent user still walked the whole identity / service-account list for collisions and, if no caller-supplied keys were present, generated fresh random credentials with crypto/rand before the NoSuchEntity error finally surfaced. Reorder: validate inputs, then find the target identity, then do the collision scan, then generate keys. A missing user now fails fast and consumes no entropy, and the handler returns NoSuchEntity instead of a misleading EntityAlreadyExists when both the user is missing and the supplied AccessKeyId happens to collide with another identity's key. Add TestEmbeddedIamCreateAccessKeyRejectsMissingUser to lock in the "no mutation on unknown user" guarantee. The standalone iamapi CreateAccessKey intentionally keeps its pre-existing "create-or-attach" semantics where a missing user is implicitly provisioned — that is a behavior change beyond the scope of this PR. * test(iam): tighten collision leak assertion and cover 8-char secret - Rename the collision-owner identity in TestCreateAccessKeyRejectsCollision (both iamapi and the embedded s3api test) from "existing" / "ExistingUser" to "ownerAlpha". The old assert.NotContains check was effectively a no-op because the error message never contained those substrings; a distinctive name shared with no part of the expected error body makes the leak guard actually meaningful if the wording ever drifts. The embedded test also adds a NotContains assertion that was previously missing entirely. - Add an explicit 8-char SecretAccessKey pass case to both boundary tests so the lower edge of the validator is locked in at the handler level alongside the 7 / 128 / 129-char cases. * fix(iamapi): enforce both-or-none before the collision lookup In the standalone IAM CreateAccessKey, FindAccessKeyOwner ran before the partial-credential check. If a caller supplied only AccessKeyId and it happened to collide with an existing key, the response was EntityAlreadyExists instead of the more fundamental InvalidInput for omitting SecretAccessKey — wrong error class, and leaked the fact that the probed key is already in use. Swap the order: validate both-or-none first, then do the collision scan. Matches the embedded IAM path and AWS behavior. Add a case to TestCreateAccessKeyRejectsPartialSupply that combines partial supply with a collision to lock in the ordering. * fix(admin): reject partial caller-supplied AccessKey/SecretKey The admin dashboard path silently generated the missing half when a caller supplied only one of AccessKey or SecretKey, while the IAM API and embedded IAM paths now reject this. Align the three: if exactly one is provided, return ErrInvalidInput. Also simplifies the generator block — either both are provided or neither is, so there is no mixed path to handle. * test(s3api/iam): guard dereferences in caller-supplied-keys test TestEmbeddedIamCreateAccessKeyWithCallerSuppliedKeys dereferenced *AccessKeyId/*SecretAccessKey/*UserName and indexed Identities[0].Credentials[0] without first verifying shape, so any future regression that returns a partial response or skips the config mutation would panic mid-assertion instead of failing with a clear message. Add require.NotNil on the response pointers and require.Len on the identities/credentials slices before the asserts. * test(iamapi): exercise the service-account branch of the collision check FindAccessKeyOwner scans both Identities[*].Credentials and ServiceAccounts[*].Credential, but TestCreateAccessKeyRejectsCollision only covered the identity branch. Split the test into two subtests — one per branch — so a future refactor that drops the service-account scan (or mutates the existing credential) trips a failure. Also asserts the existing service-account credential is not mutated and no credential is attached to the target identity on rejection. * test(iam): isolate 129-char secret subcase from prior credential In both TestCreateAccessKeyBoundary (iamapi) and TestEmbeddedIamCreateAccessKeyBoundary (s3api), the 129-char SecretAccessKey subcase reused the "validkey" AccessKeyId that the preceding 8-char subcase had just persisted into the config. The test still asserted the right outcome because the handler validates secret length before running the collision scan — but if the two checks ever swap, the subcase would pass (or fail) for the wrong reason. Reset the in-memory credentials before the 129-char subcase, matching the pattern already used by the 3/128/129-char AccessKeyId and 7-char secret subcases. No behavior change; purely test isolation. --------- Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
fff243d463 |
Export gRPC file_{read,write}_failures metrics on volume servers. (#9177)
Allows to track overall R/W errors in real time through Prometheus. Will follow up with a PR for Seaweed's REST API. Co-authored-by: Lisandro Pin <lisandro.pin@proton.ch> |
||
|
|
cb882ced46 |
fix(test): retry ENOENT in fcntl lock subprocess helper
TestPosixFileLocking/FcntlReleaseOnClose was flaky because the subprocess spawned by startLockHolder occasionally saw ENOENT when opening a file the parent had just created on the FUSE mount. Retry on ENOENT (matching the existing openWithRetry pattern used in testConcurrentLockContention) so the subprocess waits for the mount's dentry state to propagate before reporting the lock acquire as failed. |
||
|
|
c4e1885053 |
fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug The plugin-worker EC task sends shards via ReceiveFile, which picks Locations[0] as the target directory regardless of the admin planner's TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there is no wire channel to honor the plan. Adds StartSingleVolumeClusterWithDataDirs to the integration framework so tests can launch a volume server with N data directories. The new repro asserts the current (buggy) behavior: sending three distinct EC shards via ReceiveFile leaves all three files in dir[0] and the other dirs empty. When the fix adds disk_id to ReceiveFileInfo, this assertion must flip to verify the planned placement is respected. * fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement Before this change, VolumeServer.ReceiveFile for EC shards always selected the first HDD location (Locations[0]). The plugin-worker EC task had no way to pass the admin planner's per-shard disk assignment — ReceiveFileInfo carried no disk_id field — so every received EC shard piled onto a single disk per destination server. On multi-disk servers this caused uneven load (one disk absorbing all EC shard I/O), frequent ENOSPC retries, and a growing EC backlog under sustained ingest (see issue #9184). Changes: - proto: add disk_id to ReceiveFileInfo, mirroring VolumeEcShardsCopyRequest.disk_id. - worker: DistributeEcShards tracks the planner-assigned disk per shard; sendShardFileToDestination forwards that disk id. Metadata files (ecx/ecj/vif) inherit the disk of the first data shard targeting the same node so they land next to the shards. - server: ReceiveFile honors disk_id when > 0 with bounds validation; disk_id=0 (unset) falls back to the same auto-selection pattern as VolumeEcShardsCopy (prefer disk that already has shards for this volume, then any HDD with free space, then any location with free space). Tests updated: - TestReceiveFileEcShardHonorsDiskID asserts three shards sent with disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively. - TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range disk_id rejection path. * fix(volume-rust): honor disk_id in ReceiveFile for EC shards Mirror the Go-side change: when disk_id > 0 place the EC shard on the requested disk; when unset, auto-select with the same preference order as volume_ec_shards_copy (disk already holding shards, then any HDD, then any disk). * fix(volume): compare disk_id as uint32 to avoid 32-bit overflow On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a high-bit uint32 to a negative int, bypassing the bounds check before the index operation. Compare in the uint32 domain instead. * test(ec): fail invalid-disk_id test on transport error Previously a transport-level error from CloseAndRecv silently passed the test by returning early, masking any real gRPC failure. Fail loudly so only the structured ReceiveFileResponse rejection path counts as a pass. * docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test Documents the load-bearing assumption that shards are never mounted in this test, so loc.FindEcVolume always returns false and auto-select falls through to the first HDD. Saves future readers from re-deriving the expected directory for the DiskId=0 case. * fix(test): preserve baseDir/volume path for single-dir clusters StartSingleVolumeClusterWithDataDirs started naming the data directory volume0 even in the dataDirCount=1 case, which broke Scrub tests that reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile / CorruptEcxFile. Keep the legacy name for single-dir clusters; only use the indexed "volumeN" layout when multiple disks are requested. |
||
|
|
0f5e99f423 |
fix(filer.meta.tail): fail fast when -es is used without elastic build tag (#9191)
fix(filer.meta.tail): error instead of silently dropping events when -es is used without elastic build tag The default chrislusf/seaweedfs image builds without the `elastic` build tag, so sendToElasticSearchFunc was a no-op that returned a function discarding every event. Users passing -es saw the subscription wire up in filer logs but nothing ever reached Elasticsearch. Return an error explaining the binary wasn't built with ES support and pointing at the build flag. The caller already prints the error and exits, so users now get an immediate, actionable message. Fixes #9190 |
||
|
|
1220468a33 |
build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 in /test/kafka (#9189)
build(deps): bump github.com/rclone/rclone in /test/kafka Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5. - [Release notes](https://github.com/rclone/rclone/releases) - [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md) - [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5) --- updated-dependencies: - dependency-name: github.com/rclone/rclone dependency-version: 1.73.5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
be9996962d |
fix(test): avoid port collision between master gRPC and volume ports
AllocateMiniPorts(1) reserved masterPort and masterPort+GrpcPortOffset by holding listeners open, but closed them on return. The subsequent AllocatePorts call bound 127.0.0.1:0, so the OS could immediately reuse the just-released mini gRPC port as a volume port — causing the volume server to fail at bind time with "address already in use". Introduce AllocatePortSet(miniCount, regularCount) that holds every listener open until the full set is chosen, and route the five volume test cluster builders through it. |
||
|
|
96e5fea08e |
test(catalog_spark): bound weed shell invocation with 30s timeout
createTableBucket ran `weed shell` via exec.Command with no deadline. When the shell's first command retries on a transient master connection blip, the trailing `exit` on stdin never gets processed and the subprocess blocks until the outer 20m `go test` timeout fires — the surfacing symptom is a flaky 20m panic with no diagnostic output. Wrap the invocation in exec.CommandContext with a 30s timeout, matching the existing pattern in test/s3tables/catalog_risingwave/setup_test.go. |
||
|
|
7f67995c24 |
chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was gated behind -mount.p2p (default true). Idle cost is negligible — a tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the surface area. Removes the flag from weed filer, weed server (-filer.mount.p2p), and weed mini, and always constructs the registry in NewFilerServer. Also drops the now-dead nil guards in MountRegister/MountList/sweeper and the TestMountRegister_DisabledIsNoOp case. |
||
|
|
9ae905e456 |
feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin S3 and filer already use a refreshing pemfile provider for their HTTPS cert, so rotated certificates (e.g. from k8s cert-manager) are picked up without a restart. Master, volume, webdav, and admin, however, passed cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at startup — rotating those certs required a pod restart. Add a small helper NewReloadingServerCertificate in weed/security that wraps pemfile.Provider and returns a tls.Config.GetCertificate closure, then wire it into the four remaining HTTPS entry points. httpdown now also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates but CertFile/KeyFile are empty, so volume server can pre-populate TLSConfig. A unit test exercises the rotation path (write cert, rotate on disk, assert the callback returns the new cert) with a short refresh window. * refactor(security): route filer/s3 HTTPS through the shared cert reloader Before: filer.go and s3.go each kept a *certprovider.Provider on the options struct plus a duplicated GetCertificateWithUpdate method. Both were loading pemfile themselves. Behaviorally they already reloaded, but the logic was duplicated two ways and neither path was shared with the newly-added master/volume/webdav/admin wiring. After: both use security.NewReloadingServerCertificate like the other servers. The per-struct certProvider field and GetCertificateWithUpdate method are removed, along with the now-unused certprovider and pemfile imports. Net: -32 lines, one code path for all HTTPS cert reloading. No behavior change — the refresh window, cache, and handshake contract are identical (the helper wraps the same pemfile.NewProvider). * feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc The HTTP client in weed/util/http/client loaded the mTLS client cert once at startup via tls.LoadX509KeyPair. That left every long-lived HTTPS client process (weed mount, backup, filer.copy, filer→volume, s3→filer/volume) unable to pick up a rotated client cert without a restart — even though the same cert-manager setup was already rotating the server side fine. Swap the client cert loader for a tls.Config.GetClientCertificate callback backed by the same refreshing pemfile provider. New TLS handshakes pick up the rotated cert; in-flight pooled connections keep their old cert and drop as normal transport churn happens. To keep this reusable from both server and client TLS code without an import cycle (weed/security already imports weed/util/http/client for LoadHTTPClientFromFile), extract the pemfile wrapper into a new weed/security/certreload subpackage. weed/security keeps its thin NewReloadingServerCertificate wrapper. The existing unit test moves with the implementation. gRPC mTLS was already handled by security.LoadServerTLS / LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ agent, Kafka gateway, and FUSE mount control plane are gRPC-only and therefore already rotate. CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted as a known limitation in the wiki. * fix(security): address PR review feedback on cert reloader Bots (gemini-code-assist + coderabbit) flagged three real issues and a couple of nits. Addressing them here: 1. KeyMaterial used context.Background(). The grpc pemfile provider's KeyMaterial blocks until material arrives or the context deadline expires; with Background() a slow disk could hang the TLS handshake indefinitely. Switched both the server and client callbacks to use hello.Context() / cri.Context() so a stuck read is bounded by the handshake timeout. 2. Admin server loaded TLS inside the serve goroutine. If the cert was bad, the goroutine returned but startAdminServer kept blocking on <-ctx.Done() with no listener, making the process look healthy with nothing bound. Moved TLS setup to run before the goroutine starts and propagate errors via fmt.Errorf; also captures the provider and defers Close(). 3. HTTP client discarded the certprovider.Provider from NewClientGetCertificate. That leaked the refresh goroutine, and NewHttpClientWithTLS had a worse case where a CA-file failure after provider creation orphaned the provider entirely. Added a certProvider field and a Close() method on HTTPClient, and made the constructors close the provider on subsequent error paths. 4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain the provider. filer and webdav run ServeTLS synchronously, so a plain defer works. master/volume/s3 dispatch goroutines and return while the server keeps running, so they hook Close() into grace.OnInterrupt. 5. Test: certreload_test now tolerates transient read/parse errors during file rotation (writeSelfSigned rewrites cert before key) and reports the last error only if the deadline expires. No user-visible behavior change for the happy path. * test(tls): add end-to-end HTTPS cert rotation integration test Boots a real `weed master` with HTTPS enabled, captures the leaf cert served at TLS handshake time, atomically rewrites the cert/key files on disk (the same rename-in-place pattern kubelet does when it swaps a cert-manager Secret), and asserts that a subsequent TLS handshake observes the rotated leaf — with no process restart, no SIGHUP, no reloader sidecar. Verifies the full path: on-disk change → pemfile refresh tick → provider.KeyMaterial → tls.Config.GetCertificate → server TLS handshake. Runtime is ~1s by exposing the reloader's refresh window as an env var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the test. The same env var is user-facing — documented in the wiki — so operators running short-lived certs (Vault, cert-manager with duration: 24h, etc.) can tighten the rotation-pickup window without a rebuild. Defaults to 5h to preserve prior behavior. security.CredRefreshingInterval is kept for API compatibility but now aliases certreload.DefaultRefreshInterval so the same env controls both gRPC mTLS and HTTPS reload. * ci(tls): wire the TLS rotation integration test into GitHub Actions Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner, Go 1.25, build weed, run `go test` in test/tls_rotation, upload master logs on failure. 10-minute job timeout; the test itself finishes in about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms inside the test. Runs on every push to master and on every PR to master. * fix(tls): address follow-up PR review comments Three new comments on the integration test + volume shutdown path: 1. Test: peekServerCert was swallowing every dial/handshake error, which meant waitForCert's "last err: <nil>" fatal message lost all diagnostic value. Thread errors back through: peekServerCert now returns (*x509.Certificate, error), and waitForCert records the latest error so a CI flake points at the actual cause (master didn't come up, handshake rejected, CA pool mismatch, etc.). 2. Test: set HOME=<tempdir> on the master subprocess. Viper today registers the literal path "$HOME/.seaweedfs" without env expansion, so a developer's ~/.seaweedfs/security.toml is accidentally invisible — the test was relying on that. Pinning HOME is belt-and-braces against a future viper upgrade that does expand env vars. 3. volume.go: startClusterHttpService's provider close was registered via grace.OnInterrupt, which fires on SIGTERM but NOT on the v.shutdownCtx.Done() path used by mini / integration tests. The pemfile refresh goroutine leaked in that shutdown path. Now the helper returns a close func and the caller invokes it on BOTH shutdown paths for parity. Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the ast-grep static-analysis nit — zero-risk since the pool only trusts our in-memory CA. Test runs clean 3/3. |
||
|
|
7364f148bd |
fix(s3/shell): factor EC volumes into bucket size metrics and collection.list (#9182)
* fix(s3/shell): include EC volumes in bucket size metrics and collection.list S3 bucket size metrics exported to Prometheus (and fed through stats.UpdateBucketSizeMetrics) are computed by collectCollectionInfoFromTopology, which only walked diskInfo.VolumeInfos. As soon as a volume was encoded to EC it dropped out of every aggregate, so Grafana showed bucket sizes shrinking while physical disk usage kept climbing. The shell helper collectCollectionInfo — used by collection.list and s3.bucket.quota.enforce — had the same gap, with the EC branch left as a commented-out TODO. Fold EC shards into both paths using the same approach the admin dashboard already uses (PR #9093): - PhysicalSize / Size sum across shard holders: EC shards are node-local (not replicas), so per-node TotalSize() and MinusParityShards().TotalSize() sum to the whole-volume physical and logical sizes respectively. - FileCount is deduped via max across reporters (every shard holder reports the same .ecx count; a slow node with a not-yet-loaded .ecx reports 0 and must not pin the aggregate). - DeleteCount is summed (each delete tombstones exactly one node's .ecj). - VolumeCount increments once per unique EC volume id. Adds regression tests covering pure-EC, mixed regular+EC, and the slow-reporter FileCount dedupe case. Refs #9086 * Address PR review feedback: EC size helpers, composite key, VolumeCount dedupe - Add EcShardsTotalSize / EcShardsDataSize helpers in the erasure_coding package that walk the shard bitmap directly instead of materializing a ShardsInfo and copying it via MinusParityShards(). Keeps the DataShardsCount dependency encapsulated in one place and avoids the per-shard allocation/copy overhead in the metrics hot path. - Switch shell collectCollectionInfo ecVolumes map to a composite {collection, volumeId} key, matching the bucket_size_metrics collector and defending against any cross-collection volume id aliasing. - Dedupe VolumeCount in shell addToCollection by volume id so regular volumes aren't counted once per replica presence. Aligns the shell's collection.list output with the S3 metrics collector and the EC branch, all of which now report logical volume counts. - Add unit tests for the new helpers and for the regular-volume VolumeCount dedupe. * Parameterize EcShardsDataSize with dataShards for custom EC ratios Add a dataShards parameter to EcShardsDataSize so forks with per-volume ratio metadata (e.g. the enterprise data_shards field carried on an extended VolumeEcShardInformationMessage) can pass the configured value and get accurate logical sizes under custom EC policies like 6+3 or 16+6. Passing 0 or a negative value falls back to the upstream DataShardsCount default, which is correct for the fixed 10+4 layout — so OSS callers in s3api and shell pass 0 and keep their current behavior. Added table cases covering the custom 6+3 and 16+6 paths so the parameterization is pinned by tests. |
||
|
|
05f0f7e1c9 |
fix(remote-storage/azure): fix re-cache of large remote blobs (#9174) (#9179)
* fix(remote-storage/azure): fix re-cache of large remote blobs (#9174) ReadFile issued a single DownloadStream for the entire requested byte range, so a large re-cache (e.g. a 2 GB blob re-fetched on S3 GET after eviction) had to move the whole range over one HTTP connection within the SDK's per-try TryTimeout. TryTimeout was set to 10s "to fail faster on auth issues", which silently broke large reads: every attempt hit context deadline, the filer's CacheRemoteObjectToLocalCluster returned an error, and the S3 gateway surfaced it to clients as an ETag-mismatch on the partial response. Switch ReadFile to the SDK's parallel block downloader (DownloadBuffer with 4 MiB blocks) so each individual HTTP GET is small enough to complete well inside TryTimeout. Expose the parallelism through the RemoteStorageConcurrentReader interface so callers (FetchAndWriteNeedle) can tune it per request, matching the S3 backend. Also restore TryTimeout to 60s. With parallel block transfers it is no longer on the critical path for large-blob bodies, but it gives metadata operations and any non-parallel paths more headroom on slow links. * fix(remote-storage/azure): guard ReadFileWithConcurrency inputs Addresses review feedback on PR #9179: - Reject negative size up front instead of panicking inside make([]byte, size). - Clamp concurrency to math.MaxUint16 before casting to uint16 so an oversized caller value can't silently wrap to a small number. * fix(remote-storage/azure): reject negative offset in ReadFileWithConcurrency Addresses review feedback on PR #9179. Without this guard, a negative offset combined with size == 0 would compute `size = ContentLength - offset` -> a value larger than the blob, then attempt to allocate and download past the end. |
||
|
|
45ba71a189 |
fix(volume): write state.pb into a real dir when -dir.idx is unset (#9178)
* fix(volume): write state.pb into a real dir when -dir.idx is unset When -dir.idx is not set, NewStore passed the empty default to NewState, making the state.pb path resolve to a relative "state.pb" against the process CWD. Under systemd (where CWD is typically /), this caused "open state.pb: permission denied" for operations such as `volumeServer.state -maintenanceOn`, even though the configured user owned the data dirs. Fall back to the first disk location's IdxDirectory so state.pb lives next to the volume data, consistent with other per-server artifacts. Fixes #9173 * fix(volume): always resolve state.pb dir via first disk location Use s.Locations[0].IdxDirectory unconditionally when a location exists so state.pb inherits the same resolution (~ expansion and empty-idxFolder fallback) already applied for the .idx files. Fall back to util.ResolvePath(idxFolder) in the location-less case so a relative or tilde-prefixed -dir.idx is still normalized. Addresses PR feedback on #9178. |
||
|
|
c40db5a52d |
perf(filer): parallelize StreamMutateEntry with path-keyed scheduler (#9171)
* perf(filer): parallelize StreamMutateEntry with path-keyed scheduler The server handler processed one mutation at a time per stream, capping a mount's aggregate throughput at ~1/filer_store_service_time regardless of client concurrency (see issue #9138). With 12 rclone processes this showed as a ~500 QPS ceiling on a filer that previously served ~1000+ QPS via unary CreateEntry. Replace the serial for-loop with a per-request goroutine admitted by a path-keyed scheduler, adapted directly from filer.sync's MetadataProcessor (weed/command/filer_sync_jobs.go). Same four conflict indexes, same kind taxonomy (file / barrier-dir / non-barrier-dir), same ancestor-barrier and descendant-barrier rules. Cross-path mutations run in parallel; same- path mutations serialize on arrival order; recursive delete and directory rename act as subtree barriers; directory attribute bumps stay non-barrier so they do not serialize file writes under them. Correctness and safety: - Per-stream goroutine cap (streamMutateConcurrency = 64) bounds resource use from a single noisy mount. - syncStream wrapper serializes stream.Send across worker goroutines (gRPC Send is not concurrent-safe). - Handler waits on in-flight workers before returning on recv EOF/error so no worker writes to a torn-down stream. - First fatal Send error from any worker propagates as the handler's return, causing the stream to tear down. Benchmark (2 ms simulated filer-store service delay, 12 client workers): serial : 440 QPS sem only : 4902 QPS (unsafe — reorders same-path ops) scheduler : 4934 QPS on distinct paths, 439 QPS on same path (correct) The sem-only number shows the upper bound of raw parallelism; the scheduler matches it on distinct paths (the realistic 12-rclone case) and correctly falls back to serial when the workload demands ordering. Peak concurrent mutations at the handler equals client worker count on the distinct-path workload and pins to 1 on the same-path workload, as the scheduler intends. * perf(filer): decouple StreamMutateEntry admission from receive loop The previous StreamMutateEntry handler called sched.Admit directly in the Recv loop. A single request conflicting on path /hot would head-of-line block stream.Recv, so later requests targeting unrelated paths could not be received or admitted until /hot drained — cross-path parallelism then depended on request ordering instead of being a property of the scheduler. Spawn the worker goroutine immediately on Recv and move sched.Admit into that goroutine. A new streamMutatePendingLimit (1024) caps total per- stream outstanding goroutines (pending + active) so a client flooding a conflicted path cannot explode goroutine count without bound. Addresses #9171 review comment (coderabbitai, Major). * fix(filer): reply with EINVAL on unknown StreamMutateEntry request type Returning nil when req.Request is a future oneof variant or a malformed request left the client's per-RequestId waiter blocked forever, because no response was ever sent for that id. Reply with IsLast=true and EINVAL so the waiter completes with a well-formed error. Addresses #9171 review comments (gemini-code-assist, coderabbitai). * fix(filer): make classifyMutation crash-free and correct for deletes Two issues addressed together because they share one function: 1. Nil-entry panic. classifyMutation dereferenced req.Entry.Name without a nil guard; an empty create_request / update_request / rename_request from a misbehaving client crashed the scheduler. Guard each oneof variant and fall back to a "/" barrier; the handler then sends EINVAL via the unknown-request path. 2. Non-recursive delete vs concurrent dir attribute update. DeleteEntry- Request does not carry IsDirectory, so the previous kindMutateFile classification for non-recursive deletes did not conflict with an in- flight kindMutateNonBarrierDir (chmod / xattr / mtime) at the same path — a race in scheduler terms. Classify every delete as kindMutateBarrierDir regardless of IsRecursive. The incremental cost of a descendant-wait for a non-recursive delete of a non-empty dir is negligible since that call fails at the store anyway. Adds classifyMutation tests for malformed create/update, empty oneof, and updates the delete-non-recursive case to the new expected kind. Addresses #9171 review comments (coderabbitai Critical, Major). * fix(filer): route renameStreamProxy.SendMsg through the wrapping Send The default pass-through SendMsg on renameStreamProxy bypassed the syncStream mutex and the StreamMutateEntryResponse wrapping: anything the rename helpers happened to push via SendMsg would have been emitted on the wire as the wrong protobuf type and could interleave with other workers' Sends. RecvMsg similarly raced with the outer StreamMutateEntry Recv loop and could steal unrelated mutation requests. Route SendMsg through the wrapping Send (rejecting other payload types) and fail RecvMsg explicitly — the rename logic is a strictly server-push stream and never calls RecvMsg, so loud failure is safer than silent stealing. Addresses #9171 review comment (coderabbitai, Major). * test(filer): run exactly ops in stream-mutate workloads perGoroutine := ops / concurrency silently truncated the total when the values were not divisible — e.g. 2400 ops with 64 workers actually ran 2368 and with 256 workers ran 2304, making the logged "ops per run" inaccurate and introducing measurement noise that varied across the concurrency sweep. Introduce opsForWorker(g, concurrency, ops) which distributes the remainder to the first (ops % concurrency) workers so the three workloads (unary, stream sync, stream async) each dispatch exactly `ops` operations. No changes to the timing methodology. Addresses #9171 review comment (coderabbitai, Minor). * fix(filer): enforce per-path FIFO admission in mutateScheduler sync.Cond.Broadcast wakes every waiter; the first to re-acquire the mutex wins, so two conflicting same-path admissions could be reordered by the Go runtime even though they arrived serially on the stream. A single stream is supposed to carry ordered mutations — the PR's original #8770 claim — so admission must be FIFO per path. Replace the single cond with a per-path FIFO queue. Each Admit enqueues a waiter on every path it touches (primary, and on rename the secondary too) and blocks on a ready channel. tryPromoteLocked admits any waiter that is at the head of every queue it joined, passes pathConflictsLocked against the active-state indexes, and is under concurrencyLimit. Done removes the heads and re-runs tryPromoteLocked so waiters freed by the completion move in arrival order. Side effect: two non-barrier directory updates on the same path now serialize instead of overlapping. filer.sync's MetadataProcessor intentionally allows them to overlap because its events come from a committed log where last-writer-wins coalescing is safe; streamed mutations carry client operations whose order matters, so we drop that optimization here. Added TestAdmitSamePathFIFO (20-waiter barrier release) and TestAdmitSamePathNonBarrierSerializes to cover both. Also refreshed the kindMutateFile doc comment that still referenced the pre-#1ecf805f5 "non-recursive delete" classification. Addresses #9171 review comments (coderabbitai Critical, Minor). * test(filer): make TestAdmitSamePathFIFO deterministic without sleeps The previous arrival-ordering sync (send to `started` before calling Admit, plus a 1 ms sleep) relied on the goroutine actually entering Admit and reaching the per-path queue during that sleep. Under -race on a loaded CI that is a real flake source, which is ironic for a test whose job is catching non-deterministic wake-ups. Observe the scheduler's own pathQueue length between spawns instead — waitQueueLen polls s.pathQueue["/a"] under s.mu until the expected number of waiters (1 barrier holder + i+1 file waiters) is enqueued. That's the exact event the test wants to synchronise on, so there is no fudge factor. Verified by `go test -race -count=5`. Addresses #9171 review comment (coderabbitai, Minor). |
||
|
|
141413ad76 |
fix(tests): make tests pass on 32-bit architectures (#9168) (#9170)
Two separate failures reported on 32-bit builds (void-linux 4.21): - weed/server: errorStreamImpl.count (and the same pattern in slowStream plus local totalEventsSent/totalSends) was a bare int64 sitting after smaller fields, so on 386/ARMv7/mips32 it landed at a 4-byte-aligned offset and atomic.AddInt64 panicked with "unaligned 64-bit atomic operation". Switched the counters to atomic.Int64, which Go guarantees is 8-byte aligned on every architecture. - weed/plugin/worker/iceberg: three equality-delete tests fail on 32-bit because the upstream github.com/apache/iceberg-go declares manifestEntry.EqualityIDs as *[]int while the Iceberg Avro schema defines equality_ids as long, and hamba/avro refuses to map Go int onto Avro long when int is 32-bit. Not fixable in seaweedfs, so guard the affected tests with a t.Skip() when unsafe.Sizeof(int) < 8 until the upstream type is changed to []int32/[]int64. |
||
|
|
0a5c22b57e |
chore(upload): log offset/bytes-read context on chunk ReadFrom errors (#9169)
chore(upload): add offset/bytes-read context to chunk ReadFrom errors Wrap io.ErrUnexpectedEOF (and siblings) from bytesBuffer.ReadFrom so the log shows "read chunk at offset N (got M bytes): ..." instead of a bare "unexpected EOF". The context distinguishes a client disconnect before any data arrived (offset=0, got=0) from a mid-stream truncation (offset>0, got<chunkSize) — diagnosing #9149 in the wild. Verified at this stage that expectedDataSize is already threaded correctly from upload_chunked.go (actual ReadFrom bytes) through s3api assignFunc → filer AssignVolume → master PickForWrite. No behavioral change to what the master receives. |
||
|
|
e77f8ae204 |
fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157) The STS GetFederationToken handler was implemented but never reachable. Three routing gaps sent requests to the S3/IAM path instead of STS: - No explicit mux route for Action=GetFederationToken in the URL query - iamMatcher did not exclude GetFederationToken, so authenticated POSTs with Action in the form body were matched and dispatched to IAM - UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity to STS, leaving GetFederationToken to fall through to DoActions and return NotImplemented Add the missing route, the matcher exclusion, and the dispatch branch. Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount into the s3-iam-tests workflow as a new "sts" matrix entry. Before this change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four test functions ran in CI, which is why this regression shipped. * test(iam): make orphaned STS/service-account tests pass under auth-enabled CI Follow-up to wiring STS tests into CI: fixes several pre-existing issues that made the newly-included tests fail locally. Server fixes: - weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when the caller is a legacy S3-config identity (not in the IAM user store). Previously any GetPoliciesForUser error short-circuited to InternalError, which hard-failed every SigV4 caller using keys from -s3.config. - weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs in the sa:<parent>:<uuid> format required by credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed the persistence-layer regex and caused every CreateServiceAccount call to return 500 once a filer-backed credential store validated the ID. Test helpers: - test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from req.URL.Host, and a manual Host header made the signer emit host;host in SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn subtest to match the existing SeaweedFS behavior (user-context assumption). - test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned IAM writes otherwise fall through to the STS fallback and return InvalidAction. CI matrix: - .github/workflows/s3-iam-tests.yml: skip TestServiceAccountLifecycle/use_service_account_credentials only. The rest of the service-account suite passes; that one subtest depends on a separate credential-reload issue where new ABIA keys briefly register into accessKeyIdent but aren't persisted to the filer, so they vanish on the next reload. Out of scope for the #9157 GetFederationToken fix. * fix(credential): accept AWS IAM username chars in service-account IDs Gemini review on #9167 pointed out that ServiceAccountIdPattern's parent-user segment was more restrictive than an AWS IAM username: `[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with `@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail validation at the filer store even though the embedded IAM API happily created them. Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html) and add a table-driven test that locks the expansion in. * address PR review feedback on #9167 All five review items were valid; changes keyed to review bullets: - weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows arbitrary policy-lookup failures. Only credential.ErrUserNotFound is treated leniently (the legacy-config SigV4 path); any other error now returns InternalError so we don't mint tokens with an incomplete policy set. - weed/credential/grpc/grpc_identity.go: GetUser translates gRPC NotFound back to credential.ErrUserNotFound so errors.Is(...) above matches for gRPC-backed stores, not just memory/filer-direct. - weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates the generated saId against credential.ValidateServiceAccountId before returning. Surfaces a client 400 with the offending ID instead of the opaque 500 that used to bubble up from the persistence layer. - weed/s3api/s3api_server_routing_test.go: seed a routing-test identity with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody with aws-sdk-go v4.Signer so the request actually passes AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers with no stsService) instead of just NotEqual(501) — 503 proves the dispatch reached STSHandlers.HandleSTSRequest. - test/s3/iam/s3_service_account_test.go: callIAMAPI signs with service="iam" instead of "s3" (SeaweedFS verifies against whichever service the client signed with, but "iam" is semantically correct). - weed/credential/validation_test.go: add positive rows for an uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000). |
||
|
|
0007245c7f |
fix(kafka): close late-joiner orphan race in consumer-group rebalance (#9162)
* fix(kafka): close late-joiner orphan race in consumer-group rebalance
The CI-observed orphan (one consumer with empty assignment after
`TestConsumerGroups/Rebalancing/MultipleConsumersJoin`) came from a
race in the group coordinator: once the leader had taken its member-
list snapshot in its JoinGroup response, a new member could still
arrive before the leader's SyncGroup landed. The gateway accepted the
stale SyncGroup, moved to Stable, and the late joiner's own SyncGroup
then served an empty Assignment from the Stable-state path — leaving
it silently unassigned with no further rebalance to fix it.
Three changes in `handleJoinGroup` / `handleSyncGroup` close the race:
- Late join during `CompletingRebalance` bumps the generation and
resets to `PreparingRebalance`, so the leader's in-flight SyncGroup
fails its generation check and the round restarts with the new
member in the snapshot.
- SyncGroup generation-mismatch returns `REBALANCE_IN_PROGRESS` (not
`ILLEGAL_GENERATION`) while the group is rebalancing, mirroring the
existing heartbeat fix — otherwise Sarama's `Consume()` tears down
on the stale SyncGroup instead of retrying.
- Leader SyncGroup verifies its assignment covers every current
member and rejects with `REBALANCE_IN_PROGRESS` otherwise, as a
belt-and-suspenders catch for joins that slip in between the
leader's JoinGroup reply and its SyncGroup without going through
`CompletingRebalance` state.
Verified: baseline reliably reproduces the orphan locally; with the
fix `TestConsumerGroups` passes end-to-end (53s total,
`MultipleConsumersJoin` 15-17s) and a 10-iteration stress loop against
the same gateway is 10/10 green with every consumer getting exactly
one partition.
* fix(kafka): clear stale Assignment when restarting a rebalance round
Review spot: the two restart paths added in the previous commit bumped
group.Generation and reset each member's State to Pending but left
member.Assignment populated with the prior generation's partitions.
The non-leader SyncGroup path only returns REBALANCE_IN_PROGRESS when
`member.Assignment` is empty (handleSyncGroup ~line 982). Leaving the
stale assignment in place means a member rejoining at the new
generation — before the leader's SyncGroup has published fresh
assignments — falls through that guard and is served its old
partitions from the pre-rebalance state.
Clear m.Assignment alongside m.State in both restart sites so the
guard fires and the member correctly re-enters the join/sync cycle.
Verified with a fresh-broker TestConsumerGroups run: 50.99s total,
MultipleConsumersJoin 15.25s, all four consumers each get exactly one
partition.
* fix(kafka): don't let empty leader assignments bypass coverage check
Review spot: the leader-assignment branch was gated on
`len(request.GroupAssignments) > 0`, so a leader SyncGroup that omitted
every current member (empty array with a non-empty group) fell through
to the server-side-assignment `else` branch and could move the group
Stable without the intended rebalance retry.
Drop the length guard. Whenever the caller is the leader, build the
assigned-member map and run the coverage check; if the assignment
omits any current member (including the all-empty case against a
non-empty group), bump the generation, reset to PreparingRebalance,
clear each member's Assignment, and return REBALANCE_IN_PROGRESS so
the leader rebuilds its snapshot and sends a complete assignment on
retry. The server-side-assignment branch (documented as "should not
happen with Sarama") is now only reachable for non-leader+non-empty
SyncGroups — a genuinely unexpected case — and keeps its existing
warning.
* revert: keep len(GroupAssignments) > 0 gate on leader-assign branch
The previous commit (
|
||
|
|
49f62df2cf |
fix(shell): mergeVolumes hard-link safety and cleaner cleanup logging (#9163)
* fix(shell): skip hard-linked entries in fs.mergeVolumes Hard-linked entries share a chunk list with their siblings, but the filer's UpdateEntry only rewrites one entry at a time. Moving a chunk here leaves every other hard-linked sibling pointing at a fid that either gets deleted by the filer's own garbage step after UpdateEntry or by deleteMovedSourceNeedles (#9160) — either way, the siblings end up with dangling references. Skip the entry with a visible log line so operators know the file was bypassed and can handle it explicitly (copy-then-unlink, or dedup before merge). Detected via entry.HardLinkId being non-empty, which is the same signal the filer itself uses (weed/pb/filer_pb/filer.pb.go:451). Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes suppresses 404 alongside 304 in source cleanup BatchDelete also returns StatusNotFound (404) for an already-deleted needle when ReadVolumeNeedle can't find it in the cookie-check path, not only StatusNotModified (304) from DeleteVolumeNeedle returning size 0. Both are benign races against a concurrent fsck purge or a replica that already reconciled, so don't clutter the output with "delete ... not found" warnings for them. Flagged by coderabbit on #9160 post-merge. * fix(shell): mergeVolumes merges hard-linked files via dedup on HardLinkId Hard-linked siblings share one chunk list through a KV blob keyed by HardLinkId (see weed/filer/filerstore_hardlink.go). UpdateEntry's setHardLink rewrites that blob and maybeReadHardLink overrides per-entry chunks with the blob's on every read, so a single UpdateEntry propagates new fids to every sibling automatically — the previous skip-hardlinks bailout was overly conservative and left hard-linked files stuck on merge-source volumes forever. Process each HardLinkId exactly once per run with a sync.Map so BFS workers in different directories synchronize without a global lock. First sibling carries the chunk move + UpdateEntry; later siblings find the id in the map and return — preventing the real race, which is two siblings trying to re-download an already-moved source needle or double-queue the same fid for deletion. Also address the log-spam review on deleteMovedSourceNeedles: an unreachable volume server returns one error per needle, so collapse multiple failures into a single per-server line with the first error as an example. |