mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-06-09 18:32:43 +00:00
4.37
658 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2c2df751f5 |
Perf CI: benchmark the Rust volume server and report memory usage (#10111)
* ci: add per-process memory sampler for perf jobs
Samples VmRSS once a second into a CSV and records peak VmHWM per process on stop. Linux only; reads /proc/<pid>/status.
* ci: run perf benchmarks on the Rust volume server and report memory
Matrix the throughput and S3 jobs over go/rust volume servers, using a standalone master (plus filer for S3) and swapping only the volume binary so the two are directly comparable. Sample peak RSS in every job and surface it per impl in the run summary.
* ci: harden mem sampler arg handling and peak fallback
Guard against missing args under set -u, and fall back to the max RSS sampled when a process exits before VmHWM can be read. |
||
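The sampler commit above describes reading VmRSS and VmHWM from /proc/<pid>/status and falling back to the largest sampled RSS when VmHWM cannot be read. The CI sampler itself is not shown here, so the following is only a Go sketch of that parsing and fallback logic; `parseStatusKB` and `peakKB` are hypothetical names, not part of the repository.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStatusKB extracts a "<field>: <n> kB" value from the text of
// /proc/<pid>/status. It returns false when the field is absent or malformed.
func parseStatusKB(status, field string) (int64, bool) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		name, rest, ok := strings.Cut(sc.Text(), ":")
		if !ok || name != field {
			continue
		}
		parts := strings.Fields(rest)
		if len(parts) == 0 {
			return 0, false
		}
		kb, err := strconv.ParseInt(parts[0], 10, 64)
		if err != nil {
			return 0, false
		}
		return kb, true
	}
	return 0, false
}

// peakKB prefers VmHWM from the final status snapshot and falls back to the
// largest sampled VmRSS when the process exited before VmHWM could be read.
func peakKB(finalStatus string, samples []int64) int64 {
	if hwm, ok := parseStatusKB(finalStatus, "VmHWM"); ok {
		return hwm
	}
	var peak int64
	for _, s := range samples {
		if s > peak {
			peak = s
		}
	}
	return peak
}

func main() {
	// Hand-written sample text in the shape of /proc/<pid>/status.
	status := "Name:\tweed\nVmHWM:\t  204800 kB\nVmRSS:\t  102400 kB\n"
	rss, _ := parseStatusKB(status, "VmRSS")
	fmt.Println("rss kB:", rss)
	fmt.Println("peak kB:", peakKB(status, []int64{1000, 102400}))
	fmt.Println("fallback kB:", peakKB("", []int64{1000, 102400, 500}))
}
```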
|
|
c01cea8786 |
docker release: run all platform jobs in one wave, cache rocksdb compile
Drop max-parallel so the 13 per-platform builds run together instead of two waves of 8 (rocksdb was queuing behind the cap and starting ~8 min late). Keep cache-to mode=max for rocksdb: its RocksDB static_lib compile is sha-independent, so it caches across releases and stops being the ~16-min long-pole that gates the merge fan-in. go-build variants stay mode=min. |
||
|
|
3f68b19500 |
docker release: per-platform builds on native runners, drop mode=max cache (#10109)
docker release: build per-platform on native runners, drop mode=max cache
The build job built every platform of a variant on one runner, so 2-4 Go cross-compiles fought over a single 2-vCPU box and arm64 ran in an emulated context. Split the matrix to one platform per job on a native runner (amd64/386 on ubuntu-latest, arm64/arm-v7 on ubuntu-24.04-arm); only arm/v7 still needs QEMU, and only for its final apk stage. Each job pushes by digest, and a new merge job assembles the multi-arch tag with imagetools and mirrors it to Docker Hub.
cache-to mode=max -> mode=min: BRANCH=sha cache-busts the heavy go-build layer every release, so writing all intermediate layers to the gha backend spent 3-11 min per variant on a cache the next release's sha can never hit. |
||
|
|
a88acaf061 |
Add performance CI (profiling, throughput, S3 read/write) (#10105)
* test: add self-contained S3 read/write load tool
Concurrent PUT/GET against the S3 gateway, reporting requests/sec, transfer rate, and latency percentiles. Built on the aws-sdk-go-v2 client the S3 tests already use, so no extra benchmark binary is needed.
* ci: add performance workflow
Three parallel jobs: cpu/heap pprof of the server under write load, native throughput via weed benchmark plus the Go micro-benchmarks, and an S3 read/write benchmark against the gateway. Runs on push to master and manual dispatch with tunable duration, object count, size, and concurrency. |
||
|
|
d65ed3b557 | add release version-bump workflow | ||
|
|
d246a1a817 |
build(deps): bump actions/checkout from 6 to 7 (#10037)
Bumps [actions/checkout](https://github.com/actions/checkout) from 6 to 7. - [Release notes](https://github.com/actions/checkout/releases) - [Commits](https://github.com/actions/checkout/compare/v6...v7) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
d6da0e0e13 |
ci: only run heavy workflows when related paths change
Add path filters to workflows that fired on every PR/push regardless of the diff: CodeQL, go build, the e2e/EC/vacuum/TLS/plugin-worker integration suites, the Kafka and Postgres gateways, the S3 suites (Ceph s3tests, s3-go, s3-tables, proxy-signature, https, example, filer-group), TUS, and the dev binary/container builds. Each scopes to its subsystem under weed/, its test dir, go.mod/go.sum, and the workflow file, so docs-, helm-, terraform-, rust- or java-only changes no longer trigger a full compile-and-test fleet. |
||
|
|
871d7ddc02 |
[helm]: configure JWT expiration (#9940)
helm: configure JWT expiration |
||
|
|
c3b06bf809 |
ci: run weed tests on linux/386 (#9924)
386 test binaries execute natively on the amd64 runner, so the suite catches what vet cannot: unaligned 64-bit atomics and arithmetic that wraps at runtime. -short keeps the e2e suites on amd64 only. |
||
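As background for the "unaligned 64-bit atomics" failure mentioned above: on 386, a 64-bit field that follows a 32-bit field in a struct may sit on a 4-byte boundary, and the function-style atomics such as `atomic.AddUint64` require 64-bit alignment there. The sketch below contrasts the two layouts; the struct names are illustrative only and do not come from the SeaweedFS code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// riskyStats places a uint32 before the uint64 counter. On linux/386 the
// counter may end up only 4-byte aligned, so atomic.AddUint64(&s.count, 1)
// can panic at runtime even though go vet accepts the code.
type riskyStats struct {
	flag  uint32
	count uint64
}

// safeStats uses atomic.Uint64 (Go 1.19+), which is documented to be
// 64-bit aligned automatically on every platform, including 386 and ARM.
type safeStats struct {
	flag  uint32
	count atomic.Uint64
}

func main() {
	var s safeStats
	s.count.Add(2)
	fmt.Println("count:", s.count.Load())
}
```

Because the failure only appears when the code actually executes on a 32-bit target, running the tests under GOARCH=386 catches it where a type-check alone does not.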
|
|
3eb550a3f1 |
fix(tests): 32-bit build of EC e2e tests, type-check linux/386 in CI (#9922)
* fix(tests): keep EC e2e fid cookie arithmetic in uint32
The cookie constants 0x9490CA00 and 0x9500CA00 were added to the int loop variable before conversion, overflowing 32-bit int at compile time on linux/386 and linux/arm. Convert the loop variable instead so the addition stays in uint32.
* fix(tests): pass s3client max backoff in milliseconds
MaxBackoffDelay is documented as milliseconds and multiplied by 1e6 before use, but the example set it to 5s in nanoseconds, yielding an absurd backoff on 64-bit and a compile-time int overflow on 32-bit.
* ci: type-check code and tests for linux/386
64-bit-only constant arithmetic keeps slipping into test files and breaking 32-bit downstream builds. Vet the whole root module under GOOS=linux GOARCH=386 so these fail in CI instead of after release.
* fix(tests): convert s3client backoff to Duration before scaling
The ms-to-ns multiplication ran in int, wrapping at runtime on 32-bit; scale by time.Millisecond after the Duration conversion instead. |
||
|
|
caadd6ca79 |
ci(s3tables): stop Lakekeeper flaking on Docker Hub pull timeouts (#9920)
* ci(s3tables): drop docker pre-pull from Lakekeeper job
The lakekeeper repro is pure Go against the local weed binary; the job kept failing on Docker Hub timeouts pulling python:3 and localstack images the test never runs. Also drop the stale python-in-docker comments left from the old harness.
* ci(s3tables): serve python:3 from GHA cache in the STS job
Retried pulls still die when both mirror.gcr.io and registry-1.docker.io are unreachable from the runner. Cache the saved image tarball under a weekly key: an exact hit skips the registry entirely, a miss pulls fresh and refreshes the cache, and a stale tarball from a previous week is the fallback when Docker Hub is down.
* ci(spark): pre-pull the spark tag the test actually runs
The workflow warmed apache/spark:3.5.8 with retries while the testcontainers setup runs apache/spark:3.5.1, so the real image was pulled at test time with no retry at all. |
||
|
|
0c2576c3d0 |
ci: route Docker Hub pulls through a mirror to cut registry timeouts (#9904)
* ci(s3tables): route Docker Hub pulls through mirror, drop unused buildx
The integration jobs set up docker/setup-buildx-action only to docker pull/run images; the buildx bootstrap pulls moby/buildkit from registry-1.docker.io, which times out and fails the whole job before any test runs. These jobs never docker build with buildx, so the setup is pure overhead and an extra registry dependency.
Replace it with a daemon registry-mirror pointing at mirror.gcr.io (a pull-through cache for Docker Hub) and retry the pre-pulls a few times. That removes the buildkit pull entirely and routes the rest through the cache, with graceful fallback to Docker Hub on a miss.
* ci: route Docker Hub through mirror in remaining docker test workflows
Same registry-1.docker.io timeout fix for the other integration jobs. s3-spark only docker pulls/runs an image, so drop the vestigial buildx setup and pull through the mirror with retries, matching s3-tables.
kafka-quicktest, s3-proxy-signature, e2e and postgres build/compose and genuinely need buildx (e2e/postgres export a local layer cache, which the default driver can't), so keep it and just configure the mirror first — that way even the moby/buildkit bootstrap pull is served from the cache.
Left samba/pjdfstest alone: they build-push to a local registry and pull from localhost, so buildx is required and there's no Docker Hub runtime pull to mirror. |
||
|
|
9053d61504 |
rust release: fix large-disk/normal binary overwrite + publish md5 checksums (#9862)
* rust release: publish .md5 checksums alongside weed-volume binaries
The versioned rust volume release built and uploaded the tarballs/zips but no checksum sidecars (the Go releases get .md5 automatically via go-release-action; this workflow uses softprops/action-gh-release directly). Generate an .md5 next to each asset (md5sum on linux/windows-bash, md5 -r on macOS) and include them in the release/artifact uploads, so downloaders (e.g. seaweed-up, which verifies md5 before installing weed-volume) can check integrity. Covers linux amd64+arm64, darwin amd64+arm64, windows amd64.
* rust release: build large-disk and normal into separate target dirs
Both cargo builds wrote to target/<triple>/release/weed-volume, so the second (normal, --no-default-features) overwrote the first, and the Package step then copied that same binary into BOTH tarballs — the large-disk asset actually shipped the normal binary. Build each variant into its own --target-dir (target/large-disk and target/normal, both under target/ so the existing cache still covers them) and copy each tarball's binary from its own dir. |
||
|
|
3688be82f5 |
fix(helm): deduplicate all-in-one extra environment variables (#9837)
* fix(helm): deduplicate all-in-one extra environment variables
The all-in-one Deployment looped global.seaweedfs.extraEnvironmentVars and allInOne.extraEnvironmentVars in two separate ranges, so any key present in both maps was emitted as two env entries with conflicting values. It also computed a merged map for the cluster-default lookup but never used it for the env loop. Use the existing seaweedfs.mergeExtraEnvironmentVars helper (as the filer, master and s3 templates already do) so a key set in both maps renders once with the component value taking precedence, and add a chart-CI render assertion covering it.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
* ci(helm): drop checkmark glyphs from chart test output
---------
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
ae4ad6859d |
fix(helm): suspend bucket versioning for YAML bool false (#9836)
* fix(helm): suspend bucket versioning for YAML bool false
createBuckets[].versioning accepts both a YAML bool and a string. The string branch maps "false"/"disable"/"suspended" to Suspended, but the bool branch only handled true (Enabled) and left false as a silent no-op. The same logical value therefore behaved differently depending on its YAML type: `versioning: false` did nothing while `versioning: "false"` suspended the bucket. Mirror the string behaviour in the bool branch so bool false suspends the bucket, and add a chart-CI render assertion covering it.
Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
* ci(helm): trim versioning regression-test comment
* chart: document bool false for createBuckets versioning
---------
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Co-authored-by: Chris Lu <chris.lu@gmail.com> |
||
|
|
df833d485f | [test] update docker image for s3test (#9811) | ||
|
|
24159fbff9 |
build(deps): bump opentofu/setup-opentofu from 1 to 2 (#9801)
Bumps [opentofu/setup-opentofu](https://github.com/opentofu/setup-opentofu) from 1 to 2. - [Release notes](https://github.com/opentofu/setup-opentofu/releases) - [Commits](https://github.com/opentofu/setup-opentofu/compare/v1...v2) --- updated-dependencies: - dependency-name: opentofu/setup-opentofu dependency-version: '2' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Chris Lu <chrislusf@users.noreply.github.com> |
||
|
|
cb67542d01 |
build(deps): bump docker/setup-qemu-action from 4.0.0 to 4.1.0 (#9802)
Bumps [docker/setup-qemu-action](https://github.com/docker/setup-qemu-action) from 4.0.0 to 4.1.0. - [Release notes](https://github.com/docker/setup-qemu-action/releases) - [Commits](https://github.com/docker/setup-qemu-action/compare/v4...v4.1.0) --- updated-dependencies: - dependency-name: docker/setup-qemu-action dependency-version: 4.1.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
6908445c5d |
build(deps): bump actions/checkout from 5 to 6 (#9803)
Bumps [actions/checkout](https://github.com/actions/checkout) from 5 to 6. - [Release notes](https://github.com/actions/checkout/releases) - [Commits](https://github.com/actions/checkout/compare/v5...v6) --- updated-dependencies: - dependency-name: actions/checkout dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
fba71ab14c |
ci: parallelize the unified release-container build (#9783)
* docker: cross-compile the Go binary instead of emulating it under QEMU
The builder stage ran as the target platform, so arm64/arm/386 images emulated the whole Go compile (and the full git clone) under QEMU. The binary is CGO-free, so pin the builder to $BUILDPLATFORM and cross-compile with GOOS/GOARCH (GOARM for v7), keeping every target's compile native.
* ci: build all release container variants in parallel
The build matrix throttled to two variants at a time on a stale rate-limit worry. Pulls go through mirror.gcr.io and pushes target GHCR only, so the five variants can all build at once.
* ci: copy each variant to Docker Hub from its build job
The separate copy-to-dockerhub job waited on the whole build matrix before any GHCR -> Docker Hub copy could start. Move the crane copy into the build job so each variant copies as soon as it is built, overlapping with the others still compiling. tag-latest and helm-release now depend on build. |
||
|
|
a10607f90a |
Add Terraform support for VM-based SeaweedFS deployment (#9754)
* terraform: add cloud-agnostic core renderer module
Renders per-node weed argv, systemd units, config files, disk-mount and secret-fetch scripts, and cloud-init from an address map. Creates zero cloud resources. Flags verified against the weed binary: volume uses -mserver for the master list, gRPC is -port.grpc (auto http+10000), minFreeSpacePercent is a string, filer store via -defaultStoreDir.
* terraform: add mTLS and JWT security module
Generates the CA, per-component certs with distinct CNs, and JWT signing keys via the tls/random providers. Emits a core_security object plus PEMs for secret-store delivery.
* terraform: add AWS deployment module and examples
Reserves stable ENIs first, renders config via the core, then creates instances, prevent_destroy EBS data disks mounted at /data, and the cluster security group. With enable_security, generates certs/JWT, stores them in SSM SecureString, grants an instance role, and fetches them at boot so secrets stay out of user_data. Keyed for_each on every stateful tier.
* terraform: add local cluster test harnesses
run_local_cluster.sh and run_local_secure.sh render a cluster with the core and run real weed processes, asserting master quorum, volume registration, filer/s3 round-trips, mutual-TLS formation, and JWT enforcement. Use an isolated high port range with a guard so they never touch a cluster already running on the machine. The weed binary defaults to $(go env GOPATH)/bin/weed.
* terraform: add CI workflow and README
fmt/validate/tofu-test plus smoke jobs that build weed and run both harnesses.
* terraform: guard against empty filesystem UUID in mount script
An empty UUID made grep -q match any fstab line, skipping the fstab entry and breaking the mount. Fail fast when blkid returns no UUID.
* terraform: sanitize cluster name in WEED_CLUSTER env keys
Hyphens or spaces in cluster_name produced invalid systemd/bash env var names; map non-alphanumerics to underscores.
* terraform: omit empty jwt.signing block from security.toml
With enable_security and no JWT key, the template emitted [jwt.signing] key="". Gate the block on a non-empty key and cover it with a test.
* terraform: mark core security input as sensitive
The security object carries JWT signing keys; keep them out of plan output and known values.
* terraform: enforce jwt_length minimum of 32
* terraform: note region/AZ coupling in HA example
* terraform: guard WORKDIR before recursive delete in test harnesses
* terraform: fix README fence language and test count
* terraform: handle embedded s3 with no filer nodes
Indexing sort(keys(var.filers))[0] errored at plan time when embedded S3 was enabled but no filers were defined; fall back to an empty config source.
* terraform: scope kms:Decrypt to a configurable key arn
Replace the hardcoded Resource="*" with a kms_key_arn variable (default "*") so production can restrict decrypt to a specific CMK.
* terraform: encrypt EBS data volumes at rest
Set encrypted = true on the volume/filer data disks and the all-in-one example disk.
* terraform: protect filer instances from API termination
Filers hold the leveldb2 metadata store, so they are stateful and get the same disable_api_termination as masters and volumes.
* terraform: stop instance before detaching in all-in-one example
* terraform: drop stale references to the removed plan doc
* terraform: correct stale mount-step comment in aws module
* terraform: mark Terraform support as experimental in README |
||
|
|
dfd05d14cb |
refactor(filer): remove the inode->path index and the NFS gateway (#9724)
* fix(filer): derive inodes by hash instead of a snowflake sequencer
Compute the same inode the FUSE mount would: non-hard-linked entries hash path + crtime, hard links hash their shared HardLinkId so every link resolves to one inode. Removes the snowflake inodeSequencer and the SEAWEEDFS_FILER_SNOWFLAKE_ID knob; inodes are now deterministic across filers.
* chore: remove the experimental NFS gateway
The NFS frontend ('weed nfs') was the only consumer of the inode->path index. Remove the weed/server/nfs package, the command and its registration, the integration test harness, and the CI workflow; go mod tidy drops the willscott/go-nfs and go-nfs-client dependencies.
* refactor(filer): drop the inode->path index
With the NFS gateway gone, nothing reads it. A regular file's inode is a pure hash of its path and a hard link's is a hash of its shared HardLinkId -- both derivable on demand -- so the secondary KV index and its write/remove hooks are dead. Removes filer_inode_index.go and the recordInodeIndex hooks from the store wrapper.
|
||
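The inode derivation described above (hash of path + crtime for regular entries, hash of the shared HardLinkId for hard links) can be illustrated with a small sketch. The hash function, field names, and byte encoding here are assumptions chosen for illustration; the commit message does not say which hash the FUSE mount uses, so FNV-1a stands in for it.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// entry is a hypothetical stand-in for a filer entry.
type entry struct {
	Path       string
	CrtimeUnix int64
	HardLinkID []byte // empty for entries that are not hard links
}

// deriveInode hashes the shared HardLinkID when present, so every link
// resolves to one inode; otherwise it hashes path + creation time.
// FNV-1a is an assumption, not the hash SeaweedFS actually uses.
func deriveInode(e entry) uint64 {
	h := fnv.New64a()
	if len(e.HardLinkID) > 0 {
		h.Write(e.HardLinkID)
		return h.Sum64()
	}
	h.Write([]byte(e.Path))
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], uint64(e.CrtimeUnix))
	h.Write(ts[:])
	return h.Sum64()
}

func main() {
	link := []byte{0x01, 0x02, 0x03}
	a := entry{Path: "/dir/a", CrtimeUnix: 1700000000, HardLinkID: link}
	b := entry{Path: "/dir/b", CrtimeUnix: 1700000500, HardLinkID: link}
	fmt.Println("hard links share an inode:", deriveInode(a) == deriveInode(b))

	f := entry{Path: "/dir/file", CrtimeUnix: 1700000000}
	fmt.Println("same inputs, same inode:", deriveInode(f) == deriveInode(f))
}
```

Since the result depends only on the entry's own fields, any filer computes the same value without a shared sequencer, which is the property the commit relies on.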
|
|
502fef6b50 |
build(deps): bump docker/login-action from 4.1.0 to 4.2.0 (#9678)
Bumps [docker/login-action](https://github.com/docker/login-action) from 4.1.0 to 4.2.0. - [Release notes](https://github.com/docker/login-action/releases) - [Commits](https://github.com/docker/login-action/compare/v4.1.0...v4.2.0) --- updated-dependencies: - dependency-name: docker/login-action dependency-version: 4.2.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
391f543ff2 |
fix(ec): correct multi-disk disk counting and EC balance shard attribution (#9594)
* fix(shell): count physical disks in cluster.status on multi-disk nodes
The master keys DataNodeInfo.DiskInfos by disk type, so several same-type
physical disks on one node collapse into a single DiskInfo entry. cluster.status
(printClusterInfo) and CountTopologyResources counted len(DiskInfos), reporting
one disk per node instead of the real physical disk count, while volume.list and
the admin ActiveTopology already split per physical disk.
Route both counters through DiskInfo.SplitByPhysicalDisk so a node with N
same-type disks reports N. Cosmetic/diagnostic only; placement already uses the
per-disk activeDisk map.
* fix(ec): attribute EC balance source disk per shard and reject same-node moves
On multi-disk nodes the EC balance worker built a node-level view that kept only
the first physical disk id per (node, volume), so a move of a shard living on a
different disk reported the wrong source disk. That source disk drives the
per-disk capacity reservation, so the wrong disk drifts the capacity model the
EC placement planner relies on. Track shards per physical disk and resolve the
actual source disk for every emitted move (dedup, cross-rack, within-rack,
global), keeping the per-disk view consistent as simulated moves are applied.
Also close a data-loss trap: VolumeEcShardsDelete is node-wide (it removes the
shard from every disk on the node) and copyAndMountShard skips the copy when
source and target addresses match, so a same-node move would erase a shard it
never copied. isDedupPhase now requires the same node AND disk, and Validate /
Execute reject same-node cross-disk moves outright.
* fix(ec): spread EC balance moves across destination disks
Port the shell ec.balance pickBestDiskOnNode heuristic to the EC balance
worker so a moved shard is placed on a good physical disk instead of always
deferring to the volume server (target disk 0). The detection now builds a
per-physical-disk view of each node (free slots split from the node total, exact
EC shard count, disk type, discovered from both regular volumes and EC shards)
and, for each cross-rack, within-rack, and global move, chooses the destination
disk by ascending score:
- fewer total EC shards on the disk,
- far fewer shards of the same volume on the disk (spread a volume's shards
across disks for fault tolerance), and
- data/parity anti-affinity (a data shard avoids disks holding the volume's
parity shards and vice versa).
Planned placements are reserved on the in-memory model during a run so multiple
shards moved to the same node spread across its disks rather than piling on one.
* fix(ec): bring EC balance worker to parity with shell ec.balance
The worker's cross-rack and within-rack balancing balanced shards by total
count; the shell balances data and parity shards separately with anti-affinity
and honors replica placement. Port that logic so the automatic balancer makes
the same fault-tolerance-aware decisions as the manual command:
- Cross-rack and within-rack now run a two-pass balance: data shards spread
first, then parity shards spread while avoiding racks/nodes that already hold
the volume's data shards (anti-affinity), mirroring doBalanceEcShardsAcrossRacks
and doBalanceEcShardsWithinOneRack.
- Optional replica placement: a new replica_placement config (e.g. "020")
constrains shards per rack (DiffRackCount) and per node (SameRackCount); empty
keeps the previous even-spread behavior.
- The data/parity boundary is resolved from a per-collection EC ratio (standard
10+4 here), replacing the previously hardcoded constant at the call sites.
Selection is deterministic (sorted keys) to keep behavior reproducible.
* refactor(ec): extract shared ecbalancer package for shell and worker
The EC shard balancing policy was duplicated between the shell ec.balance
command and the admin EC balance worker, and the two had drifted (multi-disk
handling, data/parity anti-affinity, replica placement). Extract the policy into
a new pure package, weed/storage/erasure_coding/ecbalancer, that both callers
share so it cannot drift again.
- ecbalancer.Plan(topology, options) runs the full policy (dedup, cross-rack and
within-rack data/parity two-pass with anti-affinity, global per-rack balance,
and diversity-aware disk selection) over a caller-built Topology snapshot and
returns the shard Moves. It depends only on erasure_coding and super_block.
- The worker builds the Topology from the master topology and turns Moves into
task proposals; the shell builds it from its EcNode model and executes Moves
via the existing move/delete RPCs. Per-collection EC ratio resolution stays in
each caller (passed as Options.Ratio).
- Options expose the two genuine policy differences: GlobalUtilizationBased
(worker balances by fractional fullness; shell by raw count) and
GlobalMaxMovesPerRack (worker moves incrementally across cycles; shell drains
in one pass).
The shell keeps pickBestDiskOnNode for the evacuate command. Policy tests move to
the ecbalancer package; the shell and worker keep their adapter/execution tests.
* fix(ec): restore parallelism and per-type/full-range balancing after ecbalancer refactor
Address regressions and gaps from the ecbalancer extraction:
- Shell ec.balance honors -maxParallelization again: planned moves run phase by
phase (preserving cross-phase dependencies) with bounded concurrency within a
phase. Apply mode does only the RPCs concurrently; dry-run stays sequential and
updates the in-memory model for inspection.
- Rack and node balancing gate on per-type spread (data and parity separately)
instead of combined totals, so a data/parity skew is corrected even when the
per-rack/node totals are even.
- Global rack balancing iterates the full shard-id space (MaxShardCount) so
custom EC ratios with more than the standard total are candidates.
- Cross-rack planning decrements the destination node's free slots per planned
move, so limited-capacity targets are no longer over-planned.
* fix(ec): make EC dedup keeper deterministic and capacity-aware
When a shard is duplicated across nodes, keep the copy on the node with the most
free slots and delete the duplicates from the more-constrained nodes, relieving
capacity pressure where it is tightest. Tie-break on node id so the choice is
deterministic. This unifies the shell and worker (the shell previously kept the
least-free node, an incidental default) on the more sensible behavior.
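A minimal sketch of the keeper rule just described: keep the copy on the node with the most free slots and break ties on node id. The `holder` type and `pickKeeper` name are hypothetical, and the tie-break direction (lowest id wins) is an assumption, since the message only says the tie-break is on node id.

```go
package main

import (
	"fmt"
	"sort"
)

// holder is a hypothetical view of one node holding a copy of a shard.
type holder struct {
	NodeID    string
	FreeSlots int
}

// pickKeeper returns the holder whose copy is kept and the holders whose
// duplicates would be deleted. It expects at least one holder.
func pickKeeper(holders []holder) (keep holder, remove []holder) {
	if len(holders) == 0 {
		return holder{}, nil
	}
	sorted := append([]holder(nil), holders...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].FreeSlots != sorted[j].FreeSlots {
			return sorted[i].FreeSlots > sorted[j].FreeSlots // most free first
		}
		return sorted[i].NodeID < sorted[j].NodeID // assumed tie-break order
	})
	return sorted[0], sorted[1:]
}

func main() {
	keep, remove := pickKeeper([]holder{
		{NodeID: "node-c", FreeSlots: 2},
		{NodeID: "node-b", FreeSlots: 7},
		{NodeID: "node-a", FreeSlots: 7},
	})
	fmt.Println("keep:", keep.NodeID)
	for _, r := range remove {
		fmt.Println("delete duplicate on:", r.NodeID)
	}
}
```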
* fix(ec): restore global volume-diversity and per-volume move serialization
Two more behaviors lost in the ecbalancer refactor:
- Global rack balancing again prefers moving a shard of a volume the destination
does not hold at all before adding another shard of an already-present volume
(two-pass, mirroring the old balanceEcRack), keeping each volume's shards
spread across nodes.
- Shell apply-mode execution serializes a single volume's moves within a phase
while still running different volumes in parallel, so concurrent moves of the
same volume cannot race on its shared .ecx/.ecj/.vif sidecar files.
* fix(ec): key EC balance shards by (collection, volume id)
A numeric volume id can be reused across collections, and EC identity is
(collection, vid) (see store_ec_attach_reservation.go). The ecbalancer keyed
Node.shards by vid alone, so volumes sharing an id across collections merged into
one entry — letting dedup delete a "duplicate" that is actually a different
collection's shard, and letting moves act across collections. Key shards by
(collection, vid) throughout so each volume stays distinct.
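To illustrate the keying change, here is a sketch that uses a composite struct key instead of the bare volume id. The type names and the bitmask representation are assumptions for illustration, not the ecbalancer's actual types.

```go
package main

import "fmt"

// volumeKey identifies an EC volume by collection and numeric id.
type volumeKey struct {
	Collection string
	VolumeID   uint32
}

// shardIndex maps a volume to a bitmask of the shard ids a node holds.
type shardIndex map[volumeKey]uint32

// add records that the node holds the given shard of the given volume.
func (s shardIndex) add(k volumeKey, shardID uint) {
	s[k] |= 1 << shardID
}

func main() {
	idx := shardIndex{}
	// Volume id 7 exists in two collections; keyed by id alone these would
	// merge and shard 0 would look duplicated.
	idx.add(volumeKey{"photos", 7}, 0)
	idx.add(volumeKey{"logs", 7}, 0)
	fmt.Println("distinct volumes tracked:", len(idx))
}
```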
* fix(ec): credit freed capacity from dedup before later balance phases
Dedup deletions are simulated only by applyMovesToTopology, which cleared shard
bits but did not return the freed disk/node/rack slots. Later phases reject
destinations with no free slots, so a slot opened by dedup could not be reused in
the same Plan/ec.balance run. applyMovesToTopology now credits the freed
disk/node/rack capacity for dedup moves (non-dedup moves still rely on the inline
accounting their phase already did).
* test(ec): add multi-disk EC balance integration test
Cover issue 9593 end-to-end at the unit level the old tests missed: build the
master's actual multi-disk wire format (same-type disks collapsed into one
DiskInfo, real DiskId only in per-shard records), run it through a real
ActiveTopology and the Detection entry point, then replay the planned moves with
the volume server's true semantics (node-wide VolumeEcShardsDelete) and assert no
EC shard is ever lost. Covers a balanced spread, a one-node-concentrated volume,
and a multi-rack spread, and asserts moves are safe (no same-node cross-disk),
correctly attributed to the source disk, and redistribute concentrated volumes
across both other racks and multiple destination disks.
* fix(ec): aggregate per-disk EC shards when verifying multi-disk volumes
collectEcNodeShardsInfo overwrote its per-server entry for each EcShardInfo of a
volume. A multi-disk node reports one EcShardInfo per physical disk holding shards
of the volume, so only the last disk's shards survived — the node looked like it
was missing shards it actually had. This made ec.encode's pre-delete verification
(and ec.decode) under-count volumes whose shards are spread across disks on one
server, falsely aborting the encode on multi-disk clusters. Union the per-disk
shard sets per server instead.
Also make verifyEcShardsBeforeDelete poll briefly: shard relocations reach the
master via volume-server heartbeats, so a freshly distributed shard set may not be
fully visible the instant the balance returns. Retry before concluding the set is
incomplete; genuine loss still fails after the retries are exhausted.
* test(ec): end-to-end multi-disk EC balance shard-loss regression
Start a real cluster of multi-disk volume servers (3 servers x 4 disks),
EC-encode a volume, run ec.balance, and assert hard invariants the prior
integration tests only logged: after encode all 14 shards exist, ec.balance loses
no shard, shards span more than one disk per node, and cluster.status counts
physical disks (not one per node). This reproduces issue 9593 end to end and would
have caught the multi-disk shard-aggregation bug fixed alongside it.
* fix(ec): bring EC balance worker/plugin path to parity with shell
- Per-volume serialization and phase order: key the plugin proposal dedupe by
(collection, volume) instead of (volume, shard, source), so the scheduler runs
only one of a volume's moves at a time (within a run and against in-flight jobs).
Concurrent same-volume moves raced on the volume's .ecx/.ecj/.vif sidecars; and
because the planner emits a volume's moves in phase order, they now execute in
order across detection cycles, matching the shell.
- disk_type "hdd": normalize via ToDiskType (hdd -> "" HardDriveType) while keeping
a "filter requested" flag, so disk_type=hdd matches the empty-keyed HDD disks
instead of nothing; apply the canonical type to planner options and move params.
- Replica placement: expose shard_replica_placement in the admin config form and
read it into the worker config, mirroring ec.balance -shardReplicaPlacement.
* test(ec): rename worker in-process test (not a real integration test)
The worker-package multi-disk tests build a fake master topology and simulate
move execution; they are not real-cluster integration tests. Rename
integration_test.go -> multidisk_detection_test.go and drop the Integration
prefix so 'integration' refers only to the real-cluster E2Es in test/erasure_coding.
* ci(ec): remove redundant ec-integration workflow
ec-integration.yml duplicated EC Integration Tests under the same workflow name
but ran only 'go test ec_integration_test.go' (one file), so it never ran new
test files (e.g. multidisk_shardloss_test.go) and was a strict, path-filtered
subset of ec-integration-tests.yml, which already runs 'go test -v' over the whole
test/erasure_coding package on every push/PR.
* fix(ec): worker falls back to master default replication for EC balance
For strict parity with the shell, the EC balance worker now uses the master's
configured default replication as the replica-placement fallback when no explicit
shard_replica_placement is set, instead of always defaulting to even spread.
The maintenance scanner reads it via GetMasterConfiguration each cycle and passes
it through ClusterInfo.DefaultReplicaPlacement; detection resolves the constraint
(explicit config wins, else master default, else none) in resolveReplicaPlacement.
A zero-replication default (the common 000 case) still means even spread, so the
common configuration is unchanged.
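The resolution order described above (explicit config wins, else master default, else none) might look roughly like the following. This is a sketch under assumptions: the real resolveReplicaPlacement signature is not shown in the commit message, so `resolvePlacement` here is a hypothetical helper that works on plain strings and returns "" to mean "no constraint, spread evenly".

```go
package main

import "fmt"

// resolvePlacement is a hypothetical illustration of the fallback chain.
// An empty result means no replica-placement constraint (even spread).
func resolvePlacement(explicit, masterDefault string) string {
	if explicit != "" {
		return explicit // explicit shard_replica_placement wins
	}
	if masterDefault != "" && masterDefault != "000" {
		return masterDefault // fall back to the master's default replication
	}
	return "" // empty or zero default: even spread, as before
}

func main() {
	fmt.Printf("%q\n", resolvePlacement("", "000"))
	fmt.Printf("%q\n", resolvePlacement("", "010"))
	fmt.Printf("%q\n", resolvePlacement("020", "010"))
}
```

How an explicit "000" should be treated is not stated in the commit message; the sketch passes it through unchanged.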
* fix(ec): plugin path populates master default replication too
The plugin worker built ClusterInfo with only ActiveTopology, so the master
default replication fallback added for the maintenance path never reached
plugin-driven EC balance detection — empty shard_replica_placement still meant
even spread there. Fetch the master default via GetMasterConfiguration (new
pluginworker.FetchDefaultReplicaPlacement) and set ClusterInfo.DefaultReplicaPlacement
so both detection paths resolve replica placement identically to the shell.
* docs(ec): empty shard replica placement uses master default, not even spread
The EC balance config text (admin plugin form, legacy form help text, and
the struct/proto field comments) still said an empty shard_replica_placement
spreads evenly. The runtime resolves empty to the master default replication
(resolveReplicaPlacement), matching shell ec.balance, with even spread only
when that default is empty or zero. Update the text to match and regenerate
worker_pb for the proto comment change.
|
||
|
|
a5d0e4a735 |
Samba-over-FUSE integration test and distributed-lock handoff fixes (#9590)
* test(mount): add Samba over FUSE integration test
Export a SeaweedFS FUSE mount over SMB with smbd and drive it with smbclient: file round-trips, directories, rename, large-file chunking, recursive upload, cross-protocol consistency, and deletes. A second -dlm mount adds locking coverage: POSIX fcntl byte-range locks, distributed-lock write coordination, and concurrent writers. The two cross-mount handoff checks currently fail and pin a known limitation - the distributed lock is released on FUSE Release, which the kernel can delay under contention. Runs locally via test/samba/run.sh or in Docker via the compose file; wired into CI as samba-integration.yml.
* fix(cluster): release distributed lock without racing the renewal goroutine
Stop() closed the cancel channel, slept 10ms, then unlocked using renewToken. A renewal in flight during that window rotates the token on the server, so the unlock may be sent with a stale token, fail with a mismatch, and leave the lock to linger until its TTL expires - stalling other mounts waiting to write the same file. Wait for the renewal goroutine to exit before unlocking. The channel close also makes the renewToken read happen-after the last renewal.
* fix(cluster): poll for distributed lock acquisition without exponential backoff
A mount waiting to write a file held by another mount acquired through util.RetryUntil, whose backoff grows to several seconds. Once the holder released, the waiter could sleep that long before retrying, stretching the cross-mount handoff past client timeouts. Poll at the steady ~1s cadence AttemptToLock already enforces instead.
* test(mount): tighten Samba harness and mark the DLM handoff checks xfail
Run the workflow for weed/cluster changes, fail fast when the filer or smbd port never opens, and fold the recursive mput result into its own assertion so it cannot false-pass.
Mark the two cross-mount handoff checks expected-fail: they pin the remaining DLM liveness bug (the lock is freed only on the delayed FUSE Release) without failing CI, and turn the suite red if the handoff is ever fixed.
* fix(cluster): keep a wedged renewal shutdown from sending a stale unlock
If the renewal goroutine is stuck in a slow RPC, Stop() fell through to unlock anyway once it timed out waiting. A late renewal can rotate renewToken, so that unlock races it, is rejected on a stale token, and leaves the lock lingering until its TTL regardless. On the timeout path, skip the unlock and let the TTL expire the lock instead.
* fix(cluster): wake the long-lived lock renewal loop promptly on Stop
StartLongLivedLock's renewal loop slept uninterruptibly between attempts, up to 5*renewInterval (2.5*lockTTL) while unlocked. Stop() waits only lockTTL+2s for the goroutine to exit, so a Stop() during that backoff would time out before the goroutine woke and closed renewalDone, breaking the shutdown synchronization. Sleep on a timer with a select on cancelCh so the loop exits immediately. |
||
|
|
f72983c1fd |
fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"
The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".
Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.
ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.
* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal
The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.
Extend the matcher so a request is recognized as S3 Tables when either:
- its AWS V4 credential scope names SERVICE=s3tables; or
- it carries the canonical AWS JSON RPC 1.1 content type and is
unsigned (a request explicitly signed for SERVICE=s3 still wins).
The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).
Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.
* s3api: extend service matcher to all S3 Tables routes; simplify scope check
- Apply serviceMatcher to every S3 Tables route, not just the bare-path
ones. ARN-bearing paths could otherwise be hit by an S3 object key
that starts with arn:aws:s3tables:..., inside a bucket named
"buckets", "namespaces", "tables", or "tag". One matcher everywhere
closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
credential-scope check. The scope shape is fixed at
AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
is exactly s3tables. Existing unit cases (including the
access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
the single Read could return short and make the equality check flaky.
* s3api: drop content-type fallback; sign s3 tables harness traffic instead
The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.
Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.
Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
|
||
|
|
6b94701213 |
mini: quieter startup with a docker-compose-style progress board (#9524)
* mini: quieter startup with a docker-compose-style progress board
Replaces noisy startup/shutdown logs with a single in-place progress table on a TTY (or one line per state change off-TTY). Each component renders as `pending -> starting -> ready` during startup and `stopping -> stopped` during shutdown, with elapsed time on transition.
Also folds in a few cleanups uncovered while making this readable:
- route the admin.go startup prints through glog so quietMiniLogs() filters them under mini but standalone weed admin still shows them
- generate a dev SSE-S3 KEK + passphrase on first run via WEED_S3_SSE_KEK and WEED_S3_SSE_KEK_PASSPHRASE env vars (viper.Set has a nested-key conflict between s3.sse.kek and s3.sse.kek.passphrase); persisted under the data folder so restarts reuse the same key
- demote worker/master gRPC Recv 'context canceled' to V(1); those are the normal shutdown signal, not Errors/Warnings
- drop the 'Optimized Settings' block and the 'credentials loaded from environment variables' message from the welcome banner
- only show the credentials setup hints when no S3 identities exist (new s3api.HasAnyIdentity accessor backed by an atomic.Bool)
- use S3_BUCKET in the credentials hint so it pairs with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
- reorder running-services list to master / volume / filer / webdav / s3 / iceberg / admin
* mini: refuse in-memory-only SSE-S3 dev keys; surface admin serve errors
loadOrCreateMiniHexSecret returns "" when os.WriteFile fails, so SSE-S3 won't encrypt data under a KEK that the next restart can't reproduce (which would orphan whatever was written this run). The caller already treats "" as "skip setting WEED_S3_SSE_* env vars", so SSE-S3 and IAM just stay disabled for this run.
startAdminServer's serve goroutine used to only log ListenAndServe failures, so a bind error left the caller blocked on ctx.Done() with no listener. Forward the error through a buffered channel and select on it alongside ctx.Done().
* ci(s3-proxy-signature): match weed mini's new progress-board ready line
The readiness probe grepped for "S3 (gateway|service).*(started|ready)", which matched weed mini's old "S3 service is ready at ..." line. Mini now emits " S3 ready (Xs)" from its progress board, so the old pattern misses and the test timed out at the 30-second wait. Widen the alternation to also accept "S3\s+ready". The curl HEAD fallback already covers any remaining cases. |
||
|
|
b4289abb0a |
admin: convert filer address to gRPC form before dispatch (#9523)
The master returns each registered filer in pb.ServerAddress dual-port form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's plugin context builder forwarded that string verbatim as filer_grpc_address, so workers calling grpc.DialContext on it failed every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port". Run each entry through pb.ServerAddress.ToGrpcAddress before populating ClusterContext.FilerGrpcAddresses. The lifecycle integration test now pins filer.port.grpc to a value that breaks the FILER_PORT+10000 assumption, and a new dispatch test drives the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and asserts the dispatched job both reaches the filer and deletes the backdated object. |
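To make the address shapes concrete, here is a standalone sketch of converting the dual-port form into a dialable gRPC address. It is not the pb.ServerAddress.ToGrpcAddress implementation; `grpcAddress` is a hypothetical helper, and the fallback of adding 10000 to the HTTP port when no gRPC port is given is an assumption based on the port convention the commit message mentions.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// grpcAddress converts "host:httpPort.grpcPort" into "host:grpcPort".
// When no explicit gRPC port is present it assumes httpPort+10000.
func grpcAddress(addr string) (string, error) {
	host, ports, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	if _, grpcPort, ok := strings.Cut(ports, "."); ok {
		if _, err := strconv.Atoi(grpcPort); err != nil {
			return "", fmt.Errorf("invalid gRPC port in %q: %w", addr, err)
		}
		return net.JoinHostPort(host, grpcPort), nil
	}
	httpPort, err := strconv.Atoi(ports)
	if err != nil {
		return "", fmt.Errorf("invalid port in %q: %w", addr, err)
	}
	return net.JoinHostPort(host, strconv.Itoa(httpPort+10000)), nil
}

func main() {
	for _, a := range []string{"10.0.0.1:8888.18888", "10.0.0.1:8888.19999", "10.0.0.1:8888"} {
		g, err := grpcAddress(a)
		fmt.Println(a, "->", g, err)
	}
}
```

Passing the dual-port string straight to a dialer fails because "8888.18888" is not a valid port, which matches the "unknown port" error quoted in the commit.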
||
|
|
2ed95d7ea9 |
helm: decouple JWT signing from cert-manager mTLS (fixes #9506) (#9508)
* helm(security): decouple JWT signing from cert-manager mTLS
The filer needs jwt.filer_signing.key to register the IAM gRPC service the Admin UI Users tab calls (PR #9442). The chart only rendered security.toml under enableSecurity, which also pulls in cert-manager for mTLS — much heavier than the Admin UI needs. Operators on Helm without cert-manager have no way to flip the JWT key on, so the Users tab fails with Unimplemented after upgrading past 4.24.
Introduce seaweedfs.securityConfigEnabled, true when enableSecurity OR any explicit jwtSigning toggle (volumeRead/filerWrite/filerRead) is set. The configmap renders under that helper; the [grpc.*]/[https.*] sections inside stay gated on enableSecurity. Each pod template splits the security-config mount onto the helper and keeps the cert volume mounts on enableSecurity. volumeWrite is intentionally excluded from the helper trigger because it defaults to true; including it would silently start mounting security.toml on every fresh install.
With this change, enableSecurity=false + defaults renders nothing (unchanged), enableSecurity=true renders the full toml (unchanged), and enableSecurity=false + filerWrite=true renders just the [jwt.*] sections so the Admin UI works without mTLS. Fixes #9506.
* helm(security): trim verbose comments
* helm(security): handle null securityConfig in helper
Address review feedback: (.Values.global.seaweedfs.securityConfig).jwtSigning errored if a user explicitly set securityConfig: null in their values. Drop into intermediate $sec/$jwt with default dict at each step so a missing or nulled-out parent is tolerated.
* helm(ci): cover IAM gRPC decoupling (issue #9506)
Five regression assertions exercised against the rendered chart so a future change cannot silently re-couple jwt.filer_signing to mTLS:
1. defaults render no security-config ConfigMap (preserves baseline)
2. filerWrite=true alone renders [jwt.filer_signing] with no [grpc.*]
3. filerWrite=true mounts security-config on filer + admin without pulling in cert volumes — the actual fix for the Admin UI Users tab
4. enableSecurity=true still produces the full toml with [grpc.master]
5. securityConfig=null and securityConfig.jwtSigning=null both render cleanly (gemini-code-assist review nit, applied chart-wide)
Patch a pre-existing direct-access in filer-statefulset.yaml that crashed on securityConfig=null, surfaced by the new null assertion.
* helm(ci): drop issue numbers from comments
* helm(ci): install pyyaml; assert [jwt.signing] in mTLS path
Address coderabbit review:
- The new IAM gRPC test block uses `import yaml` but ran before the later `pip install pyyaml -q` step that the security+S3 block performs. CI happens to pass because the runner image carries PyYAML, but make the dependency explicit so a future runner change cannot silently break the regression test.
- The enableSecurity=true assertion only checked for [grpc.master]. Also assert [jwt.signing] so a refactor that drops the volume-side JWT stanza from the mTLS path fails the test instead of slipping through. |
||
|
|
e56a3ee4a2 |
ci(s3-lifecycle): split into per-test matrix jobs
Each test now runs against a fresh `weed mini`, so per-collection TTL volume budget no longer leaks across tests and exhausts the pool. |
||
|
|
db2d975b80 |
ci(docker): tag latest in unified release instead of rebuilding (#9500)
The separate container_latest.yml workflow rebuilt the latest image from scratch on every tag push (full multi-arch build + QEMU + trivy gate), which is slow and frequently fails — leaving `latest` stranded on the prior release (e.g. 4.23 after 4.24 shipped, #9497). Drop the rebuild. The unified release workflow already publishes the exact same content as `<tag>` and `<tag>_large_disk`, so just re-tag those manifests with `crane tag` on both GHCR and Docker Hub once copy-to-dockerhub completes. Seconds, not hours, and no QEMU. Move the trivy scan into the unified workflow as report-only: SARIF still uploads to GitHub Security for visibility, but vuln findings no longer block the release. container_latest.yml stays as a workflow_dispatch-only manual fallback. Refs #9497. |
||
|
|
91bcc910eb |
build(deps): bump actions/dependency-review-action from 4.9.0 to 5.0.0 (#9450)
Bumps [actions/dependency-review-action](https://github.com/actions/dependency-review-action) from 4.9.0 to 5.0.0.
- [Release notes](https://github.com/actions/dependency-review-action/releases)
- [Commits](https://github.com/actions/dependency-review-action/compare/2031cfc080254a8a887f58cffee85186f0e49e48...a1d282b36b6f3519aa1f3fc636f609c47dddb294)
---
updated-dependencies:
- dependency-name: actions/dependency-review-action
  dependency-version: 5.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
05d31a04b6 |
fix(s3tests): wire lifecycle worker for expiration suite (#9374)
* fix(s3tests): wire lifecycle worker for expiration suite
The upstream s3-tests `test_lifecycle_expiration` / `test_lifecyclev2_expiration`
exercise the "set rule, wait, verify deletion" path. Phase 4 (#9367) intentionally
stripped the PUT-time back-stamp, so pre-existing objects no longer pick up TtlSec
on a freshly-applied rule. The s3tests CI bare-bones `weed -s3` had nothing left
driving expiration.
Three changes that work together:
- Engine scales `Days` by `util.LifeCycleInterval`. Production keeps the 24h day;
the `s3tests` build tag shrinks it to 10s so a `Days: 1` rule completes inside
the suite's 30s polling window. Exported `DaysToDuration` so sibling-package
tests pin to the same scale.
- Scheduler/dispatcher tick defaults split into `_default` / `_s3tests` files.
Production stays 5s/30s/5m; the test build runs at 500ms/2s/2s so deletions
land within a couple ticks of becoming due.
- s3tests.yml spawns `weed shell s3.lifecycle.run-shard -shards 0-15 -events 0
-runtime 1800s` alongside the s3 server in both the basic and SQL blocks; the
shell command runs the full pipeline (reader + scheduler + dispatcher) for the
duration of the suite. `test_lifecycle_expiration_versioning_enabled` is left
out for now — versioned-bucket expiration via the worker still needs its own
pass.
Drive-by: bump `TestWorkerDefaultJobTypes` to 7 to match the registered
handler count (
|
||
|
|
85abf3ca88 |
feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch
Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume.
Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence.
Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket.
* test(s3/lifecycle): integration test driving the run-shard shell command
Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains.
The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end.
Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function.
* fix(shell/s3.lifecycle.run-shard): address review findings
- Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short.
- Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong.
* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription
Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline.
- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount.
Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop.
* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil
The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error).
* fix(s3/lifecycle): schedule from ModTime not eventTime
A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch.
* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards
The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline.
- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves.
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals).
* test(s3/lifecycle): pass -runtime 10s to run-shard
Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards.
* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg
The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it.
* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS
Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.
* fix(s3/lifecycle): identity HeadFid via GetFileIdString
Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.
* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit
errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly.
* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard
run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done.
* chore(test/s3/lifecycle): drop emoji from Makefile output
* docs(test/s3/lifecycle): correct '-shards 0-15' wording
* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run
The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99.
* fix(s3/lifecycle): bound drain + final-save with their own timeouts
Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning.
* fix(test/s3/lifecycle): drop self-killing pkill in stop-server
The pkill pattern "weed mini -dir=..." is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own.
* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var |
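The `-shards "lo-hi"` or `"a,b,c"` syntax and the out-of-range rejection described in these commits can be sketched with a small parser. This is not the shell's parseShardsSpec; `parseShards` is a hypothetical helper, and the shard count of 16 is inferred from the "-shards 0-15" and "all 16 shards" wording.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// shardCount is an assumption taken from the "-shards 0-15" examples.
const shardCount = 16

// parseShards accepts either a range "lo-hi" or a comma-separated set "a,b,c"
// and rejects anything outside [0, shardCount). Duplicates are not removed.
func parseShards(spec string) ([]int, error) {
	spec = strings.TrimSpace(spec)
	if lo, hi, ok := strings.Cut(spec, "-"); ok {
		l, errLo := strconv.Atoi(strings.TrimSpace(lo))
		h, errHi := strconv.Atoi(strings.TrimSpace(hi))
		if errLo != nil || errHi != nil || l > h {
			return nil, fmt.Errorf("invalid shard range %q", spec)
		}
		if l < 0 || h >= shardCount {
			return nil, fmt.Errorf("shard range %q outside 0-%d", spec, shardCount-1)
		}
		shards := make([]int, 0, h-l+1)
		for i := l; i <= h; i++ {
			shards = append(shards, i)
		}
		return shards, nil
	}
	var shards []int
	for _, part := range strings.Split(spec, ",") {
		n, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil || n < 0 || n >= shardCount {
			return nil, fmt.Errorf("invalid shard %q", part)
		}
		shards = append(shards, n)
	}
	return shards, nil
}

func main() {
	for _, spec := range []string{"0-15", "1,3,5", "-1", "99"} {
		shards, err := parseShards(spec)
		fmt.Printf("%q -> %v, err=%v\n", spec, shards, err)
	}
}
```

Validating in one place like this is what lets a programmatic caller fail loudly on -1 or 99 rather than ending up in a silent no-op state.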
||
|
|
22ebe9feb0 |
ci(e2e): switch FUSE Mount build to Azure Ubuntu mirror, persist buildx cache
archive.ubuntu.com from GitHub-hosted runners has been Ign:/retrying for ~60s per package, eating the Start SeaweedFS step's 10-min budget before apt-get install finishes. The host already uses azure.archive.ubuntu.com; do the same inside Dockerfile.e2e and drop the Retries=5 amplifier. Also rotate /tmp/.buildx-cache-new over /tmp/.buildx-cache so the apt layer actually survives across runs, and bump the step to 15 minutes as a safety margin. |
||
|
|
a769c938ec |
test(s3tables): Unity Catalog OSS integration tests against SeaweedFS (#9308)
* test(s3tables): add Unity Catalog OSS integration test against SeaweedFS Mirrors the configuration used by the upstream playground at data-engineering-helpers/mds-in-a-box/unitycatalog-playground. Three test variants under test/s3tables/unity_catalog: - TestUnityCatalogDeltaIntegration: aws.masterRoleArn empty / static keys; catalog/schema/EXTERNAL Delta CRUD + temporary-table-credentials S3 round-trip (the playground's working configuration). - TestUnityCatalogMasterRoleIntegration: aws.masterRoleArn set to a SeaweedFS-side role with a permissive trust policy; UC's StsClient is pinned at SeaweedFS via AWS_ENDPOINT_URL_STS, and the test asserts the vended creds carry a session_token and a non-static access key, proving the role-vended path the playground notes as not-yet-working actually does work today. - TestUnityCatalogDeltaRsRoundTrip: writes/reads a real Delta table at the registered storage_location using delta-rs in a slim Python container, with temporary credentials fetched from UC. All three self-skip without Docker or a weed binary, matching the sibling lakekeeper / polaris tests. * test(s3tables): tighten Unity Catalog tests against actual UC OSS behavior After running the suite locally, ground the assertions in what the upstream UC OSS Docker image actually does against SeaweedFS today. - Static-key playground configuration (TestUnityCatalogDeltaIntegration): catalog/schema/EXTERNAL Delta CRUD pass against the SeaweedFS-backed warehouse. The temporary-table- credentials subtest is renamed and inverted to assert the failure mode the playground reports -- UC's AwsCredentialVendor falls through to an internal StsClient.assumeRole when masterRoleArn and sessionToken are both empty, which has no real STS to talk to. Bucket path is also fixed to match UC's getStorageBase() lookup (s3://lakehouse vs the playground's s3://lakehouse/warehouse, which the upstream code never matches). 
- Master-role variant (TestUnityCatalogMasterRoleIntegration): split into two passing slices. Slice 1 proves SeaweedFS' STS endpoint vending UnityCatalogVendedRole works via the Go AWS SDK and the vended creds round-trip on S3. Slice 2 boots UC with aws.masterRoleArn set and verifies catalog/schema/Delta CRUD. The third hop -- UC's Java StsClient actually reaching SeaweedFS' STS handler during /temporary-table-credentials -- is logged but not asserted, since the AWS Java SDK's STS request currently lands on a SeaweedFS S3 path rather than the STS handler. - Delta-RS round-trip (TestUnityCatalogDeltaRsRoundTrip): gated on UC_DELTA_RS_RUN=1 since it depends on the master-role STS handoff above. The Dockerfile / writer script stay in tree so the test runs end-to-end the moment that hop is fixed. README rewritten to be explicit about what each test validates today and what is still pending. Result: `go test -run TestUnityCatalog ./test/s3tables/unity_catalog/...` passes cleanly with weed + Docker available, and self-skips otherwise. * test(s3tables): exercise unity catalog integrations * ci: run Unity Catalog integration tests on PRs Adds a unity-catalog-integration-tests job to s3-tables-tests.yml, modeled on the existing lakekeeper / dremio jobs. Pre-pulls the UC image and python:3.11-slim (used by the delta-rs writer container) and runs `go test ./test/s3tables/unity_catalog`. Format-check and go-vet jobs already recurse into ./test/s3tables/... so the new package is covered there too. * test/ci: address PR review Tighten the UC readiness probe to require 200, not <500, so a 401/403/404 during startup surfaces immediately instead of being treated as ready (CodeRabbit). Pin the UC image to v0.4.0 in both the workflow and the test default, matching the pinned-tag convention the rest of s3-tables-tests.yml uses (CodeRabbit). Use UC_IMAGE=unitycatalog/unitycatalog:main to re-test against current upstream. 
* docs: separate UC static-key vs master-role failure modes The README mixed the two together. Static-key empty-sessionToken short-circuits with "S3 bucket configuration not found." before UC even fires an STS call; the AccessDenied I described is what happens in the master-role variant where UC's Java StsClient actually reaches SeaweedFS. Cross-link the playground PR that fixes the static-key vending side. Also drop the "what most playground users actually run" hand-wave under MANAGED tables. * docs: trim README Drop the playground cross-reference and the "two layers fail independently" framing. * docs: pin down what's actually pending Investigated the master-role STS handoff with a sniffer in front of SeaweedFS' STS port. UC's StsClient is constructed without an endpointOverride and never reads aws.endpoint or AWS_ENDPOINT_URL_STS; verified by pointing AWS_ENDPOINT_URL_STS at port 1 and seeing the same real-AWS InvalidClientTokenId 403 with zero traffic to SeaweedFS. The fix is upstream in UC. Updated the README and the master-role test's t.Logf to say so precisely, and dropped the stale "Spark client" bullet (delta-rs covers that path). * test(s3tables): use BaseEndpoint instead of deprecated resolver EndpointResolverWithOptions is deprecated in aws-sdk-go-v2; the supported way to override a service endpoint is via the per-service Options.BaseEndpoint. Switch the assume-role helper to that pattern so the test stops compiling against deprecated API and the resolver boilerplate disappears. Addresses gemini review on PR #9308. * test(s3tables): drop unused splitS3URI helper Helper had no callers; gemini caught it on PR #9308. Easy to bring back from git history if needed. * test(s3tables): extract last token of docker run output as container ID docker run -d may prefix the container ID with image-pull progress when the image isn't cached locally. strings.TrimSpace on the whole output then gave a multi-line string, not the ID. 
Take the last whitespace-separated token so the ID survives a fresh CI runner. Addresses gemini review on PR #9308. * test(s3tables): cap Unity Catalog response body reads at 10 MiB io.ReadAll without a limit could OOM the test runner if the UC container hands back an unexpectedly large body. 10 MiB is well above any well-formed catalog response and turns a misbehaving server into a test failure instead of a runner crash. Addresses gemini review on PR #9308. * docs: link UC fix PR and call out UC's mocked-Sts test pattern UC's own credential-vending tests substitute StsClient with an in-process EchoAwsStsClient (BaseCRUDTestWithMockCredentials) or Mockito.mockStatic (CloudCredentialVendorTest), so the wire path between UC's Java SDK and a real STS server is untested -- which is why the missing endpointOverride slipped through upstream. Linked the upstream fix at unitycatalog/unitycatalog#1532. |
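The container-ID extraction described above can be sketched as follows. This is a minimal illustration, not the test's actual helper: the function name `containerIDFromRunOutput` is hypothetical, and the sample output assumes the pull-progress lines and the ID arrive in the same captured stream.

```go
package main

import (
	"fmt"
	"strings"
)

// containerIDFromRunOutput (hypothetical name) returns the last
// whitespace-separated token of `docker run -d` output. When the image is
// not cached, pull-progress lines may precede the ID, so trimming the whole
// output is not enough.
func containerIDFromRunOutput(out string) (string, error) {
	fields := strings.Fields(out)
	if len(fields) == 0 {
		return "", fmt.Errorf("docker run produced no output")
	}
	return fields[len(fields)-1], nil
}

func main() {
	// Assumed sample: pull progress followed by the container ID.
	out := "Unable to find image 'example:latest' locally\n" +
		"latest: Pulling from example\n" +
		"0123abcd\n"
	id, err := containerIDFromRunOutput(out)
	fmt.Println(id, err)
}
```

Returning an error on empty output keeps a failed `docker run` from silently yielding an empty container ID.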
||
|
|
1de741737d |
test(s3tables): add Apache Doris Iceberg catalog integration test (#9307)
* test(s3tables): add Apache Doris Iceberg catalog integration test
Adds an end-to-end smoke test that boots the apache/doris all-in-one
container, registers SeaweedFS as an external Iceberg REST catalog
(OAuth2 client_credentials), and validates metadata visibility plus the
parquet read path against tables seeded via the Iceberg REST API and a
PyIceberg writer container, mirroring the existing Trino, Spark, and
Dremio coverage. Wires the test into a new s3-tables-tests workflow job.
* test(s3tables): document weed shell -master flag format and fill in helper docstrings
Restores the explanatory comment on createTableBucket about the
host:port.grpcPort ServerAddress format used by `weed shell -master`
(produced by pb.NewServerAddress) so the dot separator isn't mistaken
for a typo, and adds doc comments for createIcebergNamespace,
createIcebergTable, doIcebergJSONRequest, requireDorisRuntime, and
hasDocker. |
||
|
|
fc75f16c30 |
test(s3tables): expand Dremio Iceberg catalog test coverage (#9303)
* test(s3tables): expand Dremio Iceberg catalog test coverage
Restructure TestDremioIcebergCatalog into subtests and add three new
checks that go beyond a connectivity smoke test:
- ColumnProjection: SELECT id, label proves Dremio parsed the schema
served by the SeaweedFS REST catalog (the previous SELECT COUNT(*)
passed without exercising any column metadata).
- InformationSchemaColumns: verifies the table's columns are listed in
Dremio's INFORMATION_SCHEMA.COLUMNS in the expected ordinal order.
- InformationSchemaTables: verifies the table is registered in
INFORMATION_SCHEMA.TABLES.
All subtests share a single Dremio container startup, so total
runtime is unchanged.
* test(s3tables): exercise multi-level Iceberg namespaces from Dremio
Seed a 2-level Iceberg namespace (and a table inside it) via the REST
catalog before bootstrapping Dremio, then add a MultiLevelNamespace
subtest that scans the nested table by its dot-separated reference.
This relies on isRecursiveAllowedNamespaces=true (already set in the
Dremio source config) to surface the nested levels as folders. A
regression in either the SeaweedFS namespace path encoding (#8959-style)
or Dremio's recursive-namespace discovery would surface here.
Adds two helpers to keep the existing single-level call sites unchanged:
- createIcebergNamespaceLevels: namespace creation with []string levels
- createIcebergTableInLevels: table creation with []string levels and
unit-separator (0x1F) URL encoding for the namespace path component
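As a rough sketch of the unit-separator encoding mentioned for createIcebergTableInLevels: the levels are joined with 0x1F and the result is percent-encoded for use as a single URL path component. The function name `encodeNamespacePath` is hypothetical and the real helper may differ in detail.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// namespaceSeparator is the ASCII unit separator (0x1F) used to join
// multi-level namespace parts into one path component.
const namespaceSeparator = "\x1f"

// encodeNamespacePath (hypothetical name) joins the namespace levels with
// 0x1F and percent-encodes the result so it is safe inside a URL path.
func encodeNamespacePath(levels []string) string {
	return url.PathEscape(strings.Join(levels, namespaceSeparator))
}

func main() {
	// The namespace names here are made up for illustration.
	fmt.Println(encodeNamespacePath([]string{"analytics", "daily"}))
	fmt.Println(encodeNamespacePath([]string{"analytics"}))
}
```

A single-level namespace passes through unchanged, which is why the existing single-level call sites do not need the new helper.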
* test(s3tables): verify Dremio reads PyIceberg-written rows
The previous Dremio subtests only scanned empty tables, so they did not
exercise the data path - just the catalog/metadata path. Add a
PyIceberg-based writer that materializes parquet files plus a snapshot
on a separate table before Dremio bootstraps, and two new subtests:
- ReadWrittenDataCount: SELECT COUNT(*) returns 3.
- ReadWrittenDataValues: SELECT id, label ORDER BY id returns the three
written rows with the expected (id, label) pairs.
The writer runs in a small image (Dockerfile.writer) built locally on
demand. It pip-installs pyiceberg+pyarrow once and reuses the layer
cache on subsequent runs. The CI workflow pre-pulls python:3.11-slim
to keep cold runs predictable.
The writer authenticates via the OAuth2 client_credentials flow that
SeaweedFS already exposes at /v1/oauth/tokens, mirroring the Go-side
helper used for REST-API table creation.
* test(s3tables): fix Dremio writer required-field schema mismatch
PyIceberg's append() compatibility check rejects an arrow column whose
nullability does not match the Iceberg field. The table schema declares
id as `required long`, but the default pyarrow int64 column is nullable,
so the writer failed with:
1: id: required long vs. 1: id: optional long
Declare an explicit pyarrow schema with nullable=False on id and
nullable=True on label to match the Iceberg side.
|
||
|
|
b2f4ebb776 |
test(s3tables): add Dremio Iceberg catalog integration tests (#9299)
* test(s3tables): add Dremio Iceberg catalog integration tests
Add comprehensive integration tests for Dremio with SeaweedFS's Iceberg
REST Catalog, following the same patterns as existing Spark and Trino tests.
Tests include:
- Basic catalog connectivity and schema operations
- Table creation, insertion, and querying (CRUD)
- Deterministic table location specification
- Multi-level namespace support
Implementation includes:
- dremio_catalog_test.go: Core test environment and basic operations
- dremio_crud_operations_test.go: Schema and table CRUD testing
- dremio_deterministic_location_test.go: Location and namespace testing
- Comprehensive README and implementation documentation
CI/CD:
- Added dremio-iceberg-catalog-tests job to s3-tables-tests.yml
- Pre-pulls Dremio image, runs with 25m timeout
- Uploads artifacts on failure
* add docstrings to Dremio integration tests and fix CI image pre-pull
- Add function docstrings to all test functions and helper functions
in dremio_catalog_test.go, dremio_crud_operations_test.go, and
dremio_deterministic_location_test.go to improve code documentation
and satisfy CodeRabbit's docstring coverage requirements.
- Make Dremio Docker image pre-pull non-critical in CI workflow.
The pre-pull was failing with access denied error, but the image
can still be pulled at runtime. Using continue-on-error to allow
tests to proceed.
* fix: correct YAML syntax in Dremio CI workflow
Use a multi-line run command in a YAML block scalar (|) instead of
an inline command with the || operator to avoid YAML parsing errors.
The || operator was causing 'mapping values are not allowed here'
syntax errors in the YAML parser.
* make Dremio tests gracefully skip if container unavailable
Modify startDremioContainer and waitForDremio to return boolean values
instead of failing the test fatally. Tests now skip gracefully if:
- Dremio Docker image is unavailable
- Container fails to start
- Container doesn't become ready within timeout
This prevents CI failure when Dremio image is not accessible while
still testing the integration when it is available.
* Revert "make Dremio tests gracefully skip if container unavailable"
This reverts commit
|
||
|
|
9b624a73fe |
ci: provide a Docker tag for foundationdb release container on workflow_dispatch
The metadata-action used type=ref,event=tag, which produces no tag on workflow_dispatch, causing build-push to fail with "tag is needed when pushing to registry". Add a release_tag input and build the tag from a RELEASE_TAG env, mirroring container_release_unified.yml. |
||
|
|
1da091f798 |
ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up)
Six integration test packages had _test.go files but no GitHub workflow
running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy
bug (and the four cross-SSE copy bugs in #9281) ship undetected was an
instance of this same pattern.
This change wires three of them into CI and removes a fourth that was
dead code:
test/multi_master/ NEW workflow: multi-master-tests.yml
- 3-node master raft cluster failover/recovery (5 tests, ~65s)
test/testutil/ (run alongside multi_master)
- port-allocator regression test
test/s3/etag/ NEW workflow: s3-etag-acl-tests.yml
- PutObject ETag format regression for #7768 (must be pure MD5 hex,
not "<md5>-N" composite, for AWS Java SDK v2 compatibility)
test/s3/acl/ (same workflow as etag)
- object-ACL behavior on versioned buckets
test/s3/catalog_trino/ DELETED (dead code)
- Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go
from a 2024 commit that was never iterated, while the test/s3tables/
counterpart has been actively maintained (and IS in CI via
s3-tables-tests.yml's trino-iceberg-catalog-tests job).
Both workflows trigger only on changes to relevant code paths and use
the existing simple "build weed → run go test" pattern (no per-test-dir
Makefile boilerplate). The S3 workflow starts a single `weed mini`
shared by etag and acl, which keeps the job under 2 minutes on a fresh
runner.
Two tests remain knowingly uncovered:
test/s3/basic/: order-dependent state across tests (TestListObjectV2
expects a bucket created by an earlier test, etc.) and uses the
deprecated aws-sdk-go v1. Treated as sample programs, not a regression
suite. Fixing them is out of scope for this PR.
test/s3/catalog_trino/: see "DELETED" above.
Verified locally:
- go test -v -timeout=8m ./test/multi_master/... ./test/testutil/...
PASS (5 multi_master + 1 testutil tests, 64s)
- weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/...
PASS (8 etag + 5 acl tests, ~6s after server startup)
* ci: fix log-collector glob for multi-master tests (review feedback on #9283)
test/multi_master/cluster.go creates per-test temp dirs via
os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to match
that prefix. The previous version looked for MasterCluster* /
TestLeader* / TestTwoMasters* / TestAllMasters*, which never match, so
the failure-artifact upload would have been empty on a real failure.
Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it
actually picks up the per-node master*.log files under <baseDir>/logs/.
Found by coderabbitai review on PR #9283. |
||
|
|
1f515f9d02 |
fix(s3api): cross-SSE copy operations and bring them back into CI (#9281) (#9282)
* fix(s3api): cross-SSE copy operations and bring them back into CI (#9281) Four cross-SSE copy tests were broken on master and excluded from CI with the comment "pre-existing SSE-C issues": - TestSSECObjectCopyIntegration/Copy_SSE-C_to_SSE-C_with_different_key - TestSSEKMSObjectCopyIntegration/Copy_SSE-KMS_with_different_key - TestCrossSSECopy/SSE-S3_to_SSE-C - TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object Each surfaced as a different symptom — 500 InternalError, CRC32 mismatch, "unexpected EOF", MD5 mismatch — but they were all instances of the same root pattern that #8908 hit on UploadPartCopy: copy paths writing destination chunks tagged inconsistently with the bytes on disk, so detectPrimarySSEType / IsSSE*Encrypted disagreed about what the read path should do. Five fixes in this PR, each with its own targeted test: 1. SSE-C IV format: putToFiler stored entry.Extended[SeaweedFSSSEIV] as raw bytes (with a comment saying so), but StoreSSECIVInMetadata stored it base64-encoded. The two readers (the GET handler reading it raw, and GetSSECIVFromMetadata reading it base64-decoded) each matched one writer but not the other. Standardise on raw bytes everywhere; GetSSECIVFromMetadata accepts the legacy base64 form for backward compat. 2. SSE-C single-part copy chunk tagging: copyChunkWithReencryption re-encrypted the bytes for the destination but never set the destination chunk's SseType / SseMetadata. With chunks left SseType=NONE, detectPrimarySSEType returned "None" and the GET served still-encrypted volume bytes raw without decryption. Tag the chunk after re-encryption. 3. SSE-KMS single-part copy chunk tagging: same shape as (2). 
Also, the function discarded the destSSEKey returned from CreateSSEKMSEncryptedReaderWithBucketKey (with `_`) — that key carries the freshly-minted EncryptedDataKey + IV the read path needs, so it must be captured and serialized into the destination chunk's per-chunk metadata (and bubbled up to the entry-level SeaweedFSSSEKMSKey for single-chunk objects whose read path falls back to the entry-level key). 4. SSE-KMS multipart source decryption: copyChunkWithSSEKMSReencryption decrypted every source chunk with the entry-level sourceSSEKey. For multipart SSE-KMS objects each chunk has its own EDK + IV in per-chunk metadata, so the entry-level key is wrong. Decrypt with per-chunk metadata when present. 5. Same-key copy fast path chunk tagging: copySingleChunk uses createDestinationChunk which dropped SseType / SseMetadata. For same-key copies (e.g. SSE-KMS source → SSE-KMS dest with the same KMS key) the fast path reuses the source ciphertext as-is, so the destination chunks must keep the source's SSE tagging. Add a createDestinationChunkPreservingSSE helper for the fast path; the re-encryption paths still call createDestinationChunk and then overwrite the SSE fields after re-encrypting. CI: extend the comprehensive-test TEST_PATTERN to include the four test families that were previously excluded (`.*ObjectCopyIntegration`, `TestCrossSSECopy`, `TestSSEMultipartCopy`) so this category of regression is caught going forward. The exclusion comment is removed. Tests: - All four originally-failing tests pass. - The full pre-existing TestSSE* / TestCrossSSE / TestGitHub7562 / TestCopyToBucketDefaultEncryptedRegression / TestSSEMultipart suite still passes. - go test -race ./weed/s3api/ passes. Refs #8908, #9280. * fix(s3api): SSE-KMS copy ChunkOffset must stay 0 (review feedback on #9282) CreateSSEKMSEncryptedReaderWithBucketKey initialises a fresh CTR stream at counter 0 with a per-chunk random IV — there is no base-IV-plus-offset relationship. 
The previous commit on this branch wrote `destSSEKey.ChunkOffset = chunk.Offset` onto the per-chunk metadata, which the read-side CreateSSEKMSDecryptedReader applies as calculateIVWithOffset(IV, ChunkOffset) — i.e. it advances the decryption IV by chunk.Offset/16 blocks beyond where the encryption actually wrote. The bug only manifests for SSE-KMS-to-SSE-KMS-with-different-key copies of multipart sources (where source chunks live at non-zero offsets), which is why the existing TestSSEKMSObjectCopyIntegration (single-chunk source) and TestSSEMultipartCopy/Copy_SSE-KMS_Multipart_Object (same-key copy that takes the fast preserving path, not the re-encrypt path) both happened to pass. Set ChunkOffset to 0 to match the actual encryption position. Existing tests still pass; the dangerous case is only reachable with a multipart SSE-KMS source and a different destination key, which is not currently exercised in CI. Found by gemini-code-assist review on PR #9282. * fix(s3api): use first dst chunk's full key for entry-level SSE-KMS metadata in remaining copy paths (review feedback on #9282) Earlier this branch fixed copyChunksWithSSEKMSReencryption to populate the entry-level SeaweedFSSSEKMSKey from the first destination chunk's fully-formed metadata (with EDK + IV) instead of a stub key with only KeyID + EncryptionContext + BucketKeyEnabled. The same fix needs to apply to the other two paths that build entry-level SSE-KMS metadata: - copyMultipartCrossEncryption() — cross-encryption to SSE-KMS dest. Per-chunk metadata comes from copyCrossEncryptionChunk's CreateSSEKMSEncryptedReaderWithBucketKey call, so chunks[0] has a real EDK + IV. Use it. - copyChunksWithSSEKMS() direct (same-key) branch. After createDestinationChunkPreservingSSE in copySingleChunk, dst chunks carry the source's per-chunk SSE-KMS metadata. Use chunks[0] for the entry-level key so single-chunk same-key copies don't fall back to a stub key on the read path. 
Without this, single-chunk SSE-KMS reads through these two paths failed at GET with "Invalid ciphertext format" — KMS unwrap was called on an empty EDK. Found by coderabbitai review on PR #9282. * fix(s3api): add 0-byte fallback to SSE-KMS reencryption entry-level metadata (review feedback on #9282) copyChunksWithSSEKMSReencryption was missing the fallback for 0-byte objects (where dstChunks is empty), inconsistent with the fallback in copyChunksWithSSEKMS direct branch and copyMultipartCrossEncryption. Without it, a 0-byte SSE-KMS copy would land with no entry-level SeaweedFSSSEKMSKey, so the read path's IsSSEKMSEncryptedInternal check would not recognise the empty object as SSE-KMS. Mirror the existing fallback: build a stub SSEKMSKey with KeyID, context and bucket-key state; serialize it as the entry-level key. Found by gemini-code-assist review on PR #9282. * fix(s3api): SSE-KMS direct copy must check encryption context + bucket-key, not just key ID (review feedback on #9282) DetermineSSEKMSCopyStrategy / CanDirectCopySSEKMS only compares the source and destination KMS key IDs, but the destination request can also change the encryption context or the BucketKey flag. Both are embedded in the source ciphertext's wrapped EDK; preserving the source metadata verbatim does not satisfy a destination request that asks for different settings, so the destination object would silently report the source's context/flag instead of what was requested. Add srcSSEKMSStateMatchesDest: deserialize the source's stored SSEKMSKey and compare its EncryptionContext + BucketKeyEnabled to the destination request. If either differs, force the slow re-encrypt path (SSEKMSCopyStrategyDecryptEncrypt) so the destination gets a freshly-wrapped EDK bound to the requested context/flag. A malformed source key is treated as non-matching (conservative). nil and empty encryption-context maps are treated as equal to avoid spurious divergence when the request omits the context header. 
Found by coderabbitai review on PR #9282. * fix(s3api): copyMultipartSSEKMSChunk falls back to entry-level key + entry-level metadata uses first chunk's full key (review feedback on #9282) Two related issues in copyMultipartSSEKMSChunks / copyMultipartSSEKMSChunk: 1. copyMultipartSSEKMSChunks built the destination's entry-level SeaweedFSSSEKMSKey from a stub (KeyID + context + bucket-key only), missing the EDK + IV. Single-chunk reads through this path fall back to entry-level keyData and would fail at GET because KMS would be asked to unwrap an empty EDK. Mirrors the fix in copyChunksWithSSEKMS / copyMultipartCrossEncryption / copyChunksWithSSEKMSReencryption: prefer the first dst chunk's full per-chunk metadata, fall back to the stub only for 0-byte objects. 2. copyMultipartSSEKMSChunk hard-failed when chunk.GetSseMetadata() was empty. Newer multipart SSE-KMS uploads populate per-chunk metadata, but legacy objects may have only entry-level metadata and would now be impossible to copy. Add a sourceEntrySSEKey fallback parameter (deserialized once by the caller from entry.Extended[SeaweedFSSSEKMSKey]); use it when per-chunk metadata is absent. Found by coderabbitai review on PR #9282. * refactor(s3api): extract entry-level SSE-KMS deserialization and per-chunk fallback into helpers (review feedback on #9282) Three medium-priority maintainability comments from gemini-code-assist: - The same "deserialize entry.Extended[SeaweedFSSSEKMSKey]" pattern appeared in srcSSEKMSStateMatchesDest, copyMultipartSSEKMSChunks and copyChunksWithSSEKMSReencryption. - The "prefer per-chunk metadata, fall back to entry-level key" selection logic appeared inline in copyMultipartSSEKMSChunk and copyChunkWithSSEKMSReencryption with subtly different shapes. - encryptionContextEqual hand-rolled a map comparison. Pull both patterns out into named helpers: - deserializeEntrySSEKMSKey: returns the entry-level SSEKMSKey or nil on missing/malformed data, with a single V(2) log line. 
- resolveChunkSSEKMSKey: centralises the chunk-vs-entry-level selection so all sites use the same decryption-side selection logic (which must mirror the encryption side). Replace encryptionContextEqual's manual loop with reflect.DeepEqual, keeping the empty-vs-nil shortcut at the top because DeepEqual treats those as different. No behaviour change; existing copy tests still pass. |
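The encryption-context comparison described in the refactor can be sketched like this. It is an illustration of the stated approach (an empty-vs-nil shortcut followed by reflect.DeepEqual), not a copy of the SeaweedFS function; the `map[string]string` parameter type is an assumption.

```go
package main

import (
	"fmt"
	"reflect"
)

// encryptionContextEqual reports whether two encryption contexts match.
// reflect.DeepEqual treats a nil map and an empty map as different, so the
// shortcut at the top makes "no context" compare equal in either form.
func encryptionContextEqual(a, b map[string]string) bool {
	if len(a) == 0 && len(b) == 0 {
		return true
	}
	return reflect.DeepEqual(a, b)
}

func main() {
	// nil versus empty: equal, because the request simply omitted the context.
	fmt.Println(encryptionContextEqual(nil, map[string]string{}))
	// Different values for the same key: not equal, forcing re-encryption.
	fmt.Println(encryptionContextEqual(
		map[string]string{"tenant": "a"},
		map[string]string{"tenant": "b"},
	))
}
```

Without the shortcut, a destination request that omits the context header would spuriously diverge from a source stored with an empty map and take the slow re-encrypt path.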
||
|
|
35fe3c801b |
feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder
The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.
Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.
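The procedure handling described above amounts to a small dispatch table. The sketch below uses the MOUNT v3 procedure numbers from RFC 1813 and the accept_stat values from RFC 5531; the identifier names and the `acceptStatFor` function are hypothetical and do not reflect the responder's actual code.

```go
package main

import "fmt"

// MOUNT v3 procedure numbers (RFC 1813, MOUNT protocol appendix).
const (
	mountProcNull    uint32 = 0
	mountProcMnt     uint32 = 1
	mountProcDump    uint32 = 2
	mountProcUmnt    uint32 = 3
	mountProcUmntAll uint32 = 4
	mountProcExport  uint32 = 5
)

// RPC accept_stat values (RFC 5531).
const (
	acceptSuccess     uint32 = 0
	acceptProcUnavail uint32 = 3
)

// acceptStatFor (hypothetical name) returns the accept_stat a responder
// limited to NULL, MNT, and UMNT would report for a given procedure.
func acceptStatFor(proc uint32) uint32 {
	switch proc {
	case mountProcNull, mountProcMnt, mountProcUmnt:
		return acceptSuccess
	default:
		// DUMP, EXPORT, UMNTALL, and unknown procedures are not served.
		return acceptProcUnavail
	}
}

func main() {
	for _, p := range []uint32{mountProcNull, mountProcMnt, mountProcDump,
		mountProcUmnt, mountProcUmntAll, mountProcExport} {
		fmt.Printf("proc %d -> accept_stat %d\n", p, acceptStatFor(p))
	}
}
```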
This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.
References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.
* feat(nfs): advertise UDP MOUNT v3 in the portmap responder
The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.
Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.
* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener
Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.
The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.
Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.
* docs(nfs): note that the plain mount form works on UDP-default clients
With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.
The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.
* test(nfs): add kernel-mount e2e tests under test/nfs
The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.
Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:
- TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
- TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
only — regression test for the new UDP MOUNT v3 responder.
- TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
asserting the kernel surfaces a protocol/version-level error rather
than a generic "mount system call failed" — regression test for the
PROG_MISMATCH path from #9262.
The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.
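The three skip conditions can be sketched as one precondition check. This is an assumed shape, not the harness's real code: `kernelMountSkipReason` is a hypothetical name, and a real test would pass its result to t.Skip.
```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"runtime"
)

// kernelMountSkipReason (hypothetical name) returns a non-empty reason when
// the kernel-mount tests cannot run, checking the conditions in the order
// listed above: not Linux, no mount.nfs, not root.
func kernelMountSkipReason(goos string, haveMountNFS bool, euid int) string {
	switch {
	case goos != "linux":
		return "kernel NFS mount tests require Linux"
	case !haveMountNFS:
		return "mount.nfs is not installed"
	case euid != 0:
		return "kernel NFS mount tests must run as root"
	}
	return ""
}

func main() {
	_, err := exec.LookPath("mount.nfs")
	reason := kernelMountSkipReason(runtime.GOOS, err == nil, os.Geteuid())
	if reason != "" {
		fmt.Println("skip:", reason)
		return
	}
	fmt.Println("kernel-mount tests can run")
}
```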
Run locally with:
cd test/nfs
sudo go test -v -run TestKernelMount ./...
CI wiring follows in the next commit.
* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow
Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:
- Existing protocol-layer step now skips '^TestKernelMount' so a
"skipped because not root" line doesn't appear on every run.
- New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
name lookups need to resolve `tcp`/`udp`).
- New privileged step runs only the kernel-mount tests under sudo,
preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
caches so the second `go test` invocation reuses already-built
test binaries instead of redownloading modules under root.
The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
|
||
|
|
4d8ddd8ded | build(deps): bump aquasecurity/trivy-action from 0.35.0 to 0.36.0 (#9248) | ||
|
|
76f361fa77 |
fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202) (#9206)
* fix(helm): gate S3 TLS cert args on httpsPort to stop probe failures (#9202)
With `global.seaweedfs.enableSecurity=true` and the default
`s3.httpsPort=0`, the chart was unconditionally passing `-cert.file` /
`-key.file` to the S3 frontend. In `weed/command/s3.go`, when
`tlsPrivateKey != ""` and `portHttps == 0`, the server promotes its main
`-port` (8333 by default) into an HTTPS listener. The pod's readiness /
liveness probes still use `scheme: HTTP`, so every kubelet probe produces
http: TLS handshake error from <node-ip>:<port>: client sent an HTTP request to an HTTPS server
in the pod log, as reported in #9202.
`enableSecurity=true` is supposed to activate security.toml / gRPC mTLS,
not silently flip the S3 HTTP port to HTTPS. Move the
`seaweedfs.s3.tlsArgs` include inside the `if httpsPort` guard in all
three templates that wire up an S3 frontend (standalone S3 deployment,
filer with S3 sub-server, all-in-one deployment). The TLS cert args are
now emitted only when the user explicitly opts into an HTTPS port; the
main `-port` stays HTTP so probes work.
Also add a regression test to `.github/workflows/helm_ci.yml` that
renders all three templates with and without `httpsPort` and asserts the
cert/key/`-port.https` args are emitted together or not at all.
* test(helm): add bash -n parse check to the S3 TLS-gating regression test
Addresses gemini-code-assist review comment on #9206 flagging a
potential "dangling backslash" shell-syntax risk in the rendered
all-in-one command script when httpsPort is set but most S3/SFTP args
are defaulted off. In practice bash -n accepts a trailing
`\<newline><EOF>` (it's line-continuation to an empty line), so no
current rendering is broken. Locking that contract down in CI so a
future helper change that leaves a dangling backslash (or any other
shell-syntax regression in the rendered command) fails loudly instead of
silently shipping broken pods. |
||
|
|
9ae905e456 |
feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin
S3 and filer already use a refreshing pemfile provider for their HTTPS cert, so rotated certificates (e.g. from k8s cert-manager) are picked up without a restart. Master, volume, webdav, and admin, however, passed cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at startup; rotating those certs required a pod restart.
Add a small helper NewReloadingServerCertificate in weed/security that wraps pemfile.Provider and returns a tls.Config.GetCertificate closure, then wire it into the four remaining HTTPS entry points. httpdown now also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates but CertFile/KeyFile are empty, so volume server can pre-populate TLSConfig.
A unit test exercises the rotation path (write cert, rotate on disk, assert the callback returns the new cert) with a short refresh window.
* refactor(security): route filer/s3 HTTPS through the shared cert reloader
Before: filer.go and s3.go each kept a *certprovider.Provider on the options struct plus a duplicated GetCertificateWithUpdate method. Both were loading pemfile themselves. Behaviorally they already reloaded, but the logic was duplicated two ways and neither path was shared with the newly-added master/volume/webdav/admin wiring.
After: both use security.NewReloadingServerCertificate like the other servers. The per-struct certProvider field and GetCertificateWithUpdate method are removed, along with the now-unused certprovider and pemfile imports.
Net: -32 lines, one code path for all HTTPS cert reloading. No behavior change: the refresh window, cache, and handshake contract are identical (the helper wraps the same pemfile.NewProvider).
* feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc
The HTTP client in weed/util/http/client loaded the mTLS client cert once at startup via tls.LoadX509KeyPair. That left every long-lived HTTPS client process (weed mount, backup, filer.copy, filer→volume, s3→filer/volume) unable to pick up a rotated client cert without a restart, even though the same cert-manager setup was already rotating the server side fine.
Swap the client cert loader for a tls.Config.GetClientCertificate callback backed by the same refreshing pemfile provider. New TLS handshakes pick up the rotated cert; in-flight pooled connections keep their old cert and drop as normal transport churn happens.
To keep this reusable from both server and client TLS code without an import cycle (weed/security already imports weed/util/http/client for LoadHTTPClientFromFile), extract the pemfile wrapper into a new weed/security/certreload subpackage. weed/security keeps its thin NewReloadingServerCertificate wrapper. The existing unit test moves with the implementation.
gRPC mTLS was already handled by security.LoadServerTLS / LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ agent, Kafka gateway, and FUSE mount control plane are gRPC-only and therefore already rotate. CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once; this is noted as a known limitation in the wiki.
* fix(security): address PR review feedback on cert reloader
Bots (gemini-code-assist + coderabbit) flagged three real issues and a couple of nits. Addressing them here:
1. KeyMaterial used context.Background(). The grpc pemfile provider's KeyMaterial blocks until material arrives or the context deadline expires; with Background() a slow disk could hang the TLS handshake indefinitely. Switched both the server and client callbacks to use hello.Context() / cri.Context() so a stuck read is bounded by the handshake timeout.
2. Admin server loaded TLS inside the serve goroutine. If the cert was bad, the goroutine returned but startAdminServer kept blocking on <-ctx.Done() with no listener, making the process look healthy with nothing bound. Moved TLS setup to run before the goroutine starts and propagate errors via fmt.Errorf; also captures the provider and defers Close().
3. HTTP client discarded the certprovider.Provider from NewClientGetCertificate. That leaked the refresh goroutine, and NewHttpClientWithTLS had a worse case where a CA-file failure after provider creation orphaned the provider entirely. Added a certProvider field and a Close() method on HTTPClient, and made the constructors close the provider on subsequent error paths.
4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain the provider. filer and webdav run ServeTLS synchronously, so a plain defer works. master/volume/s3 dispatch goroutines and return while the server keeps running, so they hook Close() into grace.OnInterrupt.
5. Test: certreload_test now tolerates transient read/parse errors during file rotation (writeSelfSigned rewrites cert before key) and reports the last error only if the deadline expires.
No user-visible behavior change for the happy path.
* test(tls): add end-to-end HTTPS cert rotation integration test
Boots a real `weed master` with HTTPS enabled, captures the leaf cert served at TLS handshake time, atomically rewrites the cert/key files on disk (the same rename-in-place pattern kubelet does when it swaps a cert-manager Secret), and asserts that a subsequent TLS handshake observes the rotated leaf, with no process restart, no SIGHUP, no reloader sidecar.
Verifies the full path: on-disk change → pemfile refresh tick → provider.KeyMaterial → tls.Config.GetCertificate → server TLS handshake.
Runtime is ~1s by exposing the reloader's refresh window as an env var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the test. The same env var is user-facing (documented in the wiki), so operators running short-lived certs (Vault, cert-manager with duration: 24h, etc.) can tighten the rotation-pickup window without a rebuild. Defaults to 5h to preserve prior behavior.
security.CredRefreshingInterval is kept for API compatibility but now aliases certreload.DefaultRefreshInterval so the same env controls both gRPC mTLS and HTTPS reload.
* ci(tls): wire the TLS rotation integration test into GitHub Actions
Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner, Go 1.25, build weed, run `go test` in test/tls_rotation, upload master logs on failure. 10-minute job timeout; the test itself finishes in about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms inside the test.
Runs on every push to master and on every PR to master.
* fix(tls): address follow-up PR review comments
Three new comments on the integration test + volume shutdown path:
1. Test: peekServerCert was swallowing every dial/handshake error, which meant waitForCert's "last err: <nil>" fatal message lost all diagnostic value. Thread errors back through: peekServerCert now returns (*x509.Certificate, error), and waitForCert records the latest error so a CI flake points at the actual cause (master didn't come up, handshake rejected, CA pool mismatch, etc.).
2. Test: set HOME=<tempdir> on the master subprocess. Viper today registers the literal path "$HOME/.seaweedfs" without env expansion, so a developer's ~/.seaweedfs/security.toml is accidentally invisible; the test was relying on that. Pinning HOME is belt-and-braces against a future viper upgrade that does expand env vars.
3. volume.go: startClusterHttpService's provider close was registered via grace.OnInterrupt, which fires on SIGTERM but NOT on the v.shutdownCtx.Done() path used by mini / integration tests. The pemfile refresh goroutine leaked in that shutdown path. Now the helper returns a close func and the caller invokes it on BOTH shutdown paths for parity.
Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the ast-grep static-analysis nit; this is zero-risk since the pool only trusts our in-memory CA. Test runs clean 3/3. |
||
|
|
e77f8ae204 |
fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157)
The STS GetFederationToken handler was implemented but never reachable. Three routing gaps sent requests to the S3/IAM path instead of STS:
- No explicit mux route for Action=GetFederationToken in the URL query
- iamMatcher did not exclude GetFederationToken, so authenticated POSTs with Action in the form body were matched and dispatched to IAM
- UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity to STS, leaving GetFederationToken to fall through to DoActions and return NotImplemented
Add the missing route, the matcher exclusion, and the dispatch branch.
Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount into the s3-iam-tests workflow as a new "sts" matrix entry. Before this change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four test functions ran in CI, which is why this regression shipped.
* test(iam): make orphaned STS/service-account tests pass under auth-enabled CI
Follow-up to wiring STS tests into CI: fixes several pre-existing issues that made the newly-included tests fail locally.
Server fixes:
- weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when the caller is a legacy S3-config identity (not in the IAM user store). Previously any GetPoliciesForUser error short-circuited to InternalError, which hard-failed every SigV4 caller using keys from -s3.config.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs in the sa:<parent>:<uuid> format required by credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed the persistence-layer regex and caused every CreateServiceAccount call to return 500 once a filer-backed credential store validated the ID.
Test helpers:
- test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from req.URL.Host, and a manual Host header made the signer emit host;host in SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn subtest to match the existing SeaweedFS behavior (user-context assumption).
- test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned IAM writes otherwise fall through to the STS fallback and return InvalidAction.
CI matrix:
- .github/workflows/s3-iam-tests.yml: skip TestServiceAccountLifecycle/use_service_account_credentials only. The rest of the service-account suite passes; that one subtest depends on a separate credential-reload issue where new ABIA keys briefly register into accessKeyIdent but aren't persisted to the filer, so they vanish on the next reload. Out of scope for the #9157 GetFederationToken fix.
* fix(credential): accept AWS IAM username chars in service-account IDs
Gemini review on #9167 pointed out that ServiceAccountIdPattern's parent-user segment was more restrictive than an AWS IAM username: `[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with `@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail validation at the filer store even though the embedded IAM API happily created them.
Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html) and add a table-driven test that locks the expansion in.
* address PR review feedback on #9167
All five review items were valid; changes keyed to review bullets:
- weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows arbitrary policy-lookup failures. Only credential.ErrUserNotFound is treated leniently (the legacy-config SigV4 path); any other error now returns InternalError so we don't mint tokens with an incomplete policy set.
- weed/credential/grpc/grpc_identity.go: GetUser translates gRPC NotFound back to credential.ErrUserNotFound so errors.Is(...) above matches for gRPC-backed stores, not just memory/filer-direct.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates the generated saId against credential.ValidateServiceAccountId before returning. Surfaces a client 400 with the offending ID instead of the opaque 500 that used to bubble up from the persistence layer.
- weed/s3api/s3api_server_routing_test.go: seed a routing-test identity with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody with aws-sdk-go v4.Signer so the request actually passes AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers with no stsService) instead of just NotEqual(501); 503 proves the dispatch reached STSHandlers.HandleSTSRequest.
- test/s3/iam/s3_service_account_test.go: callIAMAPI signs with service="iam" instead of "s3" (SeaweedFS verifies against whichever service the client signed with, but "iam" is semantically correct).
- weed/credential/validation_test.go: add positive rows for an uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000). |
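The service-account ID rules discussed in the commit above can be sketched as a table-driven check. Only the sa:<parent>:<uuid> shape and the parent character class `[A-Za-z0-9_+=,.@-]` come from the commit text; the full pattern below, including how the suffix is matched, is an assumption and may differ from the real ServiceAccountIdPattern. The name validServiceAccountID is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumed pattern: "sa:", a parent user built from the broadened character
// class named in the commit, ":", then a suffix of letters, digits, and
// hyphens. The suffix rule is a guess for illustration only.
var serviceAccountID = regexp.MustCompile(`^sa:[A-Za-z0-9_+=,.@-]+:[A-Za-z0-9-]+$`)

// validServiceAccountID reports whether id has the assumed shape.
func validServiceAccountID(id string) bool {
	return serviceAccountID.MatchString(id)
}

func main() {
	cases := []string{
		"sa:alice:123e4567-e89b-12d3-a456-426614174000",             // ID taken from the commit text
		"sa:alice@example.com:123e4567-e89b-12d3-a456-426614174000", // email-style parent
		"sa-1A2B3C4D",       // old flat format, no parent segment
		"sa:bad user:abc123", // space is outside the character class
	}
	for _, id := range cases {
		fmt.Printf("%-62s valid=%v\n", id, validServiceAccountID(id))
	}
}
```

Validating the generated ID before persisting it is what lets the API return a client error naming the offending ID, rather than a failure surfacing later from the storage layer.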
||
|
|
25d7f2c569 |
build(deps): bump docker/build-push-action from 6 to 7 (#9151)
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6 to 7.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](https://github.com/docker/build-push-action/compare/v6...v7)
---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-version: '7'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
86c5e815d2 |
fix(kafka): make consumer-group rebalancing work end-to-end (#9143)
* fix(kafka): make consumer-group rebalancing work end-to-end
TestConsumerGroups was failing every run since the job was added
(2026-04-17) but the failures were masked by a `|| echo ...` trailer on
the go test invocation, so the CI reported green. Removing the mask
exposes several real bugs in the gateway's group-coordinator code:
1. JoinGroup deduplicated members by ClientID, which collapsed two
Sarama consumers that share the default ClientID ("sarama") into a
single member slot and broke rebalancing. Key dedup off the TCP
ConnectionID instead; keep ClientID on the member for DescribeGroup
fidelity.
2. Every JoinGroup replaced the *GroupMember struct, wiping the
Assignment the leader had just published in its SyncGroup and leaving
non-leader consumers with 0 partitions after a rebalance. Update the
existing member in place on rejoin.
3. Non-leader SyncGroup returned an empty assignment while the leader
was mid-rebalance, so consumers silently came up with no partitions.
Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama
retries the join/sync cycle (4 retries x 2s backoff by default).
4. Heartbeat returned ILLEGAL_GENERATION on a gen mismatch even when
the group was in PreparingRebalance/CompletingRebalance. Return
REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly
cancels the session instead of tearing it down on a fatal error.
5. LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3
(Members array) by default, so the gateway silently rejected the
request as InvalidGroupID and dead consumers stayed in the group as
phantom leaders. Added v3 (Members array) and v4+ (flexible/compact/
tagged-fields) parsing.
The rebalancing integration tests called Consume() once per consumer,
which cannot survive a rebalance (heartbeat RBIP cancels the session
and Consume() returns - this is documented Sarama behaviour; callers
are expected to loop). Added a runConsumeLoop helper and used it in the
four affected sub-tests. RebalanceTestHandler.Setup now overwrites
stale entries in its assignments channel so the test observes the
settled post-rebalance snapshot rather than whatever arrived first.
* fix(kafka): address PR review feedback
- JoinGroup now snapshots existing members before mutating and restores
the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the
rollback path always deleted the entry, corrupting group state when
an existing member rejoined with an incompatible protocol.
- handleLeaveGroup iterates request.Members instead of processing only
the first entry, so v3+ batch departures (KIP-345 style) correctly
remove every listed member and build a per-member response. A single
group-state transition runs after the loop, with leader election
only triggered if the actual group leader was among the departures.
- Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser
already decoded flexible versions, but the response still went out in
non-flexible encoding (4-byte array lengths, 2-byte strings, no
tagged fields), which v4+ clients could not parse. Route flexible
versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse.
- BasicFunctionality gives each consumer its own
ConsumerGroupHandler/ready channel. The previous shared handler
closed ready once, so readyCount advanced to numConsumers from a
single signal; the test could proceed without the other consumers
actually reaching Setup.
- RebalanceTestHandler.assignments is now a size-1 channel, so readers
always observe the most recent rebalance snapshot instead of an
intermediate one from an earlier round.
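A minimal sketch of three of the coordinator rules described above: keying members by connection instead of ClientID, updating a rejoining member in place, and answering REBALANCE_IN_PROGRESS while the group is not Stable. The types and function names are hypothetical and much simpler than the gateway's real group coordinator; only the numeric error codes are taken from the Kafka protocol.

```go
package main

import "fmt"

// Kafka protocol error codes.
const (
	errNone                int16 = 0
	errIllegalGeneration   int16 = 22
	errUnknownMemberID     int16 = 25
	errRebalanceInProgress int16 = 27
)

type groupState int

const (
	stateEmpty groupState = iota
	statePreparingRebalance
	stateCompletingRebalance
	stateStable
)

// member and group are hypothetical stand-ins for the gateway's structs.
type member struct {
	ID         string
	ClientID   string
	Assignment []byte
}

type group struct {
	State      groupState
	Generation int32
	members    map[string]*member // keyed by connection ID, not ClientID
}

func newGroup() *group { return &group{members: map[string]*member{}} }

// join registers a member per connection. A rejoin updates the existing
// struct in place, so an Assignment published earlier is not wiped.
func (g *group) join(connID, clientID string) *member {
	if m, ok := g.members[connID]; ok {
		m.ClientID = clientID
		return m
	}
	m := &member{ID: clientID + "-" + connID, ClientID: clientID}
	g.members[connID] = m
	return m
}

func (g *group) rebalancing() bool {
	return g.State == statePreparingRebalance || g.State == stateCompletingRebalance
}

// syncAssignment is the non-leader path: no empty assignment is handed out
// while the group is still settling.
func (g *group) syncAssignment(connID string) ([]byte, int16) {
	if g.State != stateStable {
		return nil, errRebalanceInProgress
	}
	m, ok := g.members[connID]
	if !ok {
		return nil, errUnknownMemberID
	}
	return m.Assignment, errNone
}

// heartbeat maps a generation mismatch to a retriable error during a rebalance.
func (g *group) heartbeat(generation int32) int16 {
	if generation != g.Generation {
		if g.rebalancing() {
			return errRebalanceInProgress
		}
		return errIllegalGeneration
	}
	return errNone
}

func main() {
	g := newGroup()
	g.join("conn-1", "sarama")
	g.join("conn-2", "sarama") // same default ClientID, different connection
	fmt.Println("members:", len(g.members))
}
```

With ClientID as the map key, the two joins in main would have collapsed into one member slot, which is the first bug the commit describes.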
|
||
|
|
a8ba9d106e |
peer chunk sharing 7/8: tryPeerRead read-path hook (#9136)
* mount: batched announcer + pooled peer conns for mount-to-mount RPCs
  * peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded.
  * peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced.
  * WFS starts both alongside the gRPC server; stops them on unmount.
* mount: wire tryPeerRead via FetchChunk streaming gRPC
Replaces the HTTP GET byte-transfer path with a gRPC server-stream FetchChunk call. Same fall-through semantics: any failure drops through to entryChunkGroup.ReadDataAt, so reads never slow below status quo.
  * peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk (flattening manifests), asks the HRW owner for holders via ChunkLookup, then opens FetchChunk on each holder in LRU order (PR #5) until one succeeds. Assembled bytes are verified against FileChunk.ETag end-to-end; the peer is still treated as untrusted. Reuses the shared PeerConnPool from PR #6 for all outbound gRPC.
  * peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing itself on a self-owned fid.
  * filehandle_read.go: tryPeerRead slot between tryRDMARead and entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the presence of peerGrpcServer (the single identity test).
Read ordering with the feature enabled is now:
local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server
One port, one identity, one connection pool, no more HTTP bytecast.
* test(fuse_p2p): end-to-end CI test for peer chunk sharing
Adds a FUSE-backed integration test that proves mount B can satisfy a read from mount A's chunk cache instead of the volume tier.
Layout (modelled on test/fuse_dlm):
test/fuse_p2p/framework_test.go: cluster harness (1 master, 1 volume, 1 filer, N mounts, all with -peer.enable)
test/fuse_p2p/peer_chunk_sharing_test.go: writer-reader scenario
The test (TestPeerChunkSharing_ReadersPullFromPeerCache):
1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner of a chunk is self ~50 % of the time (peer path short-circuits); with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly exercises the remote-owner fan-out.
2. Mount 0 writes a ~8 MiB file, then reads it back through its own FUSE to warm its chunk cache.
3. Waits for seed convergence (one full MountList refresh) plus an announcer flush cycle, so chunk-holder entries have reached each HRW owner.
4. Mount 1 reads the same file.
5. Verifies byte-for-byte equality AND greps mount 1's log for "peer read successful"; content matching alone is not proof (the volume fallback would also succeed), so the log marker is what distinguishes p2p from fallback.
Workflow .github/workflows/fuse-p2p-integration.yml triggers on any change to mount/filer peer code, the p2p protos, or the test itself. Failure artifacts (server + mount logs) are uploaded for 3 days. Mounts run with -v=4 so the tryPeerRead success/failure glog messages land in the log file the test greps.