mirror of
https://github.com/seaweedfs/seaweedfs.git
synced 2026-05-13 21:31:32 +00:00
master
359 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
d5c0a7b153 |
fix(ec): make multi-disk same-server EC reads work + full-lifecycle integration test (#9487)
* fix(master): include GrpcPort in LookupEcVolume response LookupVolume already passes loc.GrpcPort through to the client; LookupEcVolume builds Location with only Url / PublicUrl / DataCenter, so callers fall back to ServerToGrpcAddress (httpPort + 10000). On any deployment where that convention does not hold — multi-disk integration tests, custom port layouts — EC reads dial the wrong port and quietly degrade to parity recovery. * fix(volume/ec): probe every DiskLocation when serving local shard reads reconcileEcShardsAcrossDisks (issue 9212) registers each .ec?? against the DiskLocation that physically owns it, so a multi-disk volume server can hold shards for the same vid in two separate ecVolumes — one per disk — with .ecx on whichever disk owned the original .dat. The read path only consulted the single EcVolume FindEcVolume picked, so requests for shards on the sibling disk fell through to errShardNotLocal and then to remote/loopback recovery. Walk all DiskLocations after the first probe in both readLocalEcShardInterval and the VolumeEcShardRead gRPC handler; the latter also covers the loopback that recoverOneRemoteEcShardInterval falls back to when a peer dial fails. * test(volume/ec): cover the multi-disk EC lifecycle end-to-end Two integration tests against a real volume server with two data dirs: TestEcLifecycleAcrossMultipleDisks drives encode -> mount -> HTTP read -> drop .dat -> stop -> redistribute shards across disks -> restart -> verify reconcileEcShardsAcrossDisks attached the orphan shards and reads still work -> blob delete -> stop -> drop a shard -> restart -> VolumeEcShardsRebuild pulls input from both disks -> reads still work. TestEcPartialShardsOnSiblingDiskCleanedUpOnRestart is the issue 9478 reproducer at the cluster level: seed a healthy .dat on disk 0, plant the on-disk footprint of an interrupted EC encode on disk 1, restart, and assert pruneIncompleteEcWithSiblingDat wipes disk 1 without touching disk 0. Framework gets RestartVolumeServer / StopVolumeServer helpers; the previous run's volume.log is rotated to volume.log.previous so a startup regression on the second run does not lose the first run's diagnostics. * review: trim verbose comments * review: drop racy fast-path, use locked findEcShard directly gemini-code-assist flagged the two-step lookup in readLocalEcShardInterval and VolumeEcShardRead: the first probe (ecVolume.FindEcVolumeShard) reads the EcVolume's Shards slice without holding ecVolumesLock, so a concurrent mount / unmount could race with it. findEcShard already walks every DiskLocation under the right lock, so the fast-path adds nothing but the race. Collapse both call sites to a single locked call. Also note in RestartVolumeServer why the log-rotation error is swallowed: absence on first call is benign; anything else surfaces in the next os.Create in startVolume. |
||
|
|
b1d59b04a8 |
fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on master after PRs #9475 + #9466: 1. Walker dispatch for ABORT_MPU was sending entry.DestKey as req.ObjectPath. The server's ABORT_MPU handler (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/ prefix to extract the upload id and reads the init record from that directory, so it expects the .uploads/<id> path verbatim. DestKey looks like a regular object path; the server's prefix check fails and the dispatch returns BLOCKED with "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/ prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey to ...UsesUploadsPath and inverts the assertion to match the actual server contract. DestKey is still used for the WalkBuckets shard predicate and for rule-prefix matching in bootstrap.walker; both surfaces want the user's intended path, while DISPATCH wants the .uploads/<id> directory. The bootstrap test (TestLifecycleAbortIncompleteMultipartUpload) caught this when the walker's BLOCKED error surfaced as FATAL output. 2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the shell command logs "loaded lifecycle for N bucket(s)" so a regression that produces half-shaped output (no load summary) is caught. The restored shell command (PR #9475) didn't print that line; add it back on the first pass that finds non-zero inputs. * fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path) runShard's empty-replay sentinel (rsh == [32]byte{}) was returning BEFORE the steady-state walker check. A bucket whose only lifecycle rule was walker-only (ExpirationDate / ExpiredDeleteMarker / NewerNoncurrent) would never have it dispatched because: - ReplayContentHash only hashes replay-eligible kinds, so walker-only-only snapshots produce rsh == empty. - The early-return persisted the empty cursor and exited before the steady-state walker block at the bottom of the function. Move the walker invocation INTO the empty-replay branch so walker- only rules dispatch on the same path as mixed-rule buckets. TestLifecycleExpirationDateInThePast and TestLifecycleExpiredDeleteMarkerCleanup were both timing out their "object must be deleted" Eventually polls because of this. Caught on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the shell entry point that exercises the integration tests. * fix(s3/lifecycle): cold-start walker covers pre-existing objects runShard only walked the bucket tree on the recovery branch (found && hash mismatch). For a fresh worker with no persisted cursor, found=false, so the recovery walker never fired and the meta-log replay only scanned runNow - maxTTL of events. Objects PUT before that window — including pre-existing objects in a newly-rule-enabled bucket — never matched the rule. The streaming worker handled this with scheduler.BucketBootstrapper. Daily-replay needed the equivalent: walk the live tree once on the first run for each shard so pre-existing objects get evaluated even when their PUT events are outside meta-log scan window. Restructured the recovery branch to fire the walker on either (found && mismatch) OR !found. On cold-start the cursor isn't rewound — we keep TsNs=0 and let the drain below floor to runNow - maxTTL like before; the walker just handles whatever the sliding window can't reach. TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151). * fix(s3/lifecycle): restore walker tag and null-version state Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): parallelize shell shard sweeps Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard Two CI bugs surfaced after PR #9466 deleted the streaming worker: 1. The shell command's -refresh loop never fires. runPass used the outer ctx (full -runtime), so dailyrun.Run blocked for the entire 1800s s3tests window — the background worker only ran one pass and never re-loaded configs that tests created mid-run. test_lifecycle_expiration sees 6 objects when expecting 4 because expire1/* never reaches the worker's snapshot. Cap each pass to cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx. 2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but runLifecycleShard invoked with no -refresh — only one pass ran. The marker rule can't fire in the same pass that dispatches v1's delete because v1 is still in .versions/. Add -refresh 1s so the 10s runtime gets multiple passes. * fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout drainShardEvents only exits via ctx cancellation for an idle subscription — that's the steady-state when all replayed events are already past. Saving the cursor with the canceled passCtx silently drops every advance, so the next pass re-subscribes from the same floor and re-replays the same events. Symptom in s3tests: status=error shards=16 errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race between concurrent shard drains all retrying the same events. Use a 5s timeout derived from context.Background for the save, and treat passCtx Deadline/Canceled from drain as a clean end-of-pass — not a shard-level error to log. * fix(s3/lifecycle): trust persisted cursor; never bump past pending events The drain freezes cursorAdvanceTo at the last pre-skip event so pending matches (DueTime > runNow) re-enter the subscription next pass. Combined with the new cursor persistence, the floor bump (runNow - maxTTL) then orphans the very events the drain stopped at. Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME + maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the last advance saved a cursor right before the not-yet-due PUT (e.g., keep2/* between expire1/* and expire3/* on the same shard), the floor bump on pass 9 skips past the expire3 event itself — the worker never re-reads it. Test symptom: expire3/* never expires when worker shards include other earlier no-match events. Cold start (found=false) still subscribes from runNow - maxTTL. Steady state honors the cursor verbatim. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> |
||
|
|
10cc06333b |
cluster: restrict Ping RPC to known peers of the requested type (#9445)
Ping previously dialled whatever host:port the caller asked for. Gate each server's Ping handler on cluster membership: masters check the topology, registered cluster nodes, and configured master peers; volume servers only accept their seed/current masters; filers accept tracked peer filers, the master-learned volume server set, and configured masters. Use address-indexed peer lookups to keep Ping target validation O(1): - topology maintains a pb.ServerAddress -> *DataNode index alongside the dc/rack/node tree, kept in sync from doLinkChildNode and UnlinkChildNode plus the ip/port-rewrite branch in GetOrCreateDataNode. GetTopology now returns nil on a detached subtree instead of panicking, so the linkage hooks can no-op safely. - vid_map tracks a refcount per volume-server address so hasVolumeServer answers without scanning every vid location. The add path skips empty-address entries the same way the delete path already does, so a zero-value Location cannot leak a permanent serverRefCount[""] bucket. - masters reuse a cached master-address set from MasterClient instead of walking the configured peer slice on every request. - volume servers compare against a pre-built seed-master set and protect currentMaster reads/writes with an RWMutex, fixing the data race with the heartbeat goroutine. The seed slice is copied on construction so external mutation cannot desync it from the frozen lookup set. - cluster.check drops the direct volume-to-volume sweep; volume servers no longer carry a peer-volume list, and the note next to the dropped probe is reworded to make clear that direct volume-to-volume reachability is intentionally not validated by this command. Update the volume-server integration tests that drove Ping through the new admission gate: success-path coverage now targets the master peer (the only type a volume server tracks), and the unknown/unreachable path asserts the InvalidArgument the gate now returns instead of the old downstream dial error. Mirror the same admission gate in the Rust volume server crate: a seed-master HashSet built once at startup plus a tokio RwLock over the heartbeat-tracked current master, both consulted in is_known_ping_target on every Ping, with InvalidArgument returned for any target that isn't a recognised master. |
||
|
|
69da20bdae |
volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the destructive volume-server RPCs, and reject cluster-internal remote S3 endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before dialing. Pin the validated address against DNS rebinding by routing the AWS SDK through an HTTP transport whose DialContext re-resolves the host and re-applies the deny list on every dial, so an endpoint that resolves to a public IP at validate-time and then flips to 127.0.0.1 at connect time is refused. Operators that legitimately fetch from private hosts can opt out with -volume.allowUntrustedRemoteEndpoints. |
||
|
|
46bb70d93e |
feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions A version's noncurrent TTL clock starts when the next version is written, not at its own mtime. Today the lifecycle engine derives that moment from the next-newer sibling's mtime — a heuristic that drifts if the sibling is later modified and is unavailable when the demoting event sits outside meta-log retention. Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at the two places where a PUT flips the latest pointer: updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning. Timestamp source is time.Now().UnixNano() captured once per demotion — the documented Phase 1 fallback until the filer write API surfaces its own TsNs. Engine reads the stamp on both the bootstrap walker path and the event-driven router; missing/zero falls back to the legacy sibling-mtime derivation, so pre-stamp entries keep working. Prerequisite for the daily-replay lifecycle worker (Phase 2+). * fix(s3): address CI failure and PR review feedback - Backdating tests must move both clocks: the lifecycle integration tests backdate version mtimes to simulate aging, but my earlier commit made the engine prefer the explicit demotion stamp over sibling mtime, so a real-now stamp dominated a backdated mtime and the rule never fired. Update backdateVersionedMtime to also rewrite Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it. This is a test simplification — production stamps record when the successor was written, not the demoted version's own mtime — but the resulting clock is correctly old enough. - Refactor stamp parsing into one shared helper. Per gemini-code-assist: the parsing logic for ExtNoncurrentSinceNsKey was duplicated in router/router.go and scheduler/bootstrap.go. Move it to a new weed/s3api/s3lifecycle/noncurrent_since.go as exported SuccessorFromEntryStamp; both call sites now go through it. - Make the parser ordering test deterministic. Per coderabbitai: time.Now().UnixNano() drops the monotonic clock component, so two back-to-back calls can decrease if the wall clock steps backward — the prior test was exercising OS clock behavior rather than the parser. Replace with fixed nanosecond values. - Close a suspended-versioning race. Per coderabbitai: the prior putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning after putToFiler returned, i.e. after the object write lock released. A concurrent PUT could promote a newer latest version, which we'd then wipe — leaving the older "null" object incorrectly current. Move the cleanup into the afterCreate callback so the null write and the .versions pointer clear (including the new demotion stamp) run atomically under the same lock. Best-effort logging is preserved. * fix(s3/lifecycle): clear noncurrent_since stamp on test backdate Backdating a version's mtime in tests is not a coherent claim about when it became noncurrent — production stamps record the successor's PUT time, which the test doesn't manipulate. The prior commit rewrote the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent that creates an inconsistent state: v3's stamp says "demoted 30 days ago" while v4's mtime (the supposed demoter) is real-now. With both NewerNoncurrentVersions and NoncurrentDays in the same rule, the NoncurrentDays floor passes against the backdated stamp and the rank-based check then deletes v3 via the meta-log historical replay that misranks against current state. Clearing the stamp instead lets the lifecycle engine fall back to the sibling-mtime derivation the tests were originally written against: the legacy code path is preserved end-to-end while the new explicit- stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go and the bootstrap-walker integration in scheduler/bootstrap_test.go. The deeper interaction — historical meta-log replay ranking against current state inside routePointerTransitionExpand — is pre-existing and is no longer masked by the freshly-PUT successor's mtime once the stamp is read. Tracked separately; not blocking this PR. * fix(s3): stamp noncurrent_since before the .versions/ pointer flip The pointer-flip on the .versions/ directory emits a meta-log event that the lifecycle router consumes via routePointerTransition. The router then calls LookupVersion on the demoted version's id. With the prior ordering — pointer flip first, stamp second — the router could read the demoted entry before markVersionNoncurrent landed and fall back to the legacy sibling-mtime derivation. Versioned COPY is the clean break: the new latest version keeps the source object's mtime instead of recording the moment v_old was demoted, so the fallback's successor clock can be arbitrarily wrong. Reorder both updateLatestVersionInDirectory and updateIsLatestFlagsForSuspendedVersioning so the stamp is written first; the pointer flip then emits an event into a state where the stamp is already present. Failure of the stamp write remains non-fatal — lifecycle still falls back to the legacy derivation in that case, with the same caveats as before the PR but no race window. |
||
|
|
c7b01c72b2 |
test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters
First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.
Five new tests:
- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
versioned bucket must produce a delete marker (latest after worker
runs is a marker) AND keep the original version directly addressable
by versionId. ListObjectVersions confirms IsLatest=true on the
marker.
- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
backdate v1, run worker. v1 must be gone, v2 still current.
- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
DELETE (creates marker), backdate both, run worker. Every version
AND marker must be gone for the key.
- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
must not produce dispatches even on a backdated match. Negative
test for the engine's enabled-status gate.
- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
objects carrying the tag. Two backdated objects (one tagged, one
not) — only the tagged one is removed.
Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).
* test(s3/lifecycle): tighten noncurrent expiration diagnostics
Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:
- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
markers, with IsLatest) so the next failure shows which side the
bug is on rather than just NotFound
* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload
Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).
Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.
Helpers added: putAbortMPULifecycle, backdateUploadDir.
* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions
NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.
Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.
Helper added: putNewerNoncurrentLifecycle.
* test(s3/lifecycle): integration coverage for suspended-versioning Expiration
Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.
Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.
Helper added: putVersioningSuspended.
* test(s3/lifecycle): integration coverage for multi-bucket sweep
A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.
Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.
* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter
ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.
Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.
* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup
Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:
- A later test's shard sweep would still load the prior test's
lifecycle config (the worker reads every bucket's XML from filer
state, and DeleteBucket alone doesn't drop lifecycle config
cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
ListObjectsV2 can't see, so the existing best-effort empty-then-
delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.
Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.
* test(s3/lifecycle): backdate both versions for NoncurrentDays clock
Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.
Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.
* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion
Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.
* test(s3/lifecycle): gofmt size-filter test
* test(s3/lifecycle): integration coverage for Object Lock skip
Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.
Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.
Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.
* test(s3/lifecycle): integration coverage for config update between sweeps
An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.
Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.
Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.
* test(s3/lifecycle): integration coverage for ExpirationDate (past)
Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.
Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.
* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects
Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.
Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.
* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching
Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.
Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.
* test(s3/lifecycle): integration coverage for empty bucket sweep no-op
A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).
Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.
* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate
ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.
ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.
* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker
The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.
EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.
Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.
* test(s3/lifecycle): integration coverage for multiple rules per bucket
A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.
Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.
Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.
* fix(s3/lifecycle): mark ScanAtDate actions active in Compile
Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in
|
||
|
|
85abf3ca88 |
feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash) shard, routes events through the compiled lifecycle engine, and dispatches due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume. Operator-runnable harness for end-to-end Phase 3 validation while the plugin-worker auto-scheduler is still pending. EventBudget bounds a single invocation; flags expose dispatch + checkpoint cadence. Discovers buckets by walking the configured DirBuckets path and reading each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml] through lifecycle_xml.ParseCanonical. All compiled actions are seeded BootstrapComplete=true so the run dispatches whatever fires immediately; production bootstrap walks set this incrementally per bucket. * test(s3/lifecycle): integration test driving the run-shard shell command Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix, PUTs the target object, then rewrites the entry's Mtime via filer UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every shard via 'weed shell' subprocess and asserts the backdated object is deleted within 30s, and the in-prefix-but-recent object remains. The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable. Backdating via the filer's gRPC sidesteps that constraint while still exercising the real Reader -> Router -> Schedule -> Dispatcher -> LifecycleDelete RPC path end-to-end. Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs all 16 shards because ShardID(bucket, key) is hash-based and the test shouldn't couple to that detail; running every shard keeps the test independent of the hash function. * fix(shell/s3.lifecycle.run-shard): address review findings - Reject negative -events explicitly. Help text already defines 0 as unbounded; negative budgets created ambiguous behavior in pipeline.Run. - Bound the gRPC dial with a 30s timeout instead of context.Background() so an unreachable S3 endpoint doesn't hang the shell. - Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList takes a single-RPC limit; the prior 4096 silently dropped buckets past that page on large clusters. Loop with startFrom until a page comes back short. - Surface parse errors instead of swallowing them. Buckets with malformed lifecycle XML now print the first three errors verbatim and a count for the rest, so an operator running this command for diagnostics can find what's wrong. * feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads the same model through Reader and Pipeline. - reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs; ShardID stays for the single-shard short form. Event carries the computed ShardID so consumers can route per-shard without rehashing. - dispatcher.Pipeline gains Shards []int. When set, Run holds one Cursor + Schedule + Dispatcher per shard, opens one filer SubscribeMetadata stream with a predicate covering the whole set, and routes events into the matching shard's schedule from a single dispatch goroutine — no per-shard goroutine fan-out. - shell command parses -shard or -shards (mutually exclusive), formats progress messages with a contiguous-range label when applicable, and validates against ShardCount. Integration test now uses -shards 0-15 (one subprocess invocation) instead of a 16-iteration loop. * fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil The reader rejected the legitimate 'fresh subscription from epoch' state when called from a fresh Pipeline.Run on a multi-shard worker (no cursor file yet, all shards' MinTsNs=0). The downstream SubscribeMetadata call handles SinceNs=0 fine; the up-front check was over-defensive and broke the auto-scheduler completely (CI showed 5-second-cadence retries with this exact error). * fix(s3/lifecycle): schedule from ModTime not eventTime A backdated or out-of-band entry update has eventTime ≈ now while ModTime is far in the past; eventTime+Delay would push the dispatch into the future even though the rule already fires. ModTime+Delay is the correct fire moment. The dispatcher's identity-CAS still catches drift between schedule and dispatch. * fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards The CI integration test sets -events 200 expecting the subprocess to return after 200 in-shard events. But -events counts only events that pass the shard filter; the test produces ~5 such events (bucket create, lifecycle PUT, two object PUTs, mtime backdate), so the reader stays in stream.Recv forever and runShellCommand hangs the test deadline. - weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag. When > 0, Pipeline.Run runs under context.WithTimeout(D); on expiry the reader/dispatcher drain cleanly and the cursor saves. - weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat context.DeadlineExceeded the same as context.Canceled at exit (both are graceful shutdown signals). * test(s3/lifecycle): pass -runtime 10s to run-shard Pair with the new -runtime flag so the subprocess exits cleanly after 10s instead of waiting for an event budget that never lands on quiet shards. * refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg The worker's router needs the same length-prefixed sha256 of the entry's Extended map; pulling it out of the s3api private file lets both sides import it. * fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil while the live entry on the server has a non-nil hash, so every dispatch returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted. * fix(s3/lifecycle): identity HeadFid via GetFileIdString Meta-log events go through BeforeEntrySerialization, which clears FileChunk.FileId and writes the Fid struct instead. Reading .FileId directly returns "" on the worker side while the server's freshly fetched entry still has a populated string, so the identity-CAS would mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY. * fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx errors, so a subscription that ends because runCtx was canceled was being logged as a fatal reader error. Check status.Code as well so the shell's -runtime cap exits cleanly. * fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard run-shard's -s3 flag dials the LifecycleDelete gRPC service, which listens on s3.port + 10000. The integration test was passing the HTTP port instead, so the dispatcher's RPC just timed out and the shell command exited under -runtime with no work done. * chore(test/s3/lifecycle): drop emoji from Makefile output * docs(test/s3/lifecycle): correct '-shards 0-15' wording * fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run The shell's parseShardsSpec already validates, but a programmatic caller (scheduler, future worker config) shouldn't be able to silently produce no-op states by passing -1 or 99. * fix(s3/lifecycle): bound drain + final-save with their own timeouts Shutdown was using context.Background, so a stuck dispatcher RPC or filer save could keep Pipeline.Run from ever returning. * fix(test/s3/lifecycle): drop self-killing pkill in stop-server The pkill pattern \"weed mini -dir=...\" is also in the running shell's argv (it's the recipe body), so pkill -f matches its own bash and the recipe exits with Terminated. CI test job passed but the cleanup step failed with exit 2. The PID file is sufficient on its own. * docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var |
||
|
|
c918660901 | build(deps): bump io.netty:netty-transport-native-epoll from 4.1.132.Final to 4.2.13.Final in /test/java/spark (#9365) | ||
|
|
12f283357f |
fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333)
* fix(iam): scope IAM-managed OIDC provider lookup by role account Two account-scoped OIDC records sharing an issuer were collapsed into a single map slot keyed only by the URL. The last-write-wins entry then served every AssumeRoleWithWebIdentity, so a token destined for account B's role could be validated by account A's record (its clientIDs and thumbprints), defeating the per-account isolation the records exist for. The role-account check in enforceProviderAccountScope still rejected the cross-account assumption, but only after the wrong record's audience and TLS pin had already accepted the token. Refresh now keys IAM-managed records as (issuer, account), and validation parses the requested role's account up front and matches the record under that issuer in this order: exact account, global (account-less), static-config fallback. An unknown account hint deliberately skips account-scoped entries — picking one arbitrarily is the bug this commit fixes — and falls through to global or static. * fix(iam): route public AssumeRoleWithWebIdentity through IAMManager handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity directly, bypassing the IAMManager wrapper. The wrapper is where enforceProviderAccountScope rejects cross-account assumption attempts and capDurationByRole clamps to the role's MaxSessionDuration; both silently became no-ops for any AWS-SDK caller hitting the public endpoint. Dispatch through the IAMManager (via the existing IAMManagerProvider interface that other handlers in this file already use) when one is wired. Embedded test setups without an IAM integration fall back to the bare STS service unchanged. * fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config initOIDCProviderStore mirrored only URL and ClientIDs. Once RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on boot once the metadata-subscribe loop kicked in), buildOIDCProviderFromRecord rebuilt the runtime provider from this truncated record. Because IAM-managed entries take precedence over the static-config map, the rebuild silently shadowed the bootstrap with a weaker provider: - Thumbprints: dropped, so TLS-pinned issuers fell back to the system trust store. - AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped reaching the session. - PolicyClaim: dropped, so claim-based policy mode stopped triggering. Pull all three from the provider's static Config map at mirror time so the stored record round-trips to a runtime provider equivalent to the one the static config produced directly. * fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler Phase 3b advertises that RoleArn MAY be omitted in claim-based policy mode — the STS service then derives the assumed-role ARN from the configured policy claim. The HTTP handler still rejected empty RoleArn up front with MissingParameter, so SDK callers using the documented omitted-role flow never reached the STS layer. Drop the pre-check; STS still validates that claim-based mode is configured and that the IDP emits policies, returning a precise error when either is missing. The existing error mapping below this point surfaces those as InvalidParameterValue, matching what an AWS SDK expects. * test(iam): update missing-RoleArn STS integration test for the new contract The previous commit drops the HTTP-layer RoleArn pre-check so claim-based mode can derive the ARN from a JWT claim. The integration test still asserted MissingParameter for the missing-RoleArn case, which now reaches the STS layer and surfaces a JWT-parse error instead. Update the assertion to match: missing RoleArn alone must no longer surface as MissingParameter, but a bogus JWT must still be rejected. |
||
|
|
6141222ab0 |
fix(test/s3/policy): allocate fresh admin port per subtest (#9332)
* fix(test/s3/policy): allocate fresh admin port per subtest startMiniCluster ran weed mini in-process and explicitly assigned master/volume/filer/s3 ports allocated by MustAllocatePorts, but it left -admin.port and -admin.port.grpc unset, so each subtest reused the hardcoded defaults 23646 / 33646. The package's subtests run sequentially within the same go test process. The previous subtest's admin goroutine is still bound to 23646 by the time the next subtest spins up its own mini, so the new admin can never bind, mini.go's waitForAdminServerReady hits its 240-attempt cap, and glog.Fatalf kills the test binary. This has been the dominant cause of "admin server did not become ready" flakes across recent IAM PRs. Allocate two extra ports for admin and pass them through. The other subprocess-based tests (s3tables/*) are not affected because each launches weed mini in a fresh OS process. * fix(mini): make admin readiness wait context-aware waitForAdminServerReady polled for 240 attempts × 500ms regardless of whether the surrounding mini context was cancelled. When mini is run in-process from a test harness (test/s3/policy/...) and the test calls its cancel func, the leftover wait keeps spinning for the full two minutes and then glog.Fatalf's, terminating the entire test binary — including any sibling subtest that has since started its own mini. Thread the existing miniClientsCtx through the wait so a Stop / cancel returns context.Canceled immediately. The caller (startMiniAdminWithWorker) treats a context-cancelled outcome as a graceful shutdown signal and logs+returns instead of fataling. |
||
|
|
a769c938ec |
test(s3tables): Unity Catalog OSS integration tests against SeaweedFS (#9308)
* test(s3tables): add Unity Catalog OSS integration test against SeaweedFS Mirrors the configuration used by the upstream playground at data-engineering-helpers/mds-in-a-box/unitycatalog-playground. Three test variants under test/s3tables/unity_catalog: - TestUnityCatalogDeltaIntegration: aws.masterRoleArn empty / static keys; catalog/schema/EXTERNAL Delta CRUD + temporary-table-credentials S3 round-trip (the playground's working configuration). - TestUnityCatalogMasterRoleIntegration: aws.masterRoleArn set to a SeaweedFS-side role with a permissive trust policy; UC's StsClient is pinned at SeaweedFS via AWS_ENDPOINT_URL_STS, and the test asserts the vended creds carry a session_token and a non-static access key, proving the role-vended path the playground notes as not-yet-working actually does work today. - TestUnityCatalogDeltaRsRoundTrip: writes/reads a real Delta table at the registered storage_location using delta-rs in a slim Python container, with temporary credentials fetched from UC. All three self-skip without Docker or a weed binary, matching the sibling lakekeeper / polaris tests. * test(s3tables): tighten Unity Catalog tests against actual UC OSS behavior After running the suite locally, ground the assertions in what the upstream UC OSS Docker image actually does against SeaweedFS today. - Static-key playground configuration (TestUnityCatalogDeltaIntegration): catalog/schema/EXTERNAL Delta CRUD pass against the SeaweedFS-backed warehouse. The temporary-table- credentials subtest is renamed and inverted to assert the failure mode the playground reports -- UC's AwsCredentialVendor falls through to an internal StsClient.assumeRole when masterRoleArn and sessionToken are both empty, which has no real STS to talk to. Bucket path is also fixed to match UC's getStorageBase() lookup (s3://lakehouse vs the playground's s3://lakehouse/warehouse, which the upstream code never matches). - Master-role variant (TestUnityCatalogMasterRoleIntegration): split into two passing slices. Slice 1 proves SeaweedFS' STS endpoint vending UnityCatalogVendedRole works via the Go AWS SDK and the vended creds round-trip on S3. Slice 2 boots UC with aws.masterRoleArn set and verifies catalog/schema/Delta CRUD. The third hop -- UC's Java StsClient actually reaching SeaweedFS' STS handler during /temporary-table-credentials -- is logged but not asserted, since the AWS Java SDK's STS request currently lands on a SeaweedFS S3 path rather than the STS handler. - Delta-RS round-trip (TestUnityCatalogDeltaRsRoundTrip): gated on UC_DELTA_RS_RUN=1 since it depends on the master-role STS handoff above. The Dockerfile / writer script stay in tree so the test runs end-to-end the moment that hop is fixed. README rewritten to be explicit about what each test validates today and what is still pending. Result: `go test -run TestUnityCatalog ./test/s3tables/unity_catalog/...` passes cleanly with weed + Docker available, and self-skips otherwise. * test(s3tables): exercise unity catalog integrations * ci: run Unity Catalog integration tests on PRs Adds a unity-catalog-integration-tests job to s3-tables-tests.yml, modeled on the existing lakekeeper / dremio jobs. Pre-pulls the UC image and python:3.11-slim (used by the delta-rs writer container) and runs `go test ./test/s3tables/unity_catalog`. Format-check and go-vet jobs already recurse into ./test/s3tables/... so the new package is covered there too. * test/ci: address PR review Tighten the UC readiness probe to require 200, not <500, so a 401/403/404 during startup surfaces immediately instead of being treated as ready (CodeRabbit). Pin the UC image to v0.4.0 in both the workflow and the test default, matching the pinned-tag convention the rest of s3-tables-tests.yml uses (CodeRabbit). Use UC_IMAGE=unitycatalog/unitycatalog:main to re-test against current upstream. * docs: separate UC static-key vs master-role failure modes The README mixed the two together. Static-key empty-sessionToken short-circuits with "S3 bucket configuration not found." before UC even fires an STS call; the AccessDenied I described is what happens in the master-role variant where UC's Java StsClient actually reaches SeaweedFS. Cross-link the playground PR that fixes the static-key vending side. Also drop the "what most playground users actually run" hand-wave under MANAGED tables. * docs: trim README Drop the playground cross-reference and the "two layers fail independently" framing. * docs: pin down what's actually pending Investigated the master-role STS handoff with a sniffer in front of SeaweedFS' STS port. UC's StsClient is constructed without an endpointOverride and never reads aws.endpoint or AWS_ENDPOINT_URL_STS; verified by pointing AWS_ENDPOINT_URL_STS at port 1 and seeing the same real-AWS InvalidClientTokenId 403 with zero traffic to SeaweedFS. The fix is upstream in UC. Updated the README and the master-role test's t.Logf to say so precisely, and dropped the stale "Spark client" bullet (delta-rs covers that path). * test(s3tables): use BaseEndpoint instead of deprecated resolver EndpointResolverWithOptions is deprecated in aws-sdk-go-v2; the supported way to override a service endpoint is via the per-service Options.BaseEndpoint. Switch the assume-role helper to that pattern so the test stops compiling against deprecated API and the resolver boilerplate disappears. Addresses gemini review on PR #9308. * test(s3tables): drop unused splitS3URI helper Helper had no callers; gemini caught it on PR #9308. Easy to bring back from git history if needed. * test(s3tables): extract last token of docker run output as container ID docker run -d may prefix the container ID with image-pull progress when the image isn't cached locally. strings.TrimSpace on the whole output then gave a multi-line string, not the ID. Take the last whitespace-separated token so the ID survives a fresh CI runner. Addresses gemini review on PR #9308. * test(s3tables): cap Unity Catalog response body reads at 10 MiB io.ReadAll without a limit could OOM the test runner if the UC container hands back an unexpectedly large body. 10 MiB is well above any well-formed catalog response and turns a misbehaving server into a test failure instead of a runner crash. Addresses gemini review on PR #9308. * docs: link UC fix PR and call out UC's mocked-Sts test pattern UC's own credential-vending tests substitute StsClient with an in-process EchoAwsStsClient (BaseCRUDTestWithMockCredentials) or Mockito.mockStatic (CloudCredentialVendorTest), so the wire path between UC's Java SDK and a real STS server is untested -- which is why the missing endpointOverride slipped through upstream. Linked the upstream fix at unitycatalog/unitycatalog#1532. |
||
|
|
6d95a5592a |
test(s3/policy): stop racing t.TempDir cleanup against mini shutdown
The mini cluster's admin/plugin worker keeps creating files under admin/plugin/job_types/ for ~1s after subtests finish, while the previous Stop() only cancelled an unobserved context and slept 500ms. t.TempDir()'s registered RemoveAll then raced the worker and intermittently failed with "directory not empty" (CI run 25352039081). Manage the data dir manually so it is removed only after the mini goroutine has exited, and wire MiniClusterCtx so cancel actually drains master/volume/filer/admin/s3/webdav. |
||
|
|
6aa353716a |
test(kafka): snapshot consumer-group state mid-attempt for resumption flake
The onRetry hook in TestOffsetManagement/ConsumerGroupResumption only fires after defer reader.Close(), so every dump shows the post-LeaveGroup Empty state — useless for diagnosing why the second consumer hangs in the join cycle. Add an onTick callback fired every 1.5s while the reader is still joined so we can see PreparingRebalance / CompletingRebalance churn, leader, and member assignments during the 20s attempt window. |
||
|
|
1de741737d |
test(s3tables): add Apache Doris Iceberg catalog integration test (#9307)
* test(s3tables): add Apache Doris Iceberg catalog integration test Adds an end-to-end smoke test that boots the apache/doris all-in-one container, registers SeaweedFS as an external Iceberg REST catalog (OAuth2 client_credentials), and validates metadata visibility plus the parquet read path against tables seeded via the Iceberg REST API and a PyIceberg writer container, mirroring the existing Trino, Spark, and Dremio coverage. Wires the test into a new s3-tables-tests workflow job. * test(s3tables): document weed shell -master flag format and fill in helper docstrings Restores the explanatory comment on createTableBucket about the host:port.grpcPort ServerAddress format used by `weed shell -master` (produced by pb.NewServerAddress) so the dot separator isn't mistaken for a typo, and adds doc comments for createIcebergNamespace, createIcebergTable, doIcebergJSONRequest, requireDorisRuntime, and hasDocker. |
||
|
|
87bb0a4115 |
test(s3tables): capture weed mini stdout/stderr in catalog_spark tests
Without this, when "weed mini" fails to bind its master port within the 45s startup timeout the only diagnostic is the timeout message itself, making flakes impossible to root-cause. The sibling catalog, catalog_dremio, and catalog_trino packages already wire stdout/stderr through; bring catalog_spark in line. |
||
|
|
6844ec067c |
fix(s3): cache remote-only source before CopyObject (#9304) (#9305)
* fix(s3): cache remote-only source before CopyObject (#9304) CopyObject from a remote.mount source whose object lives only upstream created a destination entry with FileSize > 0 but no chunks/content, because the resolved source entry has no local chunks and the copy path fell into the "inline/empty chunks" branch with empty entry.Content. A subsequent GET returned 500 with "data integrity error: size N reported but no content available". CopyObjectPart had the same shape via copyChunksForRange iterating an empty chunk list. Detect entry.IsInRemoteOnly() right after resolving the source in both CopyObjectHandler and CopyObjectPartHandler and cache the object to the local cluster first via a new cacheRemoteObjectForCopy helper (a copy-time analogue of cacheRemoteObjectForStreaming with a bounded 30s timeout and version-aware path resolution). If caching fails or produces no chunks, return 503 with Retry-After: 5 instead of writing a metadata-only destination, mirroring the GetObject behavior added in the #7817 cold-cache fix. Adds TestCopyObjectRemoteOnlySourceDetection pinning the four entry shapes the fix branches on plus the pre-fix broken-output shape. * address PR review on remote-only copy fix - Use the resolved entry's version id when srcVersionId is empty so a CopyObject reading the latest object in a versioning-enabled bucket caches the correct .versions/v_<id> path instead of getting stuck in a 503 retry loop. New helper resolvedSourceVersionId handles the fallback for both CopyObject and CopyObjectPart. - Drop the redundant cachedEntry.IsInRemoteOnly() recheck in both handlers; the cache helper now reports success based on local data presence, and IsInRemoteOnly does not look at inline Content so keeping the check would 503 on small inline-cached objects. - Treat inline Content as a successful cache result in both cacheRemoteObjectForStreaming and cacheRemoteObjectForCopy via a shared cachedEntryHasLocalData predicate. The CopyObject inline branch already handles entries that have Content but no chunks. - Extract buildVersionedRemoteObjectPath so the streaming and copy cache helpers share path construction. Adds TestResolvedSourceVersionId and TestCachedEntryHasLocalData to pin the new helpers' contracts. * narrow streaming cache contract back to chunks-only CodeRabbit flagged that cacheRemoteObjectForStreaming's caller in streamFromVolumeServers (lines 997-1002) still required non-empty chunks, so content-only cache hits would fall through to a 503 retry loop instead of being honored. Resolve by keeping the helper's contract chunks-only: the filer's caching code only ever writes chunks, the streaming downstream isn't wired to read inline Content from a cached entry, and a partial range- aware inline writer here would be overkill for a path that doesn't actually occur in practice. cacheRemoteObjectForCopy keeps the relaxed contract since the copy path's inline branch genuinely handles both chunked and content-only entries. Document the asymmetry on cachedEntryHasLocalData and on cacheRemoteObjectForStreaming so a future reader can see why the two helpers diverge. * extend version-id resolution to streaming cache path CodeRabbit flagged that GetObjectHandler still passed the raw query versionId to cacheRemoteObjectForStreaming. For latest-version reads in versioning-enabled buckets that stays empty even though the resolved entry lives at .versions/v_<id>, so remote-only GETs would keep caching the wrong path and 503-ing forever. Reuse the new resolvedSourceVersionId helper at the streaming call site too. Also document on cachedEntryHasLocalData that the zero-byte case flagged in the same review is handled upstream (IsInRemoteOnly requires RemoteSize > 0, so the cache helper is never invoked for empty remote objects -- CopyObject's pre-existing inline branch writes a correct empty destination directly). Pin this with a new test case. * trim verbose comments Drop tutorial-style and review-history comments. Keep only the WHY that isn't obvious from identifiers: the #9304 reference on the new branches in CopyObject / CopyObjectPart, the latest-version-fallback rationale on resolvedSourceVersionId, and the streaming/copy contract asymmetry on cachedEntryHasLocalData. * drop issue references from comments Issue numbers belong in PR descriptions and commit messages, not in source comments where they rot. Replace with the underlying invariant the code is preserving. * test: drive remote-object cache helpers through real gRPC Existing tests only re-enacted helper-function branching in test space, so they could not have caught a handler that consumed a remote-only entry without going through the cache. Stand up an in-process filer gRPC server (UnimplementedSeaweedFilerServer + configurable CacheRemoteObjectToLocalCluster response) and exercise the two cache helpers end-to-end. What's pinned: - cacheRemoteObjectForCopy returns nil when the cache makes no progress (response is still remote-only), lets gRPC errors through as nil, accepts both chunked and inline-content cache hits, and surfaces deadline-exceeded as nil so callers can 503 instead of holding the request open. - Versioned source paths route to .versions/v_<id>; non-versioned and "null" stay at the bucket-relative path. Captured by reading the request the stub server received. - cacheRemoteObjectForStreaming holds the stricter chunks-only contract: a content-only cache hit is not propagated, since streamFromVolumeServers' downstream isn't wired to read from inline Content there. Any current or future handler that calls these helpers exercises the same gRPC path under test, so the bug class is closed for helper-routed cache calls. * move remote-only copy test into the integration suite The previous gRPC-stub test in weed/s3api/ was integration-flavored but stubbed; relocate the coverage to the existing two-server suite under test/s3/remote_cache/, which already exercises the real remote.mount + remote.uncache flow against a primary SeaweedFS plus a secondary acting as remote storage. The new test/s3/remote_cache/remote_cache_copy_test.go drives: - TestRemoteCacheCopyObject: upload to primary, uncache (entry now remote-only), CopyObject to a new key, GET the destination. Pre- fix the GET returned 500 'data integrity error: size N reported but no content'; this pins the fixed behavior over real HTTP through the actual handler stack. - TestRemoteCacheCopyObjectPart: same shape via multipart UploadPartCopy on a 6 MiB object split into two parts, exercising CopyObjectPartHandler's range-copy path. Drop weed/s3api/s3api_remote_storage_grpc_test.go: the helper-level classification tests in s3api_remote_storage_test.go still cover the contract pieces (cachedEntryHasLocalData, resolvedSourceVersionId, the remote-only entry shape), and the integration suite covers the end-to-end behavior that those classifications enable. |
||
|
|
fc75f16c30 |
test(s3tables): expand Dremio Iceberg catalog test coverage (#9303)
* test(s3tables): expand Dremio Iceberg catalog test coverage
Restructure TestDremioIcebergCatalog into subtests and add three new
checks that go beyond a connectivity smoke test:
- ColumnProjection: SELECT id, label proves Dremio parsed the schema
served by the SeaweedFS REST catalog (the previous SELECT COUNT(*)
passed without exercising any column metadata).
- InformationSchemaColumns: verifies the table's columns are listed in
Dremio's INFORMATION_SCHEMA.COLUMNS in the expected ordinal order.
- InformationSchemaTables: verifies the table is registered in
INFORMATION_SCHEMA.TABLES.
All subtests share a single Dremio container startup, so total
runtime is unchanged.
* test(s3tables): exercise multi-level Iceberg namespaces from Dremio
Seed a 2-level Iceberg namespace (and a table inside it) via the REST
catalog before bootstrapping Dremio, then add a MultiLevelNamespace
subtest that scans the nested table by its dot-separated reference.
This relies on isRecursiveAllowedNamespaces=true (already set in the
Dremio source config) to surface the nested levels as folders. A
regression in either the SeaweedFS namespace path encoding (#8959-style)
or Dremio's recursive-namespace discovery would surface here.
Adds two helpers to keep the existing single-level call sites unchanged:
- createIcebergNamespaceLevels: namespace creation with []string levels
- createIcebergTableInLevels: table creation with []string levels and
unit-separator (0x1F) URL encoding for the namespace path component
* test(s3tables): verify Dremio reads PyIceberg-written rows
The previous Dremio subtests only scanned empty tables, so they did not
exercise the data path - just the catalog/metadata path. Add a
PyIceberg-based writer that materializes parquet files plus a snapshot
on a separate table before Dremio bootstraps, and two new subtests:
- ReadWrittenDataCount: SELECT COUNT(*) returns 3.
- ReadWrittenDataValues: SELECT id, label ORDER BY id returns the three
written rows with the expected (id, label) pairs.
The writer runs in a small image (Dockerfile.writer) built locally on
demand. It pip-installs pyiceberg+pyarrow once and reuses the layer
cache on subsequent runs. The CI workflow pre-pulls python:3.11-slim
to keep cold runs predictable.
The writer authenticates via the OAuth2 client_credentials flow that
SeaweedFS already exposes at /v1/oauth/tokens, mirroring the Go-side
helper used for REST-API table creation.
* test(s3tables): fix Dremio writer required-field schema mismatch
PyIceberg's append() compatibility check rejects an arrow column whose
nullability does not match the Iceberg field. The table schema declares
id as `required long`, but the default pyarrow int64 column is nullable
- so the writer failed with:
1: id: required long vs. 1: id: optional long
Declare an explicit pyarrow schema with nullable=False on id and
nullable=True on label to match the Iceberg side.
|
||
|
|
1f6f473995 |
refactor(worker): co-locate plugin handlers with their task packages (#9301)
* refactor(worker): co-locate plugin handlers with their task packages
Move every per-task plugin handler from weed/plugin/worker/ into the
matching weed/worker/tasks/<name>/ package, so each task owns its
detection, scheduling, execution, and plugin handler in one place.
Step 0 (within pluginworker, no behavior change): extract shared helpers
that previously lived inside individual handler files into dedicated
files and export the ones now consumed across packages.
- activity.go: BuildExecutorActivity, BuildDetectorActivity
- config.go: ReadStringConfig/Double/Int64/Bytes/StringList, MapTaskPriority
- interval.go: ShouldSkipDetectionByInterval
- volume_state.go: VolumeState + consts, FilterMetricsByVolumeState/Location
- collection_filter.go: CollectionFilterMode + consts
- volume_metrics.go: export CollectVolumeMetricsFromMasters,
MasterAddressCandidates, FetchVolumeList
- testing_senders_test.go: shared test stubs
Phase 1: move the per-task plugin handlers (and the iceberg subpackage)
into their task packages.
weed/plugin/worker/vacuum_handler.go -> weed/worker/tasks/vacuum/plugin_handler.go
weed/plugin/worker/ec_balance_handler.go -> weed/worker/tasks/ec_balance/plugin_handler.go
weed/plugin/worker/erasure_coding_handler.go -> weed/worker/tasks/erasure_coding/plugin_handler.go
weed/plugin/worker/volume_balance_handler.go -> weed/worker/tasks/balance/plugin_handler.go
weed/plugin/worker/iceberg/ -> weed/worker/tasks/iceberg/
weed/plugin/worker/handlers/handlers.go now blank-imports all five
task subpackages so their init() registrations fire.
weed/command/mini.go and the worker tests construct the handler with
vacuum.DefaultMaxExecutionConcurrency (the constant moved with the
vacuum handler).
admin_script remains in weed/plugin/worker/ because there is no
underlying weed/worker/tasks/admin_script/ package to merge with.
* refactor(worker): update test/plugin_workers imports for moved handlers
Three handler constructors moved out of pluginworker into their task
packages — update the integration test files in test/plugin_workers/
to import from the new locations:
pluginworker.NewVacuumHandler -> vacuum.NewVacuumHandler
pluginworker.NewVolumeBalanceHandler -> balance.NewVolumeBalanceHandler
pluginworker.NewErasureCodingHandler -> erasure_coding.NewErasureCodingHandler
The pluginworker import is kept where the file still uses
pluginworker.WorkerOptions / pluginworker.JobHandler.
* refactor(worker): update test/s3tables iceberg import path
The iceberg subpackage moved from weed/plugin/worker/iceberg/ to
weed/worker/tasks/iceberg/. test/s3tables/maintenance/maintenance_integration_test.go
still imported the old path, breaking S3 Tables / RisingWave / Trino /
Spark / Iceberg-catalog / STS integration test builds.
Mirrors the OSS-side fix needed by every job in the run that
transitively imports test/s3tables/maintenance.
* chore: gofmt PR-touched files
The S3 Tables Format Check job runs `gofmt -l` over weed/s3api/s3tables
and test/s3tables, then fails if anything is unformatted. Files this
PR moved or modified had import-grouping and trailing-spacing issues
introduced by perl-based renames; reformat them with gofmt -w.
Touched files:
test/plugin_workers/erasure_coding/{detection,execution}_test.go
test/s3tables/maintenance/maintenance_integration_test.go
weed/plugin/worker/handlers/handlers.go
weed/worker/tasks/{balance,ec_balance,erasure_coding,vacuum}/plugin_handler*.go
* refactor(worker): bounds-checked int conversions for plugin config values
CodeQL flagged 18 go/incorrect-integer-conversion warnings on the moved
plugin handler files: results of pluginworker.ReadInt64Config (which
ultimately calls strconv.ParseInt with bit size 64) were being narrowed
to int32/uint32/int without an upper-bound check, so a malicious or
malformed admin/worker config value could overflow the target type.
Add three helpers in weed/plugin/worker/config.go that wrap
ReadInt64Config and clamp out-of-range values back to the caller's
fallback:
ReadInt32Config (math.MinInt32 .. math.MaxInt32)
ReadUint32Config (0 .. math.MaxUint32)
ReadIntConfig (math.MinInt32 .. math.MaxInt32, platform-portable)
Update each flagged call site in the four moved task packages to use
the bounds-checked helper. For protobuf uint32 fields (volume IDs)
the variable type also becomes uint32, removing the trailing
uint32(volumeID) casts and changing the "missing volume_id" check
from `<= 0` to `== 0`.
Touched files:
weed/plugin/worker/config.go
weed/worker/tasks/balance/plugin_handler.go
weed/worker/tasks/erasure_coding/plugin_handler.go
weed/worker/tasks/vacuum/plugin_handler.go
* refactor(worker): use ReadIntConfig for clamped derive-worker-config helpers
CodeQL still flagged three call sites where ReadInt64Config was being
narrowed to int after a value-range clamp (max_concurrent_moves <= 50,
batch_size <= 100, min_server_count >= 2). The clamp is correct but
CodeQL's flow analysis didn't recognize the bound, so it flagged them
as unbounded narrowing.
Switch to ReadIntConfig (already int32-bounded by the helper) for
those three sites, drop the now-redundant int64 intermediate variables.
Also drops the now-unused `> math.MaxInt32` clamp in
ec_balance.deriveECBalanceWorkerConfig (the helper covers it).
|
||
|
|
b2f4ebb776 |
test(s3tables): add Dremio Iceberg catalog integration tests (#9299)
* test(s3tables): add Dremio Iceberg catalog integration tests
Add comprehensive integration tests for Dremio with SeaweedFS's Iceberg
REST Catalog, following the same patterns as existing Spark and Trino tests.
Tests include:
- Basic catalog connectivity and schema operations
- Table creation, insertion, and querying (CRUD)
- Deterministic table location specification
- Multi-level namespace support
Implementation includes:
- dremio_catalog_test.go: Core test environment and basic operations
- dremio_crud_operations_test.go: Schema and table CRUD testing
- dremio_deterministic_location_test.go: Location and namespace testing
- Comprehensive README and implementation documentation
CI/CD:
- Added dremio-iceberg-catalog-tests job to s3-tables-tests.yml
- Pre-pulls Dremio image, runs with 25m timeout
- Uploads artifacts on failure
* add docstrings to Dremio integration tests and fix CI image pre-pull
- Add function docstrings to all test functions and helper functions
in dremio_catalog_test.go, dremio_crud_operations_test.go, and
dremio_deterministic_location_test.go to improve code documentation
and satisfy CodeRabbit's docstring coverage requirements.
- Make Dremio Docker image pre-pull non-critical in CI workflow.
The pre-pull was failing with access denied error, but the image
can still be pulled at runtime. Using continue-on-error to allow
tests to proceed.
* fix: correct YAML syntax in Dremio CI workflow
Use multi-line run command with pipe operator (|) instead of
inline command with || operator to avoid YAML parsing errors.
The || operator was causing 'mapping values are not allowed here'
syntax errors in the YAML parser.
* make Dremio tests gracefully skip if container unavailable
Modify startDremioContainer and waitForDremio to return boolean values
instead of fataling. Tests now skip gracefully if:
- Dremio Docker image is unavailable
- Container fails to start
- Container doesn't become ready within timeout
This prevents CI failure when Dremio image is not accessible while
still testing the integration when it is available.
* Revert "make Dremio tests gracefully skip if container unavailable"
This reverts commit
|
||
|
|
1da091f798 |
ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up) Six integration test packages had _test.go files but no GitHub workflow running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy bug (and the four cross-SSE copy bugs in #9281) ship undetected was an instance of this same pattern. This change wires three of them into CI and removes a fourth that was deadcode: test/multi_master/ NEW workflow: multi-master-tests.yml - 3-node master raft cluster failover/recovery (5 tests, ~65s) test/testutil/ (run alongside multi_master) - port-allocator regression test test/s3/etag/ NEW workflow: s3-etag-acl-tests.yml - PutObject ETag format regression for #7768 (must be pure MD5 hex, not "<md5>-N" composite, for AWS Java SDK v2 compatibility) test/s3/acl/ (same workflow as etag) - object-ACL behavior on versioned buckets test/s3/catalog_trino/ DELETED (deadcode) - Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go from a 2024 commit that was never iterated, while the test/s3tables/ counterpart has been actively maintained (and IS in CI via s3-tables-tests.yml's trino-iceberg-catalog-tests job). Both workflows trigger only on changes to relevant code paths and use the existing simple "build weed → run go test" pattern (no per-test-dir Makefile boilerplate). The S3 workflow starts a single `weed mini` shared by etag and acl, which keeps the job under 2 minutes on a fresh runner. Two tests remain knowingly uncovered: test/s3/basic/ — order-dependent state across tests (TestListObjectV2 expects a bucket created by an earlier test, etc.) and uses the deprecated aws-sdk-go v1. Treated as sample programs, not a regression suite. Fixing them is out of scope for this PR. test/s3/catalog_trino/ — see "DELETED" above. Verified locally: - go test -v -timeout=8m ./test/multi_master/... ./test/testutil/... PASS (5 multi_master + 1 testutil tests, 64s) - weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/... PASS (8 etag + 5 acl tests, ~6s after server startup) * ci: fix log-collector glob for multi-master tests (review feedback on #9283) test/multi_master/cluster.go creates per-test temp dirs via os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to match that prefix. The previous version looked for MasterCluster* / TestLeader* / TestTwoMasters* / TestAllMasters* which never matches — the failure-artifact upload would have been empty on a real failure. Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it actually picks up the per-node master*.log files under <baseDir>/logs/. Found by coderabbitai review on PR #9283. |
||
|
|
82cf60a44f |
fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) (#9280)
* fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) The remaining failure mode in #8908 was that Docker Registry's blob finalization (server-side Move via UploadPartCopy) silently corrupts SSE-S3 multipart objects. Reproduces with `aws s3api upload-part-copy` under bucket-default SSE-S3: the GET on the completed object returns deterministic wrong bytes (correct length, same wrong SHA-256 across runs). The metadata is mathematically self-consistent — every chunk's stored IV equals `calculateIVWithOffset(baseIV_dst, partLocalOffset)` — but the bytes on disk were encrypted with the SOURCE upload's key+baseIV. Root cause: - `copyChunksForRange` (and `createDestinationChunk`) constructs new chunks for UploadPartCopy without copying `SseType` / `SseMetadata`, so destination chunks are written with `SseType=NONE`. - At completion, `completedMultipartChunk` (PR #9224's NONE→SSE_S3 backfill, intended to recover from a different missing-metadata bug) sees those NONE chunks under an SSE-S3 multipart upload and backfills SSE-S3 metadata derived from the destination upload's baseIV. The chunk metadata is now internally consistent and the GET path applies decryption — but the bytes on disk are encrypted with the source upload's key, not the destination's. Decryption produces deterministic garbage. Docker Registry pulls then fail with "Digest did not match". Fix: when either the source object or the destination multipart upload has any SSE configured, take a slow-path UploadPartCopy that (1) opens a plaintext reader of the source range — decrypting the source's per-chunk SSE-S3 metadata if needed via a reused `buildMultipartSSES3Reader`, and (2) feeds that plaintext through `putToFiler`'s existing encryption pipeline by staging the destination upload entry's SSE-S3/SSE-KMS headers on a cloned request. Encryption then matches PutObjectPart's contract: every part starts a fresh CTR stream from counter 0 with `baseIV_dst`, and each internal chunk's metadata records `calculateIVWithOffset(baseIV_dst, chunk.partLocalOffset)`. The `non-SSE → non-SSE` case still takes the existing fast raw-byte copy path — bytes on disk are plaintext on both sides, so chunk-level metadata is irrelevant. Cross-encryption from SSE-KMS / SSE-C sources is left as TODO — the new path returns an explicit error rather than the previous silent corruption. SSE-S3 (the user-reported case) round-trips correctly. Tests: - test/s3/sse/s3_sse_uploadpartcopy_integration_test.go pins three UploadPartCopy shapes against bucket-default SSE-S3: * Docker-Registry-shape 32MB+tail (the user's exact 5-chunk / 2-part metadata layout) * single full-object UploadPartCopy * many small range copies Each round-trips SHA-256. - test/s3/sse/s3_sse_concurrent_repro_test.go covers the parallel multipart-upload shape from the user report (5 blobs in parallel, full GET and chunked range GET both hash-checked) — pre-existing coverage; added here as a regression sentinel. * test(s3-sse): rename UploadPartCopy regression test so CI matches it The CI workflow .github/workflows/s3-sse-tests.yml dispatches on the TEST_PATTERN ".*Multipart.*Integration" — i.e. the test name must contain both "Multipart" and "Integration" for CI to run it. The previous name TestSSES3UploadPartCopyIntegration had only "Integration"; "UploadPart" isn't "Multipart". Rename to TestSSES3MultipartUploadPartCopyIntegration so the regression test actually runs in CI rather than only locally. * fix(s3api): map unsupported UploadPartCopy SSE source to 501, not 500 (review feedback on #9280) openSourcePlaintextReader explicitly rejects SSE-KMS and SSE-C sources (SSE-S3 is the only one wired up in this slow path so far). Earlier the caller blanket-mapped that to ErrInternalError, which collapses "this shape isn't implemented yet" into the same 500 response a real server failure would produce. Clients can no longer tell whether they hit a feature gap or a bug. Introduce a sentinel errCopySourceSSEUnsupported and have copyObjectPartViaReencryption errors.Is-check it; on match, return ErrNotImplemented (501) instead of ErrInternalError (500). Other failures still map to 500. Found by coderabbitai review on PR #9280. * fix(s3api): UploadPartCopy must fail with NoSuchUpload when upload entry is missing (review feedback on #9280) CopyObjectPartHandler's earlier checkUploadId call only verifies that the uploadID's hash prefix matches dstObject; it does not prove the upload directory exists in the filer. The previous logic silently swallowed filer_pb.ErrNotFound from getEntry(uploadDir) and fell through with uploadEntry=nil, which then skipped the destination SSE check and could route a plain-source copy through the raw-byte fast path even though the destination's encryption state is unknown. Treat ErrNotFound as ErrNoSuchUpload so the client sees the right status, matching the AWS S3 contract for UploadPartCopy on a non-existent upload. Found by coderabbitai review on PR #9280. * feat(s3api): set SSE response headers on UploadPartCopy slow path (review feedback on #9280) PutObjectPartHandler writes x-amz-server-side-encryption (and the KMS key-id header for SSE-KMS) on every successful part response so clients can confirm the destination's encryption state. The new UploadPartCopy slow path was missing this — it returned only the ETag in the response body and no SSE response headers. Plumb putToFiler's SSEResponseMetadata back through copyObjectPartViaReencryption to the handler, then call setSSEResponseHeaders before writing the XML response, matching the PutObjectPart contract. Found by gemini-code-assist review on PR #9280. * fix(s3api): map transient filer errors on UploadPartCopy upload-entry fetch to 503 (review feedback on #9280) Earlier non-ErrNotFound errors from getEntry(uploadDir, uploadID) all returned 500 InternalError, which most SDKs treat as fatal — even though a transient filer outage (gRPC Unavailable, leader election in flight, deadline exceeded) is exactly the kind of failure SDK retry logic is supposed to recover from. Add an isTransientFilerError helper that recognises: - context.DeadlineExceeded / context.Canceled - gRPC codes.Unavailable, DeadlineExceeded, ResourceExhausted, Aborted When the upload-entry fetch fails for one of those reasons, return 503 ServiceUnavailable so the client retries; everything else still maps to 500. Log line now also carries dstObject (in addition to dstBucket and uploadID) to make incident triage easier. Found by gemini-code-assist review on PR #9280. |
||
|
|
02574314f6 |
test(s3): force-drop collection after deleteBucket in tagging/versioning/cors/copying (#9270)
* test(s3): force-drop collection after deleteBucket across tagging/versioning/cors/copying Each test creates a unique bucket (= new SeaweedFS collection) and the master's warm-create issues a 7-volume grow batch. The S3 DeleteBucket-driven collection sweep snapshots the layout once, but in-flight `volume_grow` requests keep registering volumes after the snapshot, leaking 1-3 volumes per bucket. On a single `weed mini` data node with the auto-derived volume cap, those leaks pile up fast and every subsequent PutObject 500s with "Not enough data nodes found". Mirror the retention-suite fix (commits |
||
|
|
35fe3c801b |
feat(nfs): UDP MOUNT v3 responder + real-Linux e2e mount harness (#9267)
* feat(nfs): add UDP MOUNT v3 responder
The upstream willscott/go-nfs library only serves the MOUNT protocol
over TCP. Linux's mount.nfs and the in-kernel NFS client default
mountproto to UDP in many configurations, so against a stock weed nfs
deployment the kernel queries portmap for "MOUNT v3 UDP", gets port=0
("not registered"), and either falls back inconsistently or surfaces
EPROTONOSUPPORT — surfacing as the user-visible "requested NFS version
or transport protocol is not supported" reported in #9263. The user has
to add `mountproto=tcp` or `mountport=2049` to mount options to coerce
TCP just for the MOUNT phase.
Add a small UDP responder that speaks just enough of MOUNT v3 to handle
the procedures the kernel actually invokes during mount setup and
teardown: NULL, MNT, and UMNT. The wire layout for MNT mirrors
handler.go's TCP path so both transports produce the same root
filehandle and the same auth flavor list for the same export. Other
v3 procedures (DUMP, EXPORT, UMNTALL) cleanly return PROC_UNAVAIL.
This commit only adds the responder; portmap-advertise and Server.Start
wire-up follow in subsequent commits so each step stays independently
reviewable.
References: RFC 1813 §5 (NFSv3/MOUNTv3), RFC 5531 (RPC). Existing
constants and parseRPCCall / encodeAcceptedReply helpers from
portmap.go are reused so behaviour stays consistent across both UDP
listening goroutines.
* feat(nfs): advertise UDP MOUNT v3 in the portmap responder
The portmap responder advertised TCP-only entries because go-nfs only
serves TCP, but with the new UDP MOUNT responder in place we can now
honestly advertise MOUNT v3 over UDP as well. Linux clients whose
default mountproto is UDP query portmap during mount setup; if the
answer is "not registered" some kernels translate the result to
EPROTONOSUPPORT instead of falling back to TCP, which is exactly the
failure pattern reported in #9263.
Add the entry, refresh the doc comment, and extend the existing
GETPORT and DUMP unit tests so a regression that drops the entry shows
up at unit-test granularity rather than only in an end-to-end mount.
* feat(nfs): start UDP MOUNT v3 responder alongside the TCP NFS listener
Plug the new mountUDPServer into Server.Start so it comes up on the
same bind/port as the TCP NFS listener. Started before portmap so a
portmap query that races a fast client never returns a UDP MOUNT entry
the responder isn't actually answering, and shut down via the same
defer chain so a portmap-or-listener startup failure doesn't leave the
UDP responder dangling.
The portmap startup log now reflects all three advertised entries
(NFS v3 tcp, MOUNT v3 tcp, MOUNT v3 udp) so operators can confirm at a
glance that the UDP MOUNT path is up.
Verified end-to-end: built a Linux/arm64 binary, ran weed nfs in a
container with -portmap.bind, and mounted from another container using
both the user-reported failing setup from #9263 (vers=3 + tcp without
mountport) and an explicit mountproto=udp to force the new code path.
The trace `mount.nfs: trying ... prog 100005 vers 3 prot UDP port 2049`
now leads to a successful mount instead of EPROTONOSUPPORT.
* docs(nfs): note that the plain mount form works on UDP-default clients
With UDP MOUNT v3 now served alongside TCP, the only path that ever
required mountproto=tcp / mountport=2049 — clients whose default
mountproto is UDP — works against the plain mount example. Update the
startup mount hint and the `weed nfs` long help so users don't go
hunting for a mount-option workaround that no longer applies.
The "without -portmap.bind" branch is unchanged: that path still has
to bypass portmap entirely because there is no portmap responder for
the kernel to query.
* test(nfs): add kernel-mount e2e tests under test/nfs
The existing test/nfs/ harness boots a real master + volume + filer +
weed nfs subprocess stack and drives it via go-nfs-client. That covers
protocol behaviour from a Go client's perspective, but anything
mis-coded once a real Linux kernel parses the wire bytes is invisible:
both ends of the test use the same RPC library, so identical bugs
round-trip cleanly. The two NFS issues hit recently were exactly that
shape — NFSv4 mis-routed to v3 SETATTR (#9262) and missing UDP MOUNT v3
— and only surfaced in a real client.
Add three end-to-end tests that mount the harness's running NFS server
through the in-tree Linux client:
- TestKernelMountV3TCP: NFSv3 + MOUNT v3 over TCP (baseline).
- TestKernelMountV3MountProtoUDP: NFSv3 over TCP, MOUNT v3 over UDP
only — regression test for the new UDP MOUNT v3 responder.
- TestKernelMountV4RejectsCleanly: vers=4 against the v3-only server,
asserting the kernel surfaces a protocol/version-level error rather
than a generic "mount system call failed" — regression test for the
PROG_MISMATCH path from #9262.
The tests pass explicit port=/mountport= mount options so the kernel
never queries portmap, which means the harness doesn't need to bind
the privileged port 111 and won't collide with a system rpcbind on a
shared CI runner. They t.Skip cleanly when the host isn't Linux, when
mount.nfs isn't installed, or when the test process isn't running as
root.
Run locally with:
cd test/nfs
sudo go test -v -run TestKernelMount ./...
CI wiring follows in the next commit.
* ci(nfs): run kernel-mount e2e tests in nfs-tests workflow
Wire the new TestKernelMount* tests from test/nfs into the existing
NFS workflow:
- Existing protocol-layer step now skips '^TestKernelMount' so a
"skipped because not root" line doesn't appear on every run.
- New "Install kernel NFS client" step pulls nfs-common (mount.nfs +
helpers) and netbase (/etc/protocols, which mount.nfs's protocol-
name lookups need to resolve `tcp`/`udp`).
- New privileged step runs only the kernel-mount tests under sudo,
preserving PATH and pointing GOMODCACHE/GOCACHE at the user's
caches so the second `go test` invocation reuses already-built
test binaries instead of redownloading modules under root.
The summary block now lists the three kernel-mount cases explicitly
so a regression on either of #9262 or this PR's UDP MOUNT change is
traceable from the workflow run page.
|
||
|
|
363d5caa85 |
test(s3-retention): purge stale buckets before each create to avoid volume exhaustion
The WORM suite creates one bucket per test, each backed by ~3 reserved volumes on the data node. With ~30 tests and the default `weed mini` volume cap, the data node runs out of slots midway through the run and every PutObject after that fails with InternalError. Hook a sweep of every test-prefix bucket into the create helpers so a panicked or interrupted prior test cannot leak buckets into the next. |
||
|
|
d92c5e057a |
test(iceberg): cross-engine regression coverage for deterministic table locations (#9074) (#9253)
* test(trino): regression for unique-table-location=false (#9074) With #9246's namespace-location property, Trino's REST catalog can resolve table locations even when the connector is configured with `iceberg.unique-table-location=false`, and CREATE/CTAS lands at the deterministic <namespace-location>/<tableName> path with no UUID-suffixed sibling. Lock that in: - writeTrinoConfig now parameterizes the unique-table-location flag via a withDeterministicTableLocation() option. - setupTrinoTest forwards config options. - TestTrinoDeterministicLocationCTAS exercises a fresh CREATE TABLE + CTAS with the flag flipped off and asserts the on-disk layout has no UUID-suffixed sibling under the namespace dir, proving each table occupies a single dir. Refs #9074 * test(spark): regression for CTAS without explicit location (#9074) iceberg-spark has no equivalent of Trino's unique-table-location flag — its REST catalog interactions always produce the deterministic <namespace-location>/<tableName> path. Without #9246's namespace-location property, Spark cannot resolve a table location for a CREATE TABLE that omits an explicit LOCATION clause; with it, the operation succeeds and the table lands at the expected single-dir-per-table layout. TestSparkDeterministicLocationCTAS walks the same scenario as the Trino test: CREATE TABLE without LOCATION, INSERT, CTAS, SELECT count, then asserts via S3 ListObjectsV2 that no UUID-suffixed sibling directory appears under the namespace. Refs #9074 * test(duckdb): read table at deterministic location via REST catalog (#9074) DuckDB's iceberg extension is a read-only consumer in this flow — there is no client-side UUID-suffixing toggle to test. The relevant question post-#9246 is whether DuckDB can ATTACH the REST catalog and resolve a table at a deterministic <bucket>/<namespace>/<tableName> path produced by writers that don't suffix UUIDs (iceberg-spark, pyiceberg, Trino with unique-table-location=false). TestDuckDBDeterministicLocationRead creates a namespace + minimal table via direct REST API calls (so no client-side UUID is added), confirms the catalog returns a deterministic location URL, then runs DuckDB through ATTACH ... TYPE 'ICEBERG' and DESCRIBE on the table. Asserting DESCRIBE succeeds proves DuckDB walked the catalog → metadata → schema chain against the deterministic on-disk path. The test skips gracefully when the DuckDB image lacks the iceberg extension or the ATTACH-iceberg syntax, so older base images don't fail the suite. Refs #9074 |
||
|
|
fe50da4934 |
test(fuse): stream verify-phase diagnostics from writeback stress test
The 45m suite alarm fires on TestWritebackCacheStressSmallFiles with no output from the test, since t.Logf is buffered until the test completes and the alarm panic skips that flush. Add streaming stderr progress, an explicit verify-phase budget that t.Fatalf's with a goroutine dump on overrun, and per-retry/per-failure logging so the next hang shows which file(s) the mount could not read back. |
||
|
|
f50917224a |
fix(iceberg): default namespace location so fresh CTAS does not race metadata write (#9074) (#9246)
* fix(iceberg): advertise default namespace location for REST clients
Trino's REST catalog has two code paths for CREATE TABLE depending on
whether a table location can be resolved before the catalog call:
// TrinoRestCatalog.newCreateTableTransaction (Trino 479)
if (location.isEmpty()) {
return tableBuilder.create().newTransaction(); // EAGER: REST POST now
}
return tableBuilder.withLocation(...).createTransaction(); // DEFERRED
`tableLocation` resolves to null when the user does not pass
`location = '...'` AND `defaultTableLocation` returns null. The latter
happens whenever the namespace's `loadNamespaceMetadata` response has no
`location` property — and our handler returned exactly that.
In the eager branch Trino calls REST POST /v1/.../tables immediately, so
our handleCreateTable persists `<location>/metadata/v1.metadata.json` to
the filer. Trino's IcebergMetadata.beginCreateTable then runs
`fileSystem.listFiles(location).hasNext()` on the same path, finds the
metadata file we just wrote, and throws "Cannot create a table on a
non-empty location" — even on a brand-new schema and table name.
Synthesize a default `location` of `s3://<bucket>/<flattened-namespace>`
on Get/Create namespace responses when one is not stored. With a non-
null `defaultTableLocation`, Trino takes the deferred branch, picks a
unique `<namespace-location>/<table>-<UUID>` path (UUID added by the
standard `iceberg.unique-table-location=true` setting), and the empty-
location check passes. The actual REST POST is deferred to commit time,
so our metadata write lands alongside the data files Trino has already
produced.
Existing namespaces with an explicit `location` property are untouched —
the synthesis only kicks in when the property is absent.
Refs #9074
* test(trino): regression for fresh CREATE TABLE without explicit location
Exercises the follow-up scenario reported on issue #9074: a CREATE TABLE
on a brand-new schema and brand-new table name with NO `location = '...'`
clause, followed by a CTAS on top — exactly the SQL pattern from the
report. Before advertising a default namespace location, the first
CREATE TABLE failed with
Cannot create a table on a non-empty location:
s3://iceberg-tables/<schema>/<table>, set
'iceberg.unique-table-location=true' in your Iceberg catalog
properties to use unique table locations for every table.
even though the bucket was genuinely empty. The bug surfaced because our
handleCreateTable persists `<location>/metadata/v1.metadata.json` during
Trino's eager-create branch, and Trino's post-create listFiles
emptiness check then trips on the metadata file we just wrote.
The test asserts the CTAS succeeds AND the resulting table contains the
source rows, since the reporter saw the table get created with empty
data when the query failed.
Refs #9074
|
||
|
|
ac3a756dae |
test(s3-retention): force-drop collection after deleteBucket to free volumes
COMPLIANCE-mode retention leaves objects that BypassGovernanceRetention cannot clear, so the test's DeleteBucket keeps returning BucketNotEmpty and the underlying SeaweedFS collection (with its 7 reserved volumes) leaks. After a few leaks on the single-node `weed mini` server, the master logs "Not enough data nodes found" and every subsequent PutObject 500s, timing the suite out. Call the master's /col/delete admin endpoint from deleteBucket so the collection's volumes are reclaimed even when S3-level cleanup is blocked. |
||
|
|
6cbcdf488c |
chore(mount,fuse-test): diagnostics for FUSE ConcurrentReadWrite ENOENT flake
PR #9230 attempt 1 hit an intermittent TestConcurrentFileOperations/ConcurrentReadWrite failure where stat returned ENOENT for a path all writers had just succeeded against, and the captured mount.log carried no signal about which layer dropped the entry because the relevant lookup logged at V(4). Two diagnostic-only changes (no behavior change on the happy path): - weed/mount/weedfs.go: in lookupEntry, when filer GetEntry returns ErrNotFound for a path whose inode is still tracked locally with no in-flight create or flush, log Warningf with inode + dirtyHandle + pendingFlush + localCache + dirCached. This surfaces layer-by-layer state at the moment of the suspicious ENOENT. - test/fuse_integration/framework_test.go: on AssertFileExists failure, dump five 100ms-spaced stat retries, a parent ReadDir, and a direct O_RDONLY open before failing. Triangulates kernel dentry caching vs mount lookup vs filer state. |
||
|
|
4f628ff4e5 |
fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908) (#9228)
* fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908) buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY chunk upfront, then walked them with io.MultiReader. For a multipart SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob with 25+ chunks), N volume-server bodies sat live at once; chunks 1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent load the volume server's keep-alive logic closed those idle responses mid-flight, and the S3 client saw `unexpected EOF` partway through the GET. Truncated bytes hash to the wrong SHA-256, which is exactly the "Digest did not match" symptom Docker Registry reports in #8908 (and which persisted even after the per-chunk metadata fix in #9211 and the completion backfill in #9224). Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk, wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for the SSE-specific decryption setup. Per-chunk metadata is still validated UPFRONT so a malformed chunk fails fast without opening any HTTP connection -- the eager validation contract callers and tests rely on is preserved. The volume-server GET and the SSE-specific decrypt wrap, however, fire LAZILY: at most one chunk body is live at any time, regardless of object size. This commit applies the new pattern to buildMultipartSSES3Reader only; the SSE-KMS and SSE-C multipart readers retain their eager form for now and will be migrated in follow-up commits, since the same shape exists there too. Tests: - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract: zero chunks opened at construction, peak liveness == 1, all closed after drain. - TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch (replaces ClosesAppendedOnError) asserts a malformed chunk in position N causes zero fetches for chunks 0..N -- the previous test pinned a weaker contract (cleanup after eager open). - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same reason: the fetch callback must NOT be invoked at all on a bad-IV chunk. - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts encrypted the way putToFiler writes them (shared DEK + baseIV, partOffset=0, post-completion global offsets) and walks them through buildMultipartSSES3Reader. * fix(s3api): stream multipart SSE-KMS chunks lazily Apply the same fix as the previous commit to createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata is validated upfront, but volume-server GETs fire lazily through lazyMultipartChunkReader. At most one chunk body is live at any time. This is the same eager-open-all-chunks shape that produced #8908's truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks were exposed to the same idle-keepalive failure mode under concurrent load. The wire format on disk is unchanged (same per-chunk metadata, same encrypted bytes, same object Extended attributes). Existing SSE-KMS multipart objects read back identically -- only when the volume-server GETs fire changes. * fix(s3api): stream multipart SSE-C chunks lazily Apply the same fix as the previous two commits to createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is validated upfront (IV decode, IV length check, non-negative PartOffset), but the volume-server GET and CreateSSECDecryptedReader- WithOffset wrap fire lazily through lazyMultipartChunkReader. At most one chunk body is live at any time. This is the same eager-open-all-chunks shape that produced #8908's truncated GETs for SSE-S3; SSE-C multipart objects with many chunks were exposed to the same idle-keepalive failure mode under concurrent load. The pre-existing TODO note about CopyObject SSE-C PartOffset handling is preserved verbatim. The wire format on disk is unchanged (same per-chunk metadata, same encrypted bytes); existing SSE-C multipart objects read back identically. After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS, SSE-C) share lazyMultipartChunkReader as their streaming engine. * test(s3): add Docker Registry-shape multipart SSE-S3 GET regression Pin the end-to-end fix for #8908 with a test that mirrors what Docker Registry actually does on pull: a 25-part * 5MB upload with bucket- default SSE-S3, then a full GET, then SHA-256 over the streamed body must match SHA-256 over the uploaded bytes. The eager-multipart-reader bug was specifically a streaming truncation under load: the response status was 200 with a Content-Length matching the object size, but the body short-circuited mid-stream because later chunks' volume-server connections had already been closed by keepalive. The hash check is the symptom Docker Registry surfaces ("Digest did not match"), so this is the most faithful regression we can pin without spinning up a registry. uploadAndVerifyMultipartSSEObject already byte-compares the GET body, but hashing on top is intentionally explicit -- it documents WHY the test exists, and matches the failure mode reported in the issue. * test(s3): add range-read coverage matrix across SSE modes and sizes Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB) single-chunk objects, with one ad-hoc range case per SSE mode and one 129-byte boundary-crossing case in TestSSEMultipartUploadIntegration. Nothing exercised: - Range reads on single-PUT objects whose content crosses the 8MB internal chunk boundary (medium size class). - Range reads on multipart objects whose parts each span multiple internal chunks (large size class) -- the shape #8908 originally surfaced for full-object GETs and the most likely site of any future regression in per-chunk IV / PartOffset plumbing for partial reads. - A consistent range-pattern set applied uniformly across SSE modes, so any divergence between modes (SSE-C uses random IV + PartOffset; SSE-S3/KMS use base IV + offset) is comparable at a glance. TestSSERangeReadCoverageMatrix introduces a parameterized matrix: modes: no_sse, sse_c, sse_kms, sse_s3 sizes: small (256KB single chunk), medium (12MB single PUT crossing one internal boundary), large (5x9MB multipart, ~10 internal chunks, every part itself spans an 8MB boundary) ranges: single byte at 0, prefix 512B, single byte at last, suffix bytes=-100, open-ended bytes=N-, whole object, AES-block boundary 15-31, mid straddling one internal boundary (medium+large), mid spanning many internal boundaries (large only) Per case it asserts: body bytes equal the expected slice, Content-Length matches the range length, Content-Range matches start-end/total, and the SSE response headers match the mode. The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's the remaining sse_kms subtests with a clear reason if the local server has no KMS provider configured -- the default `weed mini` setup lacks one; the Makefile target `test-with-kms` provides one via OpenBao. Other modes always run. Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against weed mini, sse_kms cleanly skipped. * test(s3): conform new test names to TestSSE*Integration so CI runs them The two tests added in the previous commits had names that did NOT match the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml use to discover SSE integration tests: - test/s3/sse/Makefile `test` target: TestSSE.*Integration - test/s3/sse/Makefile `test-multipart`: TestSSEMultipartUploadIntegration - .github/workflows/s3-sse-tests.yml: ...|.*Multipart.*Integration|.*RangeRequestsServerBehavior Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and the Docker-Registry-shape multipart regression in TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible to CI even though the underlying test setup (start-seaweedfs-ci using s3-config-template.json with the embedded `local` KMS provider) already has SSE-KMS configured. Renames: TestSSERangeReadCoverageMatrix -> TestSSERangeReadIntegration TestSSES3MultipartManyChunks_... -> TestSSEMultipartManyChunksIntegration Both names now match `TestSSE.*Integration` (Makefile `test` target) and TestSSEMultipartManyChunksIntegration additionally matches `.*Multipart.*Integration` (CI's comprehensive subset). No behavior change; only the function names move. Verified locally against `weed mini` with s3-config-template.json: TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes (none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns, all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only fires for ad-hoc local setups that don't load any KMS provider; the project's standard test setup loads the local provider, so CI has full SSE-KMS range coverage. * fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch Addresses CodeRabbit review on PR #9228: in createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata was deserialized in the prep loop but the IV length was only validated later, inside CreateSSEKMSDecryptedReader, which runs from the wrap closure -- AFTER the chunk's volume-server fetch has already started. That weakens the new "reject malformed chunks before any fetch" contract for SSE-KMS specifically: a chunk with a missing/short/long IV would fire its HTTP GET, then fail mid-stream during decrypt. The fix moves the existing ValidateIV check into the prep loop, matching the SSE-S3 and SSE-C paths. Drive-by: extract the SSE-KMS prep loop into a free buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader, so the new contract is unit-testable without an S3ApiServer. The exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a thin caller, so behavior for production callers is unchanged. New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract: - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers missing IV, empty IV, short IV, long IV. Each case asserts both that an error is returned AND that the fetch callback is never invoked. - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch pins the analogous behavior when SseMetadata is nil on a chunk in position N: chunks 0..N-1 must not be fetched (the earlier eager implementation depended on a closeAppendedReaders cleanup path; the new contract is stronger -- nothing is opened in the first place). - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch covers the JSON-unmarshal failure branch. - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the documented sort-by-offset contract by recording the order in which fetch is invoked. All four pass under `go test ./weed/s3api/`. Existing weed/s3api unit suite + the SSE integration suite (with the local KMS provider enabled via s3-config-template.json) continue to pass. * test(s3): address CodeRabbit nitpicks on range coverage matrix Three small follow-ups on the range-read coverage matrix from the previous commit, per CodeRabbit nitpicks on PR #9228: 1. Promote the body-length check from `assert.Equal` to `require.Equal` so a truncation regression -- the canonical #8908 failure mode -- aborts the subtest immediately. Previously the assertion logged a length mismatch and then `assertDataEqual` ran on differently-sized slices, producing a noisy byte-diff on top of the actual symptom. The redundant trailing `t.Fatalf` block becomes dead and is removed. 2. Broaden the SSE-KMS probe-skip heuristic. The probe previously produced the friendly "KMS provider not configured" message only for 5xx responses; KMS-misconfig surfaces also include 501 NotImplemented, 4xx KMS.NotConfigured, and error messages containing "KMS.NotConfigured" / "NotImplemented" / "not configured". The behaviour change is purely cosmetic (the caller t.Skip's on any non-empty reason either way) but the new diagnostic is more useful in CI logs. 3. Add `t.Parallel()` at the mode and size-class levels of the matrix. Each (mode, size) writes an independent object key under the shared bucket, with no cross-talk, so parallel execution is safe. Local wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%); the savings scale with chunk count and CI machine concurrency. Verified locally against `weed mini` with s3-config-template.json: - go test ./weed/s3api/ -count=1 PASS - TestSSERangeReadIntegration -v 112 PASS, 0 SKIP - TestSSEMultipartUploadIntegration etc. PASS * fix(s3api): tighten lazy reader error path; unify SSE IV validation Three CodeRabbit nitpicks on PR #9228: 1. lazyMultipartChunkReader: mark finished on non-EOF Read errors The Read loop's three earlier failure paths (chunk index past end, fetch error, wrap error) all set l.finished = true before returning. The non-EOF Read path -- where l.current.Read itself errors mid-chunk -- did not, leaving l.current/l.closer set and l.finished = false. A caller that retried Read after an error would re-enter the same broken stream instead of advancing or giving up. Set l.finished = true on non-EOF Read error so post-error state is consistent across all four failure sites; Close() (which the GetObjectHandler defers) still releases the chunk body. 2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths The previous commit moved SSE-KMS to the shared ValidateIV helper but left SSE-S3 and SSE-C with bespoke inline `len(...) != AESBlockSize` checks. All three are enforcing the same invariant; inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to ValidateIV too, with the same `<algo> chunk <fileId> IV` name convention. Error message wording shifts from "<algo> chunk X has invalid IV length N (expected 16)" to ValidateIV's "invalid <algo> chunk X IV length: expected 16 bytes, got N". The substring "IV length" is preserved across both, so the existing TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion is loosened to match either form. 3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering The test previously drove Read() to observe fetch-call order, but CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap the encrypted DEK -- unavailable in unit tests -- so the wrap closure failed on the first chunk and only one fetch was ever recorded. The test asserted only fetchOrder[0] == "c0", which is weaker than the comment promised. Switch to a static check: type-assert the returned reader to *lazyMultipartChunkReader (same package so unexported fields are accessible) and inspect the prepared chunks slice directly. This pins the entire [c0, c1, c2] sort order in one place, doesn't depend on KMS, and runs in zero fetch calls. The fetch closure now asserts it is never invoked during preparation. All weed/s3api unit tests pass; integration suite (with KMS provider configured via s3-config-template.json) passes. * test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe Two CodeRabbit comments on PR #9228, both about test/s3/sse/s3_sse_range_coverage_test.go: 1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration The test creates one bucket up front, then runs subtests that call t.Parallel() at the mode and size levels (added in |
||
|
|
525900dfe4 |
fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224)
* fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion When a part of an SSE-S3 multipart upload lands with SseType=NONE on its chunks (e.g. a transient failure to apply SSE-S3 setup in PutObjectPart), the completed object inherits NONE-tagged chunks and detectPrimarySSEType then misses the chunked SSE-S3 encryption. The read path falls through to the unencrypted serve and GET returns ciphertext, producing the SHA mismatch reported in #8908. Recover at completion using the base IV and key data the upload directory recorded at CreateMultipartUpload: - extractMultipartSSES3Info validates upload-entry metadata up front and hard-fails completion if the base IV or key data are malformed; serializing chunk metadata we then could not decrypt is worse than rejecting the upload. - completedMultipartChunk re-derives a per-chunk IV from baseIV + chunk.Offset (matching what putToFiler would have written) and serializes per-chunk SSE-S3 metadata when the chunk has no tag. Existing per-chunk metadata is left alone; we cannot recover an already-derived IV from the upload-entry alone. The IV formula intentionally has no partNumber term: putToFiler hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption for every part, so each chunk's encryption IV is calculateIVWithOffset(baseIV, chunk.Offset_part_local). PartOffsetMultiplier is defined in s3_constants but is not consumed by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier + chunk.Offset would produce IVs that fail to decrypt the bytes on disk - a stronger failure mode than the bug being fixed. Tests pin this: - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext runs the round trip across the encryption boundary: encrypt parts with CreateSSES3EncryptedReaderWithBaseIV (the call putToFiler uses), drop chunk metadata to reproduce #8908, backfill, decrypt with backfilled IV, assert plaintext intact. - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula constructs the IV the partNumber formula would produce and shows it does not decrypt the actual ciphertext. This commit covers the chunk-level recovery only. The companion fix for the object-level Extended attributes (SeaweedFSSSES3Key / X-Amz-Server-Side-Encryption) follows separately. * fix(s3api): backfill canonical SSE-S3 attributes onto multipart object The previous commit ensures every chunk of an SSE-S3 multipart upload carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct read path can decrypt. The completed object's Extended map can still miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal look at: - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header detectPrimarySSEType reads on inline / small-object reads) - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by IsSSES3EncryptedInternal and by the read-path key lookup) When a part of the upload was written by a path that did not set those (the same #8908 race that produced the NONE chunks), copySSEHeadersFromFirstPart finds nothing to copy and the final entry ends up with only the multipart-init keys (SeaweedFSSSES3Encryption / BaseIV / KeyData). The read path then mis-detects the object as unencrypted. applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair from the multipart-init metadata in all three completion paths (versioned, suspended, non-versioned), only when the keys are missing so a healthy first part still wins. extractMultipartSSES3Info already ran in prepareMultipartCompletionState, so the data is reused without re-decoding. Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill, do-not-clobber, and nil-info no-op cases. * fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV (calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the adjusted IV to CreateSSEKMSDecryptedReader, which itself runs calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The offset was being applied twice for any chunk with a non-zero ChunkOffset, corrupting the keystream for range reads that cross multipart chunk boundaries. Pass the raw SSE-KMS key (with base IV and the original ChunkOffset field) into CreateSSEKMSDecryptedReader so the offset is applied exactly once, and remove the now-dead intra-block skip that was compensating for the double adjustment. Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment that decrypts the same ciphertext with a deliberately double-adjusted IV and asserts the output is corrupted, so any regression that re-introduces the double application fails the unit test. * test(s3): cover multipart SSE across chunk-spanning parts and ranges Adds an integration subtest "Multipart Parts Larger Than Internal Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that exercises the end-to-end S3 path for the bugs fixed in this branch: - Two-part multipart upload with each part larger than the 8MB internal SeaweedFS chunk, so each part itself spans multiple underlying chunks. - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default SSE-S3 - the four paths multipart parts can take through the SSE pipeline. - Each subtest does a full GET (verifying every byte and the response Content-Length / SSE response headers) plus a 129-byte range read straddling the 8MB internal chunk boundary, which is the path that produced the SSE-KMS double-IV corruption (fix in the previous commit) and the SSE-S3 chunk-tag loss (fix in the earlier commits). Factored the request shape behind multipartSSEOptions / uploadAndVerifyMultipartSSEObject so all four SSE flavors share the same upload+verify code; only the SSE-specific input/output configuration differs per subtest. * test(s3): abort orphan multipart uploads on test failure Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The helper used require.NoError after CreateMultipartUpload, UploadPart and CompleteMultipartUpload, so a failure in any of those (or in the later GET / range read on a still-incomplete upload) called t.Fatal without aborting the in-flight MPU, leaving an orphan upload in the bucket. Harmless in CI where the data dir is wiped on shutdown, but a real annoyance when iterating locally and a textbook AWS S3 caveat in production. Register a t.Cleanup that calls AbortMultipartUpload unless a "completed" flag was set right after a successful CompleteMultipartUpload. Use context.Background for the abort call since the parent ctx may already be cancelled at cleanup time, and t.Logf the abort error rather than failing the test so the original failure remains visible in the run output. |
||
|
|
a14cbc176b | debug(kafka): add restart flake diagnostics | ||
|
|
a0be40e070 | Merge branch 'master' of https://github.com/seaweedfs/seaweedfs | ||
|
|
b94ad82472 |
fix(test): stabilize ConcurrentLockContention; warn on coherence drift
TestPosixFileLocking/ConcurrentLockContention failed in CI (run 24857323067) with ENOENT when re-opening the file after all 8 workers had successfully written and closed. The 20s openWithRetry budget was exhausted, pointing at a real but unproven metaCache/parent-cache coherence issue in the mount under bursts of concurrent Release. Test: hold the initial fd open for the whole subtest; use it for the post-workers Sync() and the verification read. Workers still exercise the concurrent-flock invariant and per-record write correctness; the re-open path is no longer load-bearing. On Eventually failure, dump ReadDir of the parent, Stat, and a fresh O_RDONLY open so a future recurrence has state to debug from. Drop the darwin-only ENOENT t.Skip branches that hid this same flake. Mount: in weedfs.lookupEntry, when returning ENOENT from the "parent cached but child missing" branch, log at Warningf instead of V(4) when the kernel is still tracking this path's inode. That combination is the smoking-gun signal for cache drift and is rare enough in normal use not to spam the log. |
||
|
|
cd5004cfbd |
build(deps): bump github.com/Azure/go-ntlmssp from 0.1.0 to 0.1.1 in /test/kafka (#9204)
build(deps): bump github.com/Azure/go-ntlmssp in /test/kafka Bumps [github.com/Azure/go-ntlmssp](https://github.com/Azure/go-ntlmssp) from 0.1.0 to 0.1.1. - [Release notes](https://github.com/Azure/go-ntlmssp/releases) - [Commits](https://github.com/Azure/go-ntlmssp/compare/v0.1.0...v0.1.1) --- updated-dependencies: - dependency-name: github.com/Azure/go-ntlmssp dependency-version: 0.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
f438cc3544 |
fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard (#9184) (#9186)
* test(volume_server): reproduce #9184 ReceiveFile truncating a mounted shard ReceiveFile for an EC shard calls os.Create(filePath) which opens the path with O_TRUNC. When the shard is already mounted, the in-memory EcVolume holds a file descriptor against the same inode, so a second ReceiveFile call for the same (volume, shard) truncates the live shard file beneath the reader. Reproducer: generate and mount shard 0 for a populated volume, capture the on-disk size, then send a smaller payload for the same shard via ReceiveFile. The current handler accepts the overwrite and leaves the shard truncated in place; this test pins that behavior. When the fix lands the server should reject (or rename-then-swap) and this test must be inverted. * fix(volume_server): refuse ReceiveFile overwrite of mounted EC shard ReceiveFile used os.Create on EC shard paths, which opens with O_TRUNC and truncates in place. When an EC shard is already mounted, the in-memory EcVolume holds file descriptors against the same inodes, so the truncation corrupts the live shard beneath any ongoing read. On retries of an EC task this produced the "missing parts" class of errors in #9184. The fix rejects any ReceiveFile for an EC volume that currently has mounted shards. The caller must unmount before retrying — silent truncation is never an acceptable outcome. Non-EC writes and ReceiveFile for volumes that have never been mounted on this server continue to work as before. Tests: - TestReceiveFileRejectsOverwriteOfMountedEcShard: mounts a shard, attempts an overwrite, asserts the error response and that the on-disk file and live reads are undisturbed. - TestReceiveFileAllowsEcShardWhenNoMount: pins the common-case contract that a first write to a target still succeeds. * fix(volume-rust): refuse ReceiveFile overwrite of mounted EC shard Mirror the Go-side change: reject receive_file for any EC volume that currently has mounted shards on this server. std::fs::File::create truncates in place and the in-memory EcVolume holds fds on the same inodes, so an overwrite would corrupt live readers. |
||
|
|
cb882ced46 |
fix(test): retry ENOENT in fcntl lock subprocess helper
TestPosixFileLocking/FcntlReleaseOnClose was flaky because the subprocess spawned by startLockHolder occasionally saw ENOENT when opening a file the parent had just created on the FUSE mount. Retry on ENOENT (matching the existing openWithRetry pattern used in testConcurrentLockContention) so the subprocess waits for the mount's dentry state to propagate before reporting the lock acquire as failed. |
||
|
|
c4e1885053 |
fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement (#9184) (#9185)
* test(volume_server): reproduce #9184 EC ReceiveFile disk-placement bug The plugin-worker EC task sends shards via ReceiveFile, which picks Locations[0] as the target directory regardless of the admin planner's TargetDisk assignment. ReceiveFileInfo has no disk_id field, so there is no wire channel to honor the plan. Adds StartSingleVolumeClusterWithDataDirs to the integration framework so tests can launch a volume server with N data directories. The new repro asserts the current (buggy) behavior: sending three distinct EC shards via ReceiveFile leaves all three files in dir[0] and the other dirs empty. When the fix adds disk_id to ReceiveFileInfo, this assertion must flip to verify the planned placement is respected. * fix(ec): honor disk_id in ReceiveFile so EC shards respect admin placement Before this change, VolumeServer.ReceiveFile for EC shards always selected the first HDD location (Locations[0]). The plugin-worker EC task had no way to pass the admin planner's per-shard disk assignment — ReceiveFileInfo carried no disk_id field — so every received EC shard piled onto a single disk per destination server. On multi-disk servers this caused uneven load (one disk absorbing all EC shard I/O), frequent ENOSPC retries, and a growing EC backlog under sustained ingest (see issue #9184). Changes: - proto: add disk_id to ReceiveFileInfo, mirroring VolumeEcShardsCopyRequest.disk_id. - worker: DistributeEcShards tracks the planner-assigned disk per shard; sendShardFileToDestination forwards that disk id. Metadata files (ecx/ecj/vif) inherit the disk of the first data shard targeting the same node so they land next to the shards. - server: ReceiveFile honors disk_id when > 0 with bounds validation; disk_id=0 (unset) falls back to the same auto-selection pattern as VolumeEcShardsCopy (prefer disk that already has shards for this volume, then any HDD with free space, then any location with free space). Tests updated: - TestReceiveFileEcShardHonorsDiskID asserts three shards sent with disk_id={1,2,0} land on data dirs 1, 2, and 0 respectively. - TestReceiveFileEcShardRejectsInvalidDiskID pins the out-of-range disk_id rejection path. * fix(volume-rust): honor disk_id in ReceiveFile for EC shards Mirror the Go-side change: when disk_id > 0 place the EC shard on the requested disk; when unset, auto-select with the same preference order as volume_ec_shards_copy (disk already holding shards, then any HDD, then any disk). * fix(volume): compare disk_id as uint32 to avoid 32-bit overflow On 32-bit Go builds `int(fileInfo.DiskId) >= len(Locations)` can wrap a high-bit uint32 to a negative int, bypassing the bounds check before the index operation. Compare in the uint32 domain instead. * test(ec): fail invalid-disk_id test on transport error Previously a transport-level error from CloseAndRecv silently passed the test by returning early, masking any real gRPC failure. Fail loudly so only the structured ReceiveFileResponse rejection path counts as a pass. * docs(test): explain why DiskId=0 auto-selects dir 0 in EC placement test Documents the load-bearing assumption that shards are never mounted in this test, so loc.FindEcVolume always returns false and auto-select falls through to the first HDD. Saves future readers from re-deriving the expected directory for the DiskId=0 case. * fix(test): preserve baseDir/volume path for single-dir clusters StartSingleVolumeClusterWithDataDirs started naming the data directory volume0 even in the dataDirCount=1 case, which broke Scrub tests that reach into baseDir/volume via CorruptDatFile / CorruptEcShardFile / CorruptEcxFile. Keep the legacy name for single-dir clusters; only use the indexed "volumeN" layout when multiple disks are requested. |
||
|
|
1220468a33 |
build(deps): bump github.com/rclone/rclone from 1.73.1 to 1.73.5 in /test/kafka (#9189)
build(deps): bump github.com/rclone/rclone in /test/kafka Bumps [github.com/rclone/rclone](https://github.com/rclone/rclone) from 1.73.1 to 1.73.5. - [Release notes](https://github.com/rclone/rclone/releases) - [Changelog](https://github.com/rclone/rclone/blob/master/RELEASE.md) - [Commits](https://github.com/rclone/rclone/compare/v1.73.1...v1.73.5) --- updated-dependencies: - dependency-name: github.com/rclone/rclone dependency-version: 1.73.5 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> |
||
|
|
be9996962d |
fix(test): avoid port collision between master gRPC and volume ports
AllocateMiniPorts(1) reserved masterPort and masterPort+GrpcPortOffset by holding listeners open, but closed them on return. The subsequent AllocatePorts call bound 127.0.0.1:0, so the OS could immediately reuse the just-released mini gRPC port as a volume port — causing the volume server to fail at bind time with "address already in use". Introduce AllocatePortSet(miniCount, regularCount) that holds every listener open until the full set is chosen, and route the five volume test cluster builders through it. |
||
|
|
96e5fea08e |
test(catalog_spark): bound weed shell invocation with 30s timeout
createTableBucket ran `weed shell` via exec.Command with no deadline. When the shell's first command retries on a transient master connection blip, the trailing `exit` on stdin never gets processed and the subprocess blocks until the outer 20m `go test` timeout fires — the surfacing symptom is a flaky 20m panic with no diagnostic output. Wrap the invocation in exec.CommandContext with a 30s timeout, matching the existing pattern in test/s3tables/catalog_risingwave/setup_test.go. |
||
|
|
7f67995c24 |
chore(filer): remove -mount.p2p flag; registry is always on (#9183)
The filer-side mount peer registry (tier 1 of peer chunk sharing) was gated behind -mount.p2p (default true). Idle cost is negligible — a tiny in-memory map plus a 60s sweeper — so the opt-out is not worth the surface area. Removes the flag from weed filer, weed server (-filer.mount.p2p), and weed mini, and always constructs the registry in NewFilerServer. Also drops the now-dead nil guards in MountRegister/MountList/sweeper and the TestMountRegister_DisabledIsNoOp case. |
||
|
|
9ae905e456 |
feat(security): hot-reload HTTPS certs without restart (k8s cert-manager) (#9181)
* feat(security): hot-reload HTTPS certs for master/volume/filer/webdav/admin S3 and filer already use a refreshing pemfile provider for their HTTPS cert, so rotated certificates (e.g. from k8s cert-manager) are picked up without a restart. Master, volume, webdav, and admin, however, passed cert/key paths straight to ServeTLS/ListenAndServeTLS and loaded once at startup — rotating those certs required a pod restart. Add a small helper NewReloadingServerCertificate in weed/security that wraps pemfile.Provider and returns a tls.Config.GetCertificate closure, then wire it into the four remaining HTTPS entry points. httpdown now also calls ServeTLS when TLSConfig carries a GetCertificate/Certificates but CertFile/KeyFile are empty, so volume server can pre-populate TLSConfig. A unit test exercises the rotation path (write cert, rotate on disk, assert the callback returns the new cert) with a short refresh window. * refactor(security): route filer/s3 HTTPS through the shared cert reloader Before: filer.go and s3.go each kept a *certprovider.Provider on the options struct plus a duplicated GetCertificateWithUpdate method. Both were loading pemfile themselves. Behaviorally they already reloaded, but the logic was duplicated two ways and neither path was shared with the newly-added master/volume/webdav/admin wiring. After: both use security.NewReloadingServerCertificate like the other servers. The per-struct certProvider field and GetCertificateWithUpdate method are removed, along with the now-unused certprovider and pemfile imports. Net: -32 lines, one code path for all HTTPS cert reloading. No behavior change — the refresh window, cache, and handshake contract are identical (the helper wraps the same pemfile.NewProvider). * feat(security): hot-reload HTTPS client certs for mount/backup/upload/etc The HTTP client in weed/util/http/client loaded the mTLS client cert once at startup via tls.LoadX509KeyPair. That left every long-lived HTTPS client process (weed mount, backup, filer.copy, filer→volume, s3→filer/volume) unable to pick up a rotated client cert without a restart — even though the same cert-manager setup was already rotating the server side fine. Swap the client cert loader for a tls.Config.GetClientCertificate callback backed by the same refreshing pemfile provider. New TLS handshakes pick up the rotated cert; in-flight pooled connections keep their old cert and drop as normal transport churn happens. To keep this reusable from both server and client TLS code without an import cycle (weed/security already imports weed/util/http/client for LoadHTTPClientFromFile), extract the pemfile wrapper into a new weed/security/certreload subpackage. weed/security keeps its thin NewReloadingServerCertificate wrapper. The existing unit test moves with the implementation. gRPC mTLS was already handled by security.LoadServerTLS / LoadClientTLS; this PR does not change any gRPC paths. MQ broker, MQ agent, Kafka gateway, and FUSE mount control plane are gRPC-only and therefore already rotate. CA bundles (ClientCAs / RootCAs / grpc.ca) are still loaded once — noted as a known limitation in the wiki. * fix(security): address PR review feedback on cert reloader Bots (gemini-code-assist + coderabbit) flagged three real issues and a couple of nits. Addressing them here: 1. KeyMaterial used context.Background(). The grpc pemfile provider's KeyMaterial blocks until material arrives or the context deadline expires; with Background() a slow disk could hang the TLS handshake indefinitely. Switched both the server and client callbacks to use hello.Context() / cri.Context() so a stuck read is bounded by the handshake timeout. 2. Admin server loaded TLS inside the serve goroutine. If the cert was bad, the goroutine returned but startAdminServer kept blocking on <-ctx.Done() with no listener, making the process look healthy with nothing bound. Moved TLS setup to run before the goroutine starts and propagate errors via fmt.Errorf; also captures the provider and defers Close(). 3. HTTP client discarded the certprovider.Provider from NewClientGetCertificate. That leaked the refresh goroutine, and NewHttpClientWithTLS had a worse case where a CA-file failure after provider creation orphaned the provider entirely. Added a certProvider field and a Close() method on HTTPClient, and made the constructors close the provider on subsequent error paths. 4. Server-side paths (master/volume/filer/s3/webdav/admin) now retain the provider. filer and webdav run ServeTLS synchronously, so a plain defer works. master/volume/s3 dispatch goroutines and return while the server keeps running, so they hook Close() into grace.OnInterrupt. 5. Test: certreload_test now tolerates transient read/parse errors during file rotation (writeSelfSigned rewrites cert before key) and reports the last error only if the deadline expires. No user-visible behavior change for the happy path. * test(tls): add end-to-end HTTPS cert rotation integration test Boots a real `weed master` with HTTPS enabled, captures the leaf cert served at TLS handshake time, atomically rewrites the cert/key files on disk (the same rename-in-place pattern kubelet does when it swaps a cert-manager Secret), and asserts that a subsequent TLS handshake observes the rotated leaf — with no process restart, no SIGHUP, no reloader sidecar. Verifies the full path: on-disk change → pemfile refresh tick → provider.KeyMaterial → tls.Config.GetCertificate → server TLS handshake. Runtime is ~1s by exposing the reloader's refresh window as an env var (WEED_TLS_CERT_REFRESH_INTERVAL) and setting it to 500ms for the test. The same env var is user-facing — documented in the wiki — so operators running short-lived certs (Vault, cert-manager with duration: 24h, etc.) can tighten the rotation-pickup window without a rebuild. Defaults to 5h to preserve prior behavior. security.CredRefreshingInterval is kept for API compatibility but now aliases certreload.DefaultRefreshInterval so the same env controls both gRPC mTLS and HTTPS reload. * ci(tls): wire the TLS rotation integration test into GitHub Actions Mirrors the existing vacuum-integration-tests.yml shape: Ubuntu runner, Go 1.25, build weed, run `go test` in test/tls_rotation, upload master logs on failure. 10-minute job timeout; the test itself finishes in about a second because WEED_TLS_CERT_REFRESH_INTERVAL is set to 500ms inside the test. Runs on every push to master and on every PR to master. * fix(tls): address follow-up PR review comments Three new comments on the integration test + volume shutdown path: 1. Test: peekServerCert was swallowing every dial/handshake error, which meant waitForCert's "last err: <nil>" fatal message lost all diagnostic value. Thread errors back through: peekServerCert now returns (*x509.Certificate, error), and waitForCert records the latest error so a CI flake points at the actual cause (master didn't come up, handshake rejected, CA pool mismatch, etc.). 2. Test: set HOME=<tempdir> on the master subprocess. Viper today registers the literal path "$HOME/.seaweedfs" without env expansion, so a developer's ~/.seaweedfs/security.toml is accidentally invisible — the test was relying on that. Pinning HOME is belt-and-braces against a future viper upgrade that does expand env vars. 3. volume.go: startClusterHttpService's provider close was registered via grace.OnInterrupt, which fires on SIGTERM but NOT on the v.shutdownCtx.Done() path used by mini / integration tests. The pemfile refresh goroutine leaked in that shutdown path. Now the helper returns a close func and the caller invokes it on BOTH shutdown paths for parity. Also add MinVersion: TLS 1.2 to the test's tls.Config to quiet the ast-grep static-analysis nit — zero-risk since the pool only trusts our in-memory CA. Test runs clean 3/3. |
||
|
|
e77f8ae204 |
fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157) The STS GetFederationToken handler was implemented but never reachable. Three routing gaps sent requests to the S3/IAM path instead of STS: - No explicit mux route for Action=GetFederationToken in the URL query - iamMatcher did not exclude GetFederationToken, so authenticated POSTs with Action in the form body were matched and dispatched to IAM - UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity to STS, leaving GetFederationToken to fall through to DoActions and return NotImplemented Add the missing route, the matcher exclusion, and the dispatch branch. Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount into the s3-iam-tests workflow as a new "sts" matrix entry. Before this change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four test functions ran in CI, which is why this regression shipped. * test(iam): make orphaned STS/service-account tests pass under auth-enabled CI Follow-up to wiring STS tests into CI: fixes several pre-existing issues that made the newly-included tests fail locally. Server fixes: - weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when the caller is a legacy S3-config identity (not in the IAM user store). Previously any GetPoliciesForUser error short-circuited to InternalError, which hard-failed every SigV4 caller using keys from -s3.config. - weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs in the sa:<parent>:<uuid> format required by credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed the persistence-layer regex and caused every CreateServiceAccount call to return 500 once a filer-backed credential store validated the ID. Test helpers: - test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from req.URL.Host, and a manual Host header made the signer emit host;host in SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn subtest to match the existing SeaweedFS behavior (user-context assumption). - test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned IAM writes otherwise fall through to the STS fallback and return InvalidAction. CI matrix: - .github/workflows/s3-iam-tests.yml: skip TestServiceAccountLifecycle/use_service_account_credentials only. The rest of the service-account suite passes; that one subtest depends on a separate credential-reload issue where new ABIA keys briefly register into accessKeyIdent but aren't persisted to the filer, so they vanish on the next reload. Out of scope for the #9157 GetFederationToken fix. * fix(credential): accept AWS IAM username chars in service-account IDs Gemini review on #9167 pointed out that ServiceAccountIdPattern's parent-user segment was more restrictive than an AWS IAM username: `[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with `@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail validation at the filer store even though the embedded IAM API happily created them. Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html) and add a table-driven test that locks the expansion in. * address PR review feedback on #9167 All five review items were valid; changes keyed to review bullets: - weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows arbitrary policy-lookup failures. Only credential.ErrUserNotFound is treated leniently (the legacy-config SigV4 path); any other error now returns InternalError so we don't mint tokens with an incomplete policy set. - weed/credential/grpc/grpc_identity.go: GetUser translates gRPC NotFound back to credential.ErrUserNotFound so errors.Is(...) above matches for gRPC-backed stores, not just memory/filer-direct. - weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates the generated saId against credential.ValidateServiceAccountId before returning. Surfaces a client 400 with the offending ID instead of the opaque 500 that used to bubble up from the persistence layer. - weed/s3api/s3api_server_routing_test.go: seed a routing-test identity with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody with aws-sdk-go v4.Signer so the request actually passes AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers with no stsService) instead of just NotEqual(501) — 503 proves the dispatch reached STSHandlers.HandleSTSRequest. - test/s3/iam/s3_service_account_test.go: callIAMAPI signs with service="iam" instead of "s3" (SeaweedFS verifies against whichever service the client signed with, but "iam" is semantically correct). - weed/credential/validation_test.go: add positive rows for an uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000). |
||
|
|
86c5e815d2 |
fix(kafka): make consumer-group rebalancing work end-to-end (#9143)
* fix(kafka): make consumer-group rebalancing work end-to-end
TestConsumerGroups was failing every run since the job was added
(2026-04-17) but the failures were masked by a `|| echo ...` trailer on
the go test invocation, so the CI reported green. Removing the mask
exposes several real bugs in the gateway's group-coordinator code:
1. JoinGroup deduplicated members by ClientID, which collapsed two
Sarama consumers that share the default ClientID ("sarama") into a
single member slot and broke rebalancing. Key dedup off the TCP
ConnectionID instead; keep ClientID on the member for DescribeGroup
fidelity.
2. Every JoinGroup replaced the *GroupMember struct, wiping the
Assignment the leader had just published in its SyncGroup and leaving
non-leader consumers with 0 partitions after a rebalance. Update the
existing member in place on rejoin.
3. Non-leader SyncGroup returned an empty assignment while the leader
was mid-rebalance, so consumers silently came up with no partitions.
Return REBALANCE_IN_PROGRESS when the group is not Stable so Sarama
retries the join/sync cycle (4 retries x 2s backoff by default).
4. Heartbeat returned ILLEGAL_GENERATION on a gen mismatch even when
the group was in PreparingRebalance/CompletingRebalance. Return
REBALANCE_IN_PROGRESS in that case so the heartbeat loop cleanly
cancels the session instead of tearing it down on a fatal error.
5. LeaveGroup parser only handled v0-v2. Sarama at V2_8_0_0 sends v3
(Members array) by default, so the gateway silently rejected the
request as InvalidGroupID and dead consumers stayed in the group as
phantom leaders. Added v3 (Members array) and v4+ (flexible/compact/
tagged-fields) parsing.
The rebalancing integration tests called Consume() once per consumer,
which cannot survive a rebalance (heartbeat RBIP cancels the session
and Consume() returns - this is documented Sarama behaviour; callers
are expected to loop). Added a runConsumeLoop helper and used it in the
four affected sub-tests. RebalanceTestHandler.Setup now overwrites
stale entries in its assignments channel so the test observes the
settled post-rebalance snapshot rather than whatever arrived first.
* fix(kafka): address PR review feedback
- JoinGroup now snapshots existing members before mutating and restores
the snapshot on INCONSISTENT_GROUP_PROTOCOL rollback. Previously the
rollback path always deleted the entry, corrupting group state when
an existing member rejoined with an incompatible protocol.
- handleLeaveGroup iterates request.Members instead of processing only
the first entry, so v3+ batch departures (KIP-345 style) correctly
remove every listed member and build a per-member response. A single
group-state transition runs after the loop, with leader election
only triggered if the actual group leader was among the departures.
- Added buildLeaveGroupFlexibleResponse for v4+ clients. The parser
already decoded flexible versions, but the response still went out in
non-flexible encoding (4-byte array lengths, 2-byte strings, no
tagged fields), which v4+ clients could not parse. Route flexible
versions through the new builder; v1-v3 keep buildLeaveGroupFullResponse.
- BasicFunctionality gives each consumer its own
ConsumerGroupHandler/ready channel. The previous shared handler
closed ready once, so readyCount advanced to numConsumers from a
single signal; the test could proceed without the other consumers
actually reaching Setup.
- RebalanceTestHandler.assignments is now a size-1 channel, so readers
always observe the most recent rebalance snapshot instead of an
intermediate one from an earlier round.
|
||
|
|
8857dbfb74 |
fix(test): drop host port mapping from risingwave catalog test to kill TOCTOU flake
The random host port allocated by MustFreeMiniPorts was released before docker run bound it, occasionally losing the race to another process and failing with "address already in use". The sidecar already reaches RisingWave via shared netns (--network container:...), so the host -p mapping and the corresponding WaitForPort check were unused. |
||
|
|
a8ba9d106e |
peer chunk sharing 7/8: tryPeerRead read-path hook (#9136)
* mount: batched announcer + pooled peer conns for mount-to-mount RPCs * peer_announcer.go: non-blocking EnqueueAnnounce + ticker flush that groups fids by HRW owner, fans out one ChunkAnnounce per owner in parallel. announcedAt is pruned at 2× TTL so it stays bounded. * peer_dialer.go: PeerConnPool caches one grpc.ClientConn per peer address; the announcer and (next PR) the fetcher share it so steady-state owner RPCs skip the handshake cost entirely. Bounded at 4096 cached entries; shutdown conns are transparently replaced. * WFS starts both alongside the gRPC server; stops them on unmount. * mount: wire tryPeerRead via FetchChunk streaming gRPC Replaces the HTTP GET byte-transfer path with a gRPC server-stream FetchChunk call. Same fall-through semantics: any failure drops through to entryChunkGroup.ReadDataAt, so reads never slow below status quo. * peer_fetcher.go: tryPeerRead resolves the offset to a leaf chunk (flattening manifests), asks the HRW owner for holders via ChunkLookup, then opens FetchChunk on each holder in LRU order (PR #5) until one succeeds. Assembled bytes are verified against FileChunk.ETag end-to-end — the peer is still treated as untrusted. Reuses the shared PeerConnPool from PR #6 for all outbound gRPC. * peer_grpc.go: expose SelfAddr() so the fetcher can avoid dialing itself on a self-owned fid. * filehandle_read.go: tryPeerRead slot between tryRDMARead and entryChunkGroup.ReadDataAt. Gated by option.PeerEnabled and the presence of peerGrpcServer (the single identity test). Read ordering with the feature enabled is now: local cache -> RDMA sidecar -> peer mount (gRPC stream) -> volume server One port, one identity, one connection pool — no more HTTP bytecast. * test(fuse_p2p): end-to-end CI test for peer chunk sharing Adds a FUSE-backed integration test that proves mount B can satisfy a read from mount A's chunk cache instead of the volume tier. Layout (modelled on test/fuse_dlm): test/fuse_p2p/framework_test.go — cluster harness (1 master, 1 volume, 1 filer, N mounts, all with -peer.enable) test/fuse_p2p/peer_chunk_sharing_test.go — writer-reader scenario The test (TestPeerChunkSharing_ReadersPullFromPeerCache): 1. Starts 3 mounts. Three is the sweet spot: with 2 mounts, HRW owner of a chunk is self ~50 % of the time (peer path short-circuits); with 3+ it drops to ≤ 1/3, so a multi-chunk file almost certainly exercises the remote-owner fan-out. 2. Mount 0 writes a ~8 MiB file, then reads it back through its own FUSE to warm its chunk cache. 3. Waits for seed convergence (one full MountList refresh) plus an announcer flush cycle, so chunk-holder entries have reached each HRW owner. 4. Mount 1 reads the same file. 5. Verifies byte-for-byte equality AND greps mount 1's log for "peer read successful" — content matching alone is not proof (the volume fallback would also succeed), so the log marker is what distinguishes p2p from fallback. Workflow .github/workflows/fuse-p2p-integration.yml triggers on any change to mount/filer peer code, the p2p protos, or the test itself. Failure artifacts (server + mount logs) are uploaded for 3 days. Mounts run with -v=4 so the tryPeerRead success/failure glog messages land in the log file the test greps. |
||
|
|
6787a4b4e8 |
fix kafka gateway and consumer group e2e flakes (#9129)
* fix(test): reduce kafka gateway and consumer group flakes * fix(kafka): make broker health-check backoff respect context Replace time.Sleep in the retry loop with a select on bc.ctx.Done() and time.After so the backoff is interruptible during shutdown, per review feedback on PR #9129. * fix(kafka): guard broker HealthCheck against nil client Return the same "broker client not connected" error used by the other exported BrokerClient methods instead of panicking on a partially initialized client, per CodeRabbit review feedback on PR #9129. |
||
|
|
32cbed9658 |
fix(fuse-tests): avoid ephemeral port reuse racing weed mini bind
freePort allocated in [20000, 55535], which overlaps the Linux ephemeral range (32768-60999). The kernel could reuse the chosen port for an outbound connection between the test closing its listener and weed mini re-checking availability, causing: Port allocation failed: port N for Filer (specified by flag filer.port) is not available on 0.0.0.0 and cannot be used Narrow the range to [20000, 32000] to stay below the ephemeral floor, and pass -ip.bind=127.0.0.1 so mini's pre-check runs on the same IP the test actually reserved the port on. |
||
|
|
f720f559cb |
ci(kafka-loadtest): switch off Ubuntu/Debian base images to avoid apt mirror flakes (#9119)
* ci(kafka-loadtest): retry apt-get to survive Ubuntu mirror flakes
The Kafka Quick Test workflow's Docker build of Dockerfile.loadtest
keeps hitting "Connection failed [IP: ...]" on archive.ubuntu.com /
security.ubuntu.com mid-build, e.g.:
failed to solve: process "/bin/sh -c apt-get update && \
apt-get install -y ca-certificates curl jq bash netcat ..."
did not complete successfully: exit code: 100
Same class of failure PR #9106 fixed for pjdfstest. Apply the same
two retry knobs to Dockerfile.loadtest and Dockerfile.seektest so a
transient mirror flake retries five times with a 30s timeout instead
of failing the whole workflow.
(The fuller pjdfstest fix also restructured the build to use
docker/build-push-action with type=gha cache. Doing that for
kafka-client-loadtest would mean rewriting the make/docker-compose
build path; defer until the apt-retry alone proves insufficient.)
* ci(kafka-loadtest): also drop recommends/suggests + apt-get clean
Address PR review (gemini-code-assist): fully align with PR #9106's
pattern by adding --no-install-recommends / --no-install-suggests so
the runtime images stay small and don't pull in extra packages, plus
apt-get clean before rm -rf /var/lib/apt/lists/* in Dockerfile.seektest.
* ci(kafka-loadtest): use alpine / maven base images instead of apt
The previous rounds of apt-retry / apt-clean knobs aren't enough:
the Ubuntu mirror is persistently unreachable from the GitHub runner
for minutes at a time, which blows past the 5-retry / 30-second
Acquire configuration and still kills the build (see run 24551809614).
Switch both runtime images so no apt fetch is needed at all:
- Dockerfile.loadtest now runs on alpine:3.20. All runtime deps
(ca-certificates, curl, jq, bash, netcat-openbsd) are in the
Alpine main repo, fetched from Alpine's CDN rather than the
Ubuntu archive that keeps going dark.
- Dockerfile.seektest now uses maven:3.9-eclipse-temurin-11, which
ships JDK 11 and Maven preinstalled — no apt-get maven step.
This also means the runtime images no longer care about
Acquire::Retries / DEBIAN_FRONTEND / apt-get clean, so those lines
are removed with the apt call they were configuring.
|