146 Commits

Author SHA1 Message Date
Chris Lu
285025eb73 s3api: support group inline policies + Condition enforcement (#9569)
* test(s3api): cover IAM inline policy aws:SourceIp + group inline gap

Unit tests under weed/s3api/ drive PutUserPolicy / PutGroupPolicy → reload
→ VerifyActionPermission with a synthetic 127.0.0.1 request and assert that
the policy's IpAddress condition flips the outcome.

The user-policy cases pass on master (hydrateRuntimePolicies already routes
inline docs through the policy engine, so Condition blocks are honored end-
to-end). The group-policy case fails: PutGroupPolicy still returns
NotImplemented, so a group inline doc never lands in the engine.

Integration counterparts live under test/s3/iam/ and exercise the same
paths against a live SeaweedFS S3+IAM endpoint.

* s3api: support group inline policies + Condition enforcement

PutGroupPolicy/GetGroupPolicy/DeleteGroupPolicy/ListGroupPolicies used to
return NotImplemented in embedded IAM mode, so anything attached to a
group as an inline doc — including aws:SourceIp or any other Condition —
was simply unreachable.

Wire the four endpoints to the credential-store methods that were
already in place (memory, postgres, filer_etc all implement
GroupInlinePolicyStore). On every config reload, hydrateRuntimePolicies
now also walks LoadGroupInlinePolicies, registers each doc in the IAM
policy engine under __inline_group_policy__/<group>/<policy>, and
appends that key to Group.PolicyNames so evaluateIAMPolicies picks it up
through its existing group walk. PutGroupPolicy/DeleteGroupPolicy are
added to the ReloadConfiguration trigger list in DoActions.

Side fix: MemoryStore.LoadConfiguration now surfaces store.groups too.
Without it iam.groups never repopulated on a memory-store reload, so
group policy evaluation silently no-op'd whether the policy was inline
or attached. The existing tests didn't notice because no test reloaded
through cm after creating a group.

The NotImplemented unit test is inverted to drive the new round-trip.

* s3api: drop redundant refreshIAMConfiguration from Put/DeleteGroupPolicy

DoActions already triggers ReloadConfiguration for both actions via the
explicit reload list, so calling refreshIAMConfiguration inline runs the
load twice per request. Per PR review.

* s3api: scope group-policy resource names per test; tighten deny polling

- Integration test resource names get a per-test suffix so retried or
  parallel CI jobs don't trip EntityAlreadyExists / BucketAlreadyExists.
- Deny-path Eventually loops gate on AccessDenied via a typed helper
  rather than any non-nil error; transient setup errors no longer end
  the wait prematurely.
- ListGroupPolicies returns ServiceFailure when the credential manager
  is nil, matching Put/Get/DeleteGroupPolicy.

* test(s3 iam): cover both IPv4 and IPv6 loopback in allow CIDRs

CI runners with happy-eyeballs resolve `localhost` to ::1 first, in
which case a 127.0.0.0/8-only allow would silently never match and the
deny-driven enforcement test would hang for the allow case. Add ::1/128
to every loopback-matching policy so the allow path works regardless of
which loopback family the SDK lands on.
2026-05-19 16:03:45 -07:00
Chris Lu
f72983c1fd fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table" (#9566)
* fix(s3): stop S3 Tables routes from swallowing buckets named "buckets" or "get-table"

The S3 Tables REST endpoints share top-level paths with the regular S3
API (/buckets for ListTableBuckets/CreateTableBucket, /get-table for
GetTable). They are registered first on the same router as the bucket
subrouter, so a path-style request such as GET /buckets?list-type=2 on
a bucket actually named "buckets" matched ListTableBuckets and returned
JSON. AWS SDK V2 (and Hadoop s3a / Spark) then failed XML parsing with
"Unexpected character '{' (code 123) in prolog".

Disambiguate by requiring the AWS V4 credential scope to name the
s3tables service on the colliding routes. Regular S3 SDKs sign with
service=s3, S3 Tables SDKs sign with service=s3tables, and the scope is
present in both the Authorization header and the X-Amz-Credential query
parameter for presigned URLs, so the matcher works for both flavors.

ARN-bearing S3 Tables routes (/buckets/<arn>, /namespaces/<arn>, etc.)
already cannot collide because colons are not valid in bucket names, so
they are left untouched.

* fix(s3): accept AWS JSON RPC content type as S3 Tables intent signal

The Iceberg catalog integration tests send unsigned PUT /buckets with
Content-Type: application/x-amz-json-1.1 to create table buckets. With
only the credential-scope check, those requests fell through to the
regular S3 CreateBucket handler and the suite went red on this branch.

Extend the matcher so a request is recognized as S3 Tables when either:

  - its AWS V4 credential scope names SERVICE=s3tables; or
  - it carries the canonical AWS JSON RPC 1.1 content type and is
    unsigned (a request explicitly signed for SERVICE=s3 still wins).

The regular S3 SDKs do not send application/x-amz-json-1.1, so the
signal is safe for the colliding paths (/buckets, /get-table).

Also add an AWS SDK V2 for Go integration test under
test/s3/sdk_v2_routing/ that drives the SDK's own XML deserializer
against a bucket literally named "buckets" and "get-table" — the SDK
errors before the test asserts if the server returns the wrong body
shape. Wired up via .github/workflows/s3-sdk-v2-routing-tests.yml,
mirroring the etag/acl workflow.

* s3api: extend service matcher to all S3 Tables routes; simplify scope check

- Apply serviceMatcher to every S3 Tables route, not just the bare-path
  ones. ARN-bearing paths could otherwise be hit by an S3 object key
  that starts with arn:aws:s3tables:..., inside a bucket named
  "buckets", "namespaces", "tables", or "tag". One matcher everywhere
  closes both collision classes.
- Replace strings.Split + index lookup with strings.Contains for the
  credential-scope check. The scope shape is fixed at
  AK/DATE/REGION/SERVICE/aws4_request, slashes only delimit components,
  and access keys are alphanumeric — so /s3tables/ matches iff SERVICE
  is exactly s3tables. Existing unit cases (including the
  access-key-substring case) still pass.
- Read the GetObject body in the SDK v2 routing test with io.ReadAll;
  the single Read could return short and make the equality check flaky.

* s3api: drop content-type fallback; sign s3 tables harness traffic instead

The content-type fallback in isS3TablesSignedRequest let an anonymous
regular-S3 request whose body type is application/x-amz-json-1.1 hit
an S3 Tables route when the path-style object key happened to be
shaped like an S3 Tables ARN (e.g. PutObject on bucket "buckets"
with key arn:aws:s3tables:.../bucket/foo/policy). Narrow the matcher
back to the AWS V4 credential scope so only requests signed for
SERVICE=s3tables match the S3 Tables routes.

Update the Iceberg catalog test harness — the only caller still
sending unsigned PUT /buckets — to sign with SERVICE=s3tables. The
mini instance runs in default-allow mode, so the signature itself is
not verified; only the credential scope matters for the route match.

Drop the stale unit cases for the JSON-RPC content-type signal and
the routing test that exercised unsigned harness traffic.
2026-05-19 14:24:25 -07:00
Chris Lu
b4289abb0a admin: convert filer address to gRPC form before dispatch (#9523)
The master returns each registered filer in pb.ServerAddress dual-port
form (host:httpPort.grpcPort, e.g. 10.0.0.1:8888.18888). The admin's
plugin context builder forwarded that string verbatim as
filer_grpc_address, so workers calling grpc.DialContext on it failed
every job in ~3ms with "dial tcp: lookup tcp/8888.18888: unknown port".

Run each entry through pb.ServerAddress.ToGrpcAddress before populating
ClusterContext.FilerGrpcAddresses.

The lifecycle integration test now pins filer.port.grpc to a value that
breaks the FILER_PORT+10000 assumption, and a new dispatch test drives
the admin's /api/plugin/job-types/s3_lifecycle/run path end-to-end and
asserts the dispatched job both reaches the filer and deletes the
backdated object.
2026-05-17 11:33:54 -07:00
Chris Lu
b1d59b04a8 fix(s3/lifecycle): walker dispatch uses entry.Path for ABORT_MPU (#9477)
* fix(s3/lifecycle): WalkerDispatcher uses entry.Path for ABORT_MPU + shell announces load

Two CI-surfaced bugs caught by PR #9471's S3 Lifecycle Tests run on
master after PRs #9475 + #9466:

1. Walker dispatch for ABORT_MPU was sending entry.DestKey as
   req.ObjectPath. The server's ABORT_MPU handler
   (weed/s3api/s3api_internal_lifecycle.go) strips the .uploads/
   prefix to extract the upload id and reads the init record from
   that directory, so it expects the .uploads/<id> path verbatim.
   DestKey looks like a regular object path; the server's prefix
   check fails and the dispatch returns BLOCKED with
   "FATAL_EVENT_ERROR: ABORT_MPU object_path missing .uploads/
   prefix". The test fix renames TestWalkerDispatcher_MPUInitUsesDestKey
   to ...UsesUploadsPath and inverts the assertion to match the
   actual server contract.

   DestKey is still used for the WalkBuckets shard predicate and
   for rule-prefix matching in bootstrap.walker; both surfaces want
   the user's intended path, while DISPATCH wants the .uploads/<id>
   directory. The bootstrap test
   (TestLifecycleAbortIncompleteMultipartUpload) caught this when
   the walker's BLOCKED error surfaced as FATAL output.

2. test/s3/lifecycle/s3_lifecycle_empty_bucket_test.go asserts the
   shell command logs "loaded lifecycle for N bucket(s)" so a
   regression that produces half-shaped output (no load summary)
   is caught. The restored shell command (PR #9475) didn't print
   that line; add it back on the first pass that finds non-zero
   inputs.

* fix(s3/lifecycle): walker fires for walker-only buckets (empty replay path)

runShard's empty-replay sentinel (rsh == [32]byte{}) was returning
BEFORE the steady-state walker check. A bucket whose only lifecycle
rule was walker-only (ExpirationDate / ExpiredDeleteMarker /
NewerNoncurrent) would never have it dispatched because:

  - ReplayContentHash only hashes replay-eligible kinds, so
    walker-only-only snapshots produce rsh == empty.
  - The early-return persisted the empty cursor and exited before
    the steady-state walker block at the bottom of the function.

Move the walker invocation INTO the empty-replay branch so walker-
only rules dispatch on the same path as mixed-rule buckets.

TestLifecycleExpirationDateInThePast and
TestLifecycleExpiredDeleteMarkerCleanup were both timing out their
"object must be deleted" Eventually polls because of this. Caught
on PR #9471's S3 Lifecycle Tests run after PR #9475 restored the
shell entry point that exercises the integration tests.

* fix(s3/lifecycle): cold-start walker covers pre-existing objects

runShard only walked the bucket tree on the recovery branch (found
&& hash mismatch). For a fresh worker with no persisted cursor,
found=false, so the recovery walker never fired and the meta-log
replay only scanned runNow - maxTTL of events. Objects PUT before
that window — including pre-existing objects in a newly-rule-enabled
bucket — never matched the rule.

The streaming worker handled this with scheduler.BucketBootstrapper.
Daily-replay needed the equivalent: walk the live tree once on the
first run for each shard so pre-existing objects get evaluated even
when their PUT events are outside meta-log scan window.

Restructured the recovery branch to fire the walker on either
(found && mismatch) OR !found. On cold-start the cursor isn't
rewound — we keep TsNs=0 and let the drain below floor to
runNow - maxTTL like before; the walker just handles whatever the
sliding window can't reach.

TestLifecycleBootstrapWalkOnExistingObjects was the exact CI failure
this addresses (https://github.com/seaweedfs/seaweedfs/actions/runs/25777823522/job/75714014151).

* fix(s3/lifecycle): restore walker tag and null-version state

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): parallelize shell shard sweeps

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(s3/lifecycle): bound each runPass ctx + refresh in runLifecycleShard

Two CI bugs surfaced after PR #9466 deleted the streaming worker:

1. The shell command's -refresh loop never fires. runPass used the
   outer ctx (full -runtime), so dailyrun.Run blocked for the entire
   1800s s3tests window — the background worker only ran one pass
   and never re-loaded configs that tests created mid-run.
   test_lifecycle_expiration sees 6 objects when expecting 4 because
   expire1/* never reaches the worker's snapshot. Cap each pass to
   cadence+5s when cadence>0; one-shot (cadence=0) keeps the full ctx.

2. TestLifecycleExpiredDeleteMarkerCleanup's docstring says
   "pass 1 cleans v1; pass 2 removes the now-orphaned marker," but
   runLifecycleShard invoked with no -refresh — only one pass ran.
   The marker rule can't fire in the same pass that dispatches v1's
   delete because v1 is still in .versions/. Add -refresh 1s so the
   10s runtime gets multiple passes.

* fix(s3/lifecycle): persist cursor with fresh ctx after passCtx timeout

drainShardEvents only exits via ctx cancellation for an idle subscription
— that's the steady-state when all replayed events are already past.
Saving the cursor with the canceled passCtx silently drops every
advance, so the next pass re-subscribes from the same floor and
re-replays the same events. Symptom in s3tests: status=error shards=16
errors=16 on every pass, and 1/6 expire3/* dispatches lost to a race
between concurrent shard drains all retrying the same events.

Use a 5s timeout derived from context.Background for the save, and
treat passCtx Deadline/Canceled from drain as a clean end-of-pass —
not a shard-level error to log.

* fix(s3/lifecycle): trust persisted cursor; never bump past pending events

The drain freezes cursorAdvanceTo at the last pre-skip event so pending
matches (DueTime > runNow) re-enter the subscription next pass. Combined
with the new cursor persistence, the floor bump (runNow - maxTTL) then
orphans the very events the drain stopped at.

Concrete: a rule with TTL == maxTTL fires at runNow == PUT_TIME +
maxTTL, so floor (= runNow - maxTTL) lands exactly on PUT_TIME. If the
last advance saved a cursor right before the not-yet-due PUT (e.g.,
keep2/* between expire1/* and expire3/* on the same shard), the floor
bump on pass 9 skips past the expire3 event itself — the worker never
re-reads it. Test symptom: expire3/* never expires when worker shards
include other earlier no-match events.

Cold start (found=false) still subscribes from runNow - maxTTL. Steady
state honors the cursor verbatim.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-13 00:19:05 -07:00
Chris Lu
69da20bdae volume: gate FetchAndWriteNeedle behind admin auth and refuse internal endpoints (#9441)
volume: require admin auth and refuse loopback endpoints in FetchAndWriteNeedle

Gate the RPC behind checkGrpcAdminAuth for parity with the rest of the
destructive volume-server RPCs, and reject cluster-internal remote S3
endpoints (loopback / link-local / IMDS / RFC 1918 / CGNAT) before
dialing. Pin the validated address against DNS rebinding by routing the
AWS SDK through an HTTP transport whose DialContext re-resolves the host
and re-applies the deny list on every dial, so an endpoint that resolves
to a public IP at validate-time and then flips to 127.0.0.1 at connect
time is refused. Operators that legitimately fetch from private hosts
can opt out with -volume.allowUntrustedRemoteEndpoints.
2026-05-12 10:11:20 -07:00
Chris Lu
46bb70d93e feat(s3): stamp noncurrent_since on versioned demotions (#9431)
* feat(s3): stamp noncurrent_since on versioned demotions

A version's noncurrent TTL clock starts when the next version is
written, not at its own mtime. Today the lifecycle engine derives
that moment from the next-newer sibling's mtime — a heuristic that
drifts if the sibling is later modified and is unavailable when
the demoting event sits outside meta-log retention.

Stamp Seaweed-X-Amz-Noncurrent-Since-Ns on the demoted entry at
the two places where a PUT flips the latest pointer:
updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning. Timestamp source is
time.Now().UnixNano() captured once per demotion — the documented
Phase 1 fallback until the filer write API surfaces its own TsNs.

Engine reads the stamp on both the bootstrap walker path and the
event-driven router; missing/zero falls back to the legacy
sibling-mtime derivation, so pre-stamp entries keep working.

Prerequisite for the daily-replay lifecycle worker (Phase 2+).

* fix(s3): address CI failure and PR review feedback

- Backdating tests must move both clocks: the lifecycle integration
  tests backdate version mtimes to simulate aging, but my earlier
  commit made the engine prefer the explicit demotion stamp over
  sibling mtime, so a real-now stamp dominated a backdated mtime and
  the rule never fired. Update backdateVersionedMtime to also rewrite
  Seaweed-X-Amz-Noncurrent-Since-Ns when the entry already carries it.
  This is a test simplification — production stamps record when the
  successor was written, not the demoted version's own mtime — but the
  resulting clock is correctly old enough.

- Refactor stamp parsing into one shared helper. Per gemini-code-assist:
  the parsing logic for ExtNoncurrentSinceNsKey was duplicated in
  router/router.go and scheduler/bootstrap.go. Move it to a new
  weed/s3api/s3lifecycle/noncurrent_since.go as exported
  SuccessorFromEntryStamp; both call sites now go through it.

- Make the parser ordering test deterministic. Per coderabbitai:
  time.Now().UnixNano() drops the monotonic clock component, so
  two back-to-back calls can decrease if the wall clock steps
  backward — the prior test was exercising OS clock behavior rather
  than the parser. Replace with fixed nanosecond values.

- Close a suspended-versioning race. Per coderabbitai: the prior
  putSuspendedVersioningObject called updateIsLatestFlagsForSuspendedVersioning
  after putToFiler returned, i.e. after the object write lock released.
  A concurrent PUT could promote a newer latest version, which we'd
  then wipe — leaving the older "null" object incorrectly current.
  Move the cleanup into the afterCreate callback so the null write and
  the .versions pointer clear (including the new demotion stamp) run
  atomically under the same lock. Best-effort logging is preserved.

* fix(s3/lifecycle): clear noncurrent_since stamp on test backdate

Backdating a version's mtime in tests is not a coherent claim about
when it became noncurrent — production stamps record the successor's
PUT time, which the test doesn't manipulate. The prior commit rewrote
the stamp to the backdated instant, but for TestLifecycleNewerNoncurrent
that creates an inconsistent state: v3's stamp says "demoted 30 days
ago" while v4's mtime (the supposed demoter) is real-now. With both
NewerNoncurrentVersions and NoncurrentDays in the same rule, the
NoncurrentDays floor passes against the backdated stamp and the
rank-based check then deletes v3 via the meta-log historical replay
that misranks against current state.

Clearing the stamp instead lets the lifecycle engine fall back to the
sibling-mtime derivation the tests were originally written against:
the legacy code path is preserved end-to-end while the new explicit-
stamp path is exercised by the unit tests in s3lifecycle/noncurrent_since_test.go
and the bootstrap-walker integration in scheduler/bootstrap_test.go.

The deeper interaction — historical meta-log replay ranking against
current state inside routePointerTransitionExpand — is pre-existing
and is no longer masked by the freshly-PUT successor's mtime once the
stamp is read. Tracked separately; not blocking this PR.

* fix(s3): stamp noncurrent_since before the .versions/ pointer flip

The pointer-flip on the .versions/ directory emits a meta-log event that
the lifecycle router consumes via routePointerTransition. The router
then calls LookupVersion on the demoted version's id. With the prior
ordering — pointer flip first, stamp second — the router could read
the demoted entry before markVersionNoncurrent landed and fall back to
the legacy sibling-mtime derivation.

Versioned COPY is the clean break: the new latest version keeps the
source object's mtime instead of recording the moment v_old was
demoted, so the fallback's successor clock can be arbitrarily wrong.
Reorder both updateLatestVersionInDirectory and
updateIsLatestFlagsForSuspendedVersioning so the stamp is written
first; the pointer flip then emits an event into a state where the
stamp is already present.

Failure of the stamp write remains non-fatal — lifecycle still falls
back to the legacy derivation in that case, with the same caveats as
before the PR but no race window.
2026-05-11 13:41:33 -07:00
Chris Lu
c7b01c72b2 test(s3/lifecycle): integration coverage for versioning + filters (#9415)
* test(s3/lifecycle): integration coverage for versioning + filters

First integration-test bundle building on the existing single-test
backdating harness. Each scenario follows the same shape: create
bucket, set lifecycle, PUT object, backdate mtime via filer
UpdateEntry, run the shell command for one shard sweep, assert
S3-side state.

Five new tests:

- TestLifecycleVersionedBucketCreatesDeleteMarker: Expiration on a
  versioned bucket must produce a delete marker (latest after worker
  runs is a marker) AND keep the original version directly addressable
  by versionId. ListObjectVersions confirms IsLatest=true on the
  marker.

- TestLifecycleNoncurrentVersionExpiration: NoncurrentVersionExpiration
  fires only on demoted versions. PUT v1, PUT v2 (so v1 → noncurrent),
  backdate v1, run worker. v1 must be gone, v2 still current.

- TestLifecycleExpiredDeleteMarkerCleanup: combined rule (noncurrent +
  expired-delete-marker) cleans up a sole-survivor marker. PUT v1,
  DELETE (creates marker), backdate both, run worker. Every version
  AND marker must be gone for the key.

- TestLifecycleDisabledRuleSkipsObject: rule with Status=Disabled
  must not produce dispatches even on a backdated match. Negative
  test for the engine's enabled-status gate.

- TestLifecycleTagFilter: rule with And{Prefix, Tag} only matches
  objects carrying the tag. Two backdated objects (one tagged, one
  not) — only the tagged one is removed.

Helpers extracted to keep each test focused: putVersioningEnabled,
putNoncurrentExpirationLifecycle, putExpiredDeleteMarkerLifecycle,
backdateVersionedMtime (ages a specific .versions/v_<id> entry),
runLifecycleShard (one-shot shell invocation with FATAL guard).

* test(s3/lifecycle): tighten noncurrent expiration diagnostics

Local run showed TestLifecycleNoncurrentVersionExpiration failing
with a bare 404 on HEAD(latest), not enough to tell whether v2 was
deleted, the bare-key pointer was removed, or a delete marker was
synthesized. Strengthen the test to:

- HEAD by versionId=v2 first, so we pin "v2 file still on disk"
  separately from "the latest pointer resolves to v2"
- on HEAD(latest) failure, log ListObjectVersions output (versions +
  markers, with IsLatest) so the next failure shows which side the
  bug is on rather than just NotFound

* test(s3/lifecycle): integration coverage for AbortIncompleteMultipartUpload

Exercises the lifecycleAbortMPU handler path that the prefix-based
expiration tests can't reach — routing keys off of .uploads/<id>/
directory events, not regular object events, and the dispatcher uses
a different RPC path (rm on the .uploads/<id>/ folder).

Setup: AbortIncompleteMultipartUpload rule with DaysAfterInitiation=1,
CreateMultipartUpload, UploadPart (so the directory carries the
right shape), backdate the .uploads/<uploadID>/ directory entry 30
days, run the worker. The upload must drop out of
ListMultipartUploads.

Helpers added: putAbortMPULifecycle, backdateUploadDir.

* test(s3/lifecycle): integration coverage for NewerNoncurrentVersions

NewerNoncurrentVersions=N keeps the N most recent noncurrent versions
and expires the rest. Distinct from per-version NoncurrentDays —
depends on per-version rank, not just per-version age — and routes
through routePointerTransition's "needs full expansion" path.

Setup: PUT v1, v2, v3, v4 on a versioned bucket (v4 current; v1-v3
noncurrent), backdate v1+v2+v3 so all satisfy the NoncurrentDays>=1
floor, run the worker. Expect v1+v2 expired (older noncurrent),
v3 (newest noncurrent within keep=1) and v4 (current) preserved.

Helper added: putNewerNoncurrentLifecycle.

* test(s3/lifecycle): integration coverage for suspended-versioning Expiration

Suspended versioning takes a distinct code path in lifecycleDispatch:
the VersioningSuspended branch first deletes the null version (via
deleteSpecificObjectVersion(versionId="null")) and then writes a
fresh delete marker on top. Other branches (Enabled → only writes a
marker; Off → straight rm) miss this two-step.

Setup: enable versioning, PUT v1 (real versionId), suspend
versioning, PUT again (creates the null version, demotes v1 to
noncurrent), set the Expiration rule, backdate the null at the
bare path. Expect: latest is now a fresh delete marker, the
"null" version is gone from ListObjectVersions, and v1 (noncurrent
under Enabled) still addressable directly — suspended Expiration
must only touch the null, not other versions.

Helper added: putVersioningSuspended.

* test(s3/lifecycle): integration coverage for multi-bucket sweep

A single shell-driven shard sweep must process every bucket carrying
lifecycle config, not just the first one alphabetically. Pinned
because the scheduler iterates the buckets directory and a regression
that returns early after the first match would silently disable
lifecycle for every later bucket.

Two buckets, each with their own prefix-expiration rule and a
backdated object. Both must be expired after the same sweep.

* test(s3/lifecycle): integration coverage for ObjectSizeGreaterThan filter

ObjectSizeGreaterThan is a strict > gate (filterAllows uses
ev.Size <= rule.FilterSizeGreaterThan to reject). Pinned at the
boundary: an object whose size equals the threshold must remain;
only an object strictly larger expires. Catches a > vs >= flip.

Two backdated objects on the same prefix, sizes 100 and 150 with
threshold=100 — boundary survives, larger expires.

* test(s3/lifecycle): scrub bucket lifecycle config + versions on cleanup

Tests share one weed mini server. Two pollution modes were producing
order-dependent failures:

- A later test's shard sweep would still load the prior test's
  lifecycle config (the worker reads every bucket's XML from filer
  state, and DeleteBucket alone doesn't drop lifecycle config
  cleanly on this codebase).
- Versioned-bucket tests left versions + delete markers behind that
  ListObjectsV2 can't see, so the existing best-effort empty-then-
  delete didn't actually empty those buckets.
- The AbortMPU test intentionally leaves an in-flight upload; without
  an explicit AbortMultipartUpload the bucket DELETE hits NotEmpty.

Cleanup now runs DeleteBucketLifecycle, ListObjectVersions →
DeleteObject(versionId), ListObjectsV2 → DeleteObject (catches what
ListObjectVersions missed), ListMultipartUploads → AbortMultipartUpload,
then DeleteBucket. Best-effort throughout so a half-torn-down bucket
doesn't fail the cleanup chain.

* test(s3/lifecycle): backdate both versions for NoncurrentDays clock

Per codex review: NoncurrentDays is clocked from the SUCCESSOR
version's mtime (when the displaced version became noncurrent), not
from the displaced version's own mtime. Backdating only v1 left the
clock (v2's mtime) at "now" and the rule never fired — the test was
wrong, not the production path.

Backdate v1=31d and v2=30d so v1 sits past the 1-day threshold
relative to v2, the noncurrent rule fires, and v2 stays current.

* test(s3/lifecycle): assert specific NotFound on multi-bucket deletion

Per codex review: TestLifecycleMultipleBucketsInOneSweep treated any
HeadObject error as "deleted", which lets a transport failure or
dead endpoint mask a real bug. Recognize NoSuchKey/NotFound/HTTP-404
specifically via a small isS3NotFound helper so the assertion
actually proves deletion happened, not just that the call broke.

* test(s3/lifecycle): gofmt size-filter test

* test(s3/lifecycle): integration coverage for Object Lock skip

Object Lock retention must override the lifecycle rule. The handler's
enforceObjectLockProtections check (s3api_internal_lifecycle.go:47)
returns an error when retention is active; the dispatcher then
classifies the outcome as SKIPPED_OBJECT_LOCK and the object stays.
No existing integration test reaches that outcome.

Setup: bucket created with ObjectLockEnabledForBucket=true, expiration
rule on prefix "lock/", two backdated objects under the same prefix —
one with GOVERNANCE retention until 1h from now, one without. After
the worker runs, the unlocked object expires (positive control); the
locked one survives.

Custom cleanup uses BypassGovernanceRetention so the test can drop
the locked version when the test finishes — otherwise the retention
window keeps the bucket from being deleted.

* test(s3/lifecycle): integration coverage for config update between sweeps

An operator changes the lifecycle rule between two shell-driven
sweeps. The second sweep must respect the NEW rule, not a cached
copy of the old one. Each runLifecycleShard invocation spawns a
fresh weed shell subprocess, so cached engine state from a previous
sweep doesn't persist — but a regression that caches rules across
PutBucketLifecycleConfiguration calls within the S3 server itself
would still surface here.

Sweep 1: rule prefix="first/", PUT + backdate firstKey, run worker
→ firstKey expires.

Update rule to prefix="second/", PUT + backdate secondKey AND a
new key under the OLD prefix ("first/post-update.txt"). Sweep 2
must expire only the second-prefix object; the post-update old-
prefix one must survive — config replacement, not merge.

* test(s3/lifecycle): integration coverage for ExpirationDate (past)

Rules with Expiration{Date: <past>} route through ScanAtDate in the
engine (decideMode's ActionKindExpirationDate case) — a separate
compile + dispatch branch from the EventDriven delay-group path the
Days-based tests exercise.

Past date + in-prefix object → must expire. Out-of-prefix object →
must remain. Object also backdated as defense-in-depth so the
assertion doesn't depend on whether the dispatcher consults
MinTriggerAge for date kinds.

* test(s3/lifecycle): integration coverage for bootstrap walk on existing objects

Production scenario: operator enables lifecycle on a bucket that
already holds objects from before the policy. The worker must
discover them via the bootstrap walk (BucketBootstrapper) — there
were no meta-log events to observe because the objects predate the
rule. Without the bootstrap path, only NEW writes would ever match.

Setup: PUT 5 objects (no lifecycle config yet) + 1 out-of-prefix
survivor, backdate all, THEN set the Expiration rule, run the
worker. Every in-prefix pre-existing object must be expired; the
out-of-prefix one must remain.

* test(s3/lifecycle): integration coverage for DeleteBucketLifecycle stops dispatching

Operator UX: after DeleteBucketLifecycle, the worker must observe the
removal on the next sweep and stop expiring objects under the now-gone
rule. A regression that caches old configs across
PutBucketLifecycleConfiguration → DeleteBucketLifecycle would keep
silently dropping objects.

Setup: positive control (rule active, backdated obj expires) →
DeleteBucketLifecycle → PUT + backdate a fresh object → second
sweep. The fresh object must remain.

* test(s3/lifecycle): integration coverage for empty bucket sweep no-op

A bucket carrying lifecycle config but no objects must produce a
successful sweep — no hangs, no errors, no dispatches. Pinned
because the bootstrap walker iterates bucket directories, and an
empty directory is a corner of that traversal that's easy to break
(slice-bounds bug on the first listing returning zero entries).

Asserts: worker logs "loaded lifecycle for" and "shards 0-15
complete", no FATAL output, bucket still exists after the sweep.

* test(s3/lifecycle): fix Object Lock backdate path + skip unwired ScanAtDate

ObjectLock: enabling Object Lock on a bucket implicitly enables
versioning, so PUT objects land at .versions/v_<id>, not at the bare
key. The test was calling backdateMtime (bare path) and failing in
the helper with "filer: no entry is found". Switch to
backdateVersionedMtime with the versionId returned by PutObject.

ExpirationDate: ScanAtDate dispatch path isn't wired to the run-shard
shell command yet — the bootstrap walker explicitly skips actions in
ModeScanAtDate (walker.go:141 says "SCAN_AT_DATE runs its own date-
triggered bootstrap" but no such bootstrap exists in the scheduler or
shell). Skip with a t.Skip + explanation so the test activates the
moment the date-triggered path lands.

* fix(s3/lifecycle): wire ExpirationDate dispatch through bootstrap walker

The walker explicitly skipped ModeScanAtDate actions on the comment
"SCAN_AT_DATE runs its own date-triggered bootstrap" — but no such
bootstrap exists in the scheduler or shell layer. The result: rules
with Expiration{Date: ...} compiled correctly, populated the
snapshot's dateActions map, and were never dispatched.
ExpirationDate is silently a no-op in production.

EvaluateAction already handles ActionKindExpirationDate correctly
(rejects when now.Before(rule.ExpirationDate), otherwise emits
ActionDeleteObject). The walker just needed to fall through instead
of skipping. Pre-date walks become no-ops via EvaluateAction's date
check; post-date walks expire eligible objects.

Un-skip TestLifecycleExpirationDateInThePast — it now exercises the
fixed path end-to-end.

* test(s3/lifecycle): integration coverage for multiple rules per bucket

A single bucket carries two independent Expiration rules with disjoint
prefix filters and different Days thresholds. Each rule must fire
only on its prefix; objects outside both prefixes must survive.

Pinned because Compile builds one CompiledAction per rule per kind
all sharing the same bucket index — a bug that lets one rule's
prefix or threshold leak into another (e.g. last-write-wins on a
shared map) would silently expire wrong objects.

Setup: rule A with prefix=logs/ Days=1, rule B with prefix=tmp/
Days=7. Three backdated objects: logs/access.log, tmp/scratch.bin,
data/keep.bin. After the worker runs, logs/ + tmp/ are gone;
data/ — outside both rule prefixes — survives.

* fix(s3/lifecycle): mark ScanAtDate actions active in Compile

Two layers were silently filtering ScanAtDate actions out of routing:
the walker's mode skip (fixed in e785f59d6) and Compile only marking
ModeEventDriven actions active. MatchPath / MatchOriginalWrite both
require IsActive() to emit a key, so a ScanAtDate action that's never
marked active never reaches a dispatch path even after the walker
falls through.

ScanAtDate's only dispatch path is the bootstrap walk's MatchPath
call — there's no bootstrap-completion rendezvous to wait on. Make
the active flag include ModeScanAtDate alongside the
EventDriven+BootstrapComplete combination.

ExpirationDate-based rules now actually fire end-to-end. The
TestLifecycleExpirationDateInThePast integration test exercises this.

* fix(s3/lifecycle): route date kinds via ComputeDueAt

ExpirationDate has MinTriggerAge=0, so router computed
dueTime = info.ModTime + 0 = info.ModTime. For a backdated entry
that mtime is BEFORE rule.ExpirationDate, so EvaluateAction's
now.Before(rule.ExpirationDate) check returned ActionNone and the
date rule never fired through the event-driven path.

ComputeDueAt already knows the per-kind shape — rule.ExpirationDate
for date kinds, ModTime+Days for the rest — so use it as the
single source of truth for dueTime in Route's main loop.

* test(s3/lifecycle): pin bootstrap walker date dispatch

The original TestWalk_DateActionsSkipped pinned the pre-e785f59d6
behavior that the regular walker skipped ExpirationDate. That
walker was rewired to fire date rules whose date has passed (the
SCAN_AT_DATE bootstrap was never wired); update the test to match.

Split into two: post-date entries dispatch, pre-date entries don't.

* test(s3/lifecycle): drop unused putExpiredDeleteMarkerLifecycle

The helper was never called — TestLifecycleExpiredDeleteMarkerCleanup
constructs a combined noncurrent + expired-marker rule inline, which
the helper doesn't cover. The blank-assignment workaround was just
hiding dead code; remove both.

* test(s3/lifecycle): tighten HeadObject termination check to typed not-found

Generic err != nil also passes on transport/auth/timeouts, letting
the test go green without proving the lifecycle action actually
fired. Switch the three Eventuallyf HeadObject predicates to
isS3NotFound, matching the pattern already in the multi-bucket and
expiration-date tests.

* test(s3/lifecycle): guard ListObjectVersions diagnostic against nil

When ListObjectVersions errors, listOut is nil and the diagnostic
log path panics on listOut.Versions before the real assertion fires.
Branch on (listErr != nil || listOut == nil) so the failure log is
robust whatever ListObjectVersions returned.
2026-05-10 09:30:50 -07:00
Chris Lu
85abf3ca88 feat(shell): s3.lifecycle.run-shard + integration test (#9361)
* feat(shell): s3.lifecycle.run-shard for manual Phase 3 dispatch

Subscribes to the filer meta-log filtered to one (bucket, key-prefix-hash)
shard, routes events through the compiled lifecycle engine, and dispatches
due actions to the S3 server's LifecycleDelete RPC. Persists the per-shard
cursor to /etc/s3/lifecycle/cursors/shard-NN.json so subsequent runs resume.

Operator-runnable harness for end-to-end Phase 3 validation while the
plugin-worker auto-scheduler is still pending. EventBudget bounds a single
invocation; flags expose dispatch + checkpoint cadence.

Discovers buckets by walking the configured DirBuckets path and reading
each bucket entry's Extended[s3-bucket-lifecycle-configuration-xml]
through lifecycle_xml.ParseCanonical. All compiled actions are seeded
BootstrapComplete=true so the run dispatches whatever fires immediately;
production bootstrap walks set this incrementally per bucket.

* test(s3/lifecycle): integration test driving the run-shard shell command

Spins up 'weed mini', creates a bucket with a 1-day expiration on a prefix,
PUTs the target object, then rewrites the entry's Mtime via filer
UpdateEntry to 30 days ago. Runs 's3.lifecycle.run-shard' for every
shard via 'weed shell' subprocess and asserts the backdated object is
deleted within 30s, and the in-prefix-but-recent object remains.

The S3 API rejects Expiration.Days < 1, so 'wait a day' is unworkable.
Backdating via the filer's gRPC sidesteps that constraint while still
exercising the real Reader -> Router -> Schedule -> Dispatcher ->
LifecycleDelete RPC path end-to-end.

Wires a new s3-lifecycle-tests job into s3-go-tests.yml. The test runs
all 16 shards because ShardID(bucket, key) is hash-based and the test
shouldn't couple to that detail; running every shard keeps the test
independent of the hash function.

* fix(shell/s3.lifecycle.run-shard): address review findings

- Reject negative -events explicitly. Help text already defines 0 as
  unbounded; negative budgets created ambiguous behavior in pipeline.Run.
- Bound the gRPC dial with a 30s timeout instead of context.Background()
  so an unreachable S3 endpoint doesn't hang the shell.
- Paginate the bucket listing in loadLifecycleCompileInputs. SeaweedList
  takes a single-RPC limit; the prior 4096 silently dropped buckets
  past that page on large clusters. Loop with startFrom until a page
  comes back short.
- Surface parse errors instead of swallowing them. Buckets with
  malformed lifecycle XML now print the first three errors verbatim
  and a count for the rest, so an operator running this command for
  diagnostics can find what's wrong.

* feat(shell/s3.lifecycle.run-shard): -shards range/set with one subscription

Adds -shards "lo-hi" or "a,b,c" to the manual run command and threads
the same model through Reader and Pipeline.

- reader.Reader gains ShardPredicate (func(int) bool) and StartTsNs;
  ShardID stays for the single-shard short form. Event carries the
  computed ShardID so consumers can route per-shard without rehashing.
- dispatcher.Pipeline gains Shards []int. When set, Run holds one
  Cursor + Schedule + Dispatcher per shard, opens one filer
  SubscribeMetadata stream with a predicate covering the whole set,
  and routes events into the matching shard's schedule from a single
  dispatch goroutine — no per-shard goroutine fan-out.
- shell command parses -shard or -shards (mutually exclusive),
  formats progress messages with a contiguous-range label when
  applicable, and validates against ShardCount.

Integration test now uses -shards 0-15 (one subprocess invocation)
instead of a 16-iteration loop.

* fix(s3/lifecycle): allow Reader with StartTsNs=0 + Cursor=nil

The reader rejected the legitimate 'fresh subscription from epoch'
state when called from a fresh Pipeline.Run on a multi-shard worker
(no cursor file yet, all shards' MinTsNs=0). The downstream
SubscribeMetadata call handles SinceNs=0 fine; the up-front check
was over-defensive and broke the auto-scheduler completely (CI
showed 5-second-cadence retries with this exact error).

* fix(s3/lifecycle): schedule from ModTime not eventTime

A backdated or out-of-band entry update has eventTime ≈ now while
ModTime is far in the past; eventTime+Delay would push the dispatch
into the future even though the rule already fires. ModTime+Delay
is the correct fire moment. The dispatcher's identity-CAS still
catches drift between schedule and dispatch.

* fix(s3/lifecycle): -runtime cap on run-shard so it exits on quiet shards

The CI integration test sets -events 200 expecting the subprocess to
return after 200 in-shard events. But -events counts only events that
pass the shard filter; the test produces ~5 such events (bucket
create, lifecycle PUT, two object PUTs, mtime backdate), so the
reader stays in stream.Recv forever and runShellCommand hangs the
test deadline.

- weed/shell/command_s3_lifecycle_run_shard.go: add -runtime D flag.
  When > 0, Pipeline.Run runs under context.WithTimeout(D); on
  expiry the reader/dispatcher drain cleanly and the cursor saves.
- weed/s3api/s3lifecycle/dispatcher/pipeline.go: treat
  context.DeadlineExceeded the same as context.Canceled at exit
  (both are graceful shutdown signals).

* test(s3/lifecycle): pass -runtime 10s to run-shard

Pair with the new -runtime flag so the subprocess exits cleanly
after 10s instead of waiting for an event budget that never lands
on quiet shards.

* refactor(s3/lifecycle): extract HashExtended to s3lifecycle pkg

The worker's router needs the same length-prefixed sha256 of the entry's
Extended map; pulling it out of the s3api private file lets both sides
import it.

* fix(s3/lifecycle): worker captures ExtendedHash for identity-CAS

Without this, the dispatcher sends ExpectedIdentity.ExtendedHash = nil
while the live entry on the server has a non-nil hash, so every dispatch
returns NOOP_RESOLVED:STALE_IDENTITY and nothing is ever deleted.

* fix(s3/lifecycle): identity HeadFid via GetFileIdString

Meta-log events go through BeforeEntrySerialization, which clears
FileChunk.FileId and writes the Fid struct instead. Reading .FileId
directly returns "" on the worker side while the server's freshly
fetched entry still has a populated string, so the identity-CAS would
mismatch and every expiration ended in NOOP_RESOLVED:STALE_IDENTITY.

* fix(s3/lifecycle): treat gRPC Canceled/DeadlineExceeded as graceful exit

errors.Is doesn't unwrap a gRPC status error back to the stdlib ctx
errors, so a subscription that ends because runCtx was canceled was
being logged as a fatal reader error. Check status.Code as well so the
shell's -runtime cap exits cleanly.

* fix(test/s3/lifecycle): pass the gRPC port (not HTTP) to run-shard

run-shard's -s3 flag dials the LifecycleDelete gRPC service, which
listens on s3.port + 10000. The integration test was passing the HTTP
port instead, so the dispatcher's RPC just timed out and the shell
command exited under -runtime with no work done.

* chore(test/s3/lifecycle): drop emoji from Makefile output

* docs(test/s3/lifecycle): correct '-shards 0-15' wording

* fix(s3/lifecycle): reject out-of-range shard IDs in Pipeline.Run

The shell's parseShardsSpec already validates, but a programmatic caller
(scheduler, future worker config) shouldn't be able to silently produce
no-op states by passing -1 or 99.

* fix(s3/lifecycle): bound drain + final-save with their own timeouts

Shutdown was using context.Background, so a stuck dispatcher RPC or
filer save could keep Pipeline.Run from ever returning.

* fix(test/s3/lifecycle): drop self-killing pkill in stop-server

The pkill pattern \"weed mini -dir=...\" is also in the running shell's
argv (it's the recipe body), so pkill -f matches its own bash and the
recipe exits with Terminated. CI test job passed but the cleanup step
failed with exit 2. The PID file is sufficient on its own.

* docs(test/s3/lifecycle): document S3_GRPC_ENDPOINT env var
2026-05-08 09:59:10 -07:00
Chris Lu
12f283357f fix(iam): four phase-3 follow-ups (provider scoping, public path wrapper, static mirror, claim-mode RoleArn) (#9333)
* fix(iam): scope IAM-managed OIDC provider lookup by role account

Two account-scoped OIDC records sharing an issuer were collapsed into a
single map slot keyed only by the URL. The last-write-wins entry then
served every AssumeRoleWithWebIdentity, so a token destined for account
B's role could be validated by account A's record (its clientIDs and
thumbprints), defeating the per-account isolation the records exist for.

The role-account check in enforceProviderAccountScope still rejected
the cross-account assumption, but only after the wrong record's
audience and TLS pin had already accepted the token.

Refresh now keys IAM-managed records as (issuer, account), and
validation parses the requested role's account up front and matches
the record under that issuer in this order: exact account, global
(account-less), static-config fallback. An unknown account hint
deliberately skips account-scoped entries — picking one arbitrarily is
the bug this commit fixes — and falls through to global or static.

* fix(iam): route public AssumeRoleWithWebIdentity through IAMManager

handleAssumeRoleWithWebIdentity called stsService.AssumeRoleWithWebIdentity
directly, bypassing the IAMManager wrapper. The wrapper is where
enforceProviderAccountScope rejects cross-account assumption attempts
and capDurationByRole clamps to the role's MaxSessionDuration; both
silently became no-ops for any AWS-SDK caller hitting the public
endpoint.

Dispatch through the IAMManager (via the existing IAMManagerProvider
interface that other handlers in this file already use) when one is
wired. Embedded test setups without an IAM integration fall back to
the bare STS service unchanged.

* fix(iam): mirror thumbprints, principal-tag keys, and policy claim from static OIDC config

initOIDCProviderStore mirrored only URL and ClientIDs. Once
RefreshOIDCProvidersFromStore ran (on any IAM-managed mutation, or on
boot once the metadata-subscribe loop kicked in),
buildOIDCProviderFromRecord rebuilt the runtime provider from this
truncated record. Because IAM-managed entries take precedence over the
static-config map, the rebuild silently shadowed the bootstrap with a
weaker provider:

- Thumbprints: dropped, so TLS-pinned issuers fell back to the system
  trust store.
- AllowedPrincipalTagKeys: dropped, so principal-tag claims stopped
  reaching the session.
- PolicyClaim: dropped, so claim-based policy mode stopped triggering.

Pull all three from the provider's static Config map at mirror time so
the stored record round-trips to a runtime provider equivalent to the
one the static config produced directly.

* fix(iam): allow empty RoleArn in AssumeRoleWithWebIdentity HTTP handler

Phase 3b advertises that RoleArn MAY be omitted in claim-based policy
mode — the STS service then derives the assumed-role ARN from the
configured policy claim. The HTTP handler still rejected empty RoleArn
up front with MissingParameter, so SDK callers using the documented
omitted-role flow never reached the STS layer.

Drop the pre-check; STS still validates that claim-based mode is
configured and that the IDP emits policies, returning a precise error
when either is missing. The existing error mapping below this point
surfaces those as InvalidParameterValue, matching what an AWS SDK
expects.

* test(iam): update missing-RoleArn STS integration test for the new contract

The previous commit drops the HTTP-layer RoleArn pre-check so claim-based
mode can derive the ARN from a JWT claim. The integration test still
asserted MissingParameter for the missing-RoleArn case, which now
reaches the STS layer and surfaces a JWT-parse error instead. Update
the assertion to match: missing RoleArn alone must no longer surface
as MissingParameter, but a bogus JWT must still be rejected.
2026-05-05 19:14:44 -07:00
Chris Lu
6141222ab0 fix(test/s3/policy): allocate fresh admin port per subtest (#9332)
* fix(test/s3/policy): allocate fresh admin port per subtest

startMiniCluster ran weed mini in-process and explicitly assigned
master/volume/filer/s3 ports allocated by MustAllocatePorts, but it
left -admin.port and -admin.port.grpc unset, so each subtest reused
the hardcoded defaults 23646 / 33646.

The package's subtests run sequentially within the same go test
process. The previous subtest's admin goroutine is still bound to
23646 by the time the next subtest spins up its own mini, so the
new admin can never bind, mini.go's waitForAdminServerReady hits
its 240-attempt cap, and glog.Fatalf kills the test binary. This
has been the dominant cause of "admin server did not become ready"
flakes across recent IAM PRs.

Allocate two extra ports for admin and pass them through. The other
subprocess-based tests (s3tables/*) are not affected because each
launches weed mini in a fresh OS process.

* fix(mini): make admin readiness wait context-aware

waitForAdminServerReady polled for 240 attempts × 500ms regardless of
whether the surrounding mini context was cancelled. When mini is run
in-process from a test harness (test/s3/policy/...) and the test calls
its cancel func, the leftover wait keeps spinning for the full two
minutes and then glog.Fatalf's, terminating the entire test binary —
including any sibling subtest that has since started its own mini.

Thread the existing miniClientsCtx through the wait so a Stop / cancel
returns context.Canceled immediately. The caller (startMiniAdminWithWorker)
treats a context-cancelled outcome as a graceful shutdown signal and
logs+returns instead of fataling.
2026-05-05 11:24:43 -07:00
Chris Lu
6d95a5592a test(s3/policy): stop racing t.TempDir cleanup against mini shutdown
The mini cluster's admin/plugin worker keeps creating files under
admin/plugin/job_types/ for ~1s after subtests finish, while the
previous Stop() only cancelled an unobserved context and slept 500ms.
t.TempDir()'s registered RemoveAll then raced the worker and
intermittently failed with "directory not empty" (CI run 25352039081).

Manage the data dir manually so it is removed only after the mini
goroutine has exited, and wire MiniClusterCtx so cancel actually drains
master/volume/filer/admin/s3/webdav.
2026-05-04 20:16:53 -07:00
Chris Lu
6844ec067c fix(s3): cache remote-only source before CopyObject (#9304) (#9305)
* fix(s3): cache remote-only source before CopyObject (#9304)

CopyObject from a remote.mount source whose object lives only upstream
created a destination entry with FileSize > 0 but no chunks/content,
because the resolved source entry has no local chunks and the copy path
fell into the "inline/empty chunks" branch with empty entry.Content.
A subsequent GET returned 500 with "data integrity error: size N
reported but no content available". CopyObjectPart had the same shape
via copyChunksForRange iterating an empty chunk list.

Detect entry.IsInRemoteOnly() right after resolving the source in both
CopyObjectHandler and CopyObjectPartHandler and cache the object to the
local cluster first via a new cacheRemoteObjectForCopy helper (a
copy-time analogue of cacheRemoteObjectForStreaming with a bounded 30s
timeout and version-aware path resolution). If caching fails or
produces no chunks, return 503 with Retry-After: 5 instead of writing
a metadata-only destination, mirroring the GetObject behavior added in
the #7817 cold-cache fix.

Adds TestCopyObjectRemoteOnlySourceDetection pinning the four entry
shapes the fix branches on plus the pre-fix broken-output shape.

* address PR review on remote-only copy fix

- Use the resolved entry's version id when srcVersionId is empty so a
  CopyObject reading the latest object in a versioning-enabled bucket
  caches the correct .versions/v_<id> path instead of getting stuck in
  a 503 retry loop. New helper resolvedSourceVersionId handles the
  fallback for both CopyObject and CopyObjectPart.
- Drop the redundant cachedEntry.IsInRemoteOnly() recheck in both
  handlers; the cache helper now reports success based on local data
  presence, and IsInRemoteOnly does not look at inline Content so
  keeping the check would 503 on small inline-cached objects.
- Treat inline Content as a successful cache result in both
  cacheRemoteObjectForStreaming and cacheRemoteObjectForCopy via a
  shared cachedEntryHasLocalData predicate. The CopyObject inline
  branch already handles entries that have Content but no chunks.
- Extract buildVersionedRemoteObjectPath so the streaming and copy
  cache helpers share path construction.

Adds TestResolvedSourceVersionId and TestCachedEntryHasLocalData to
pin the new helpers' contracts.

* narrow streaming cache contract back to chunks-only

CodeRabbit flagged that cacheRemoteObjectForStreaming's caller in
streamFromVolumeServers (lines 997-1002) still required non-empty
chunks, so content-only cache hits would fall through to a 503 retry
loop instead of being honored.

Resolve by keeping the helper's contract chunks-only: the filer's
caching code only ever writes chunks, the streaming downstream isn't
wired to read inline Content from a cached entry, and a partial range-
aware inline writer here would be overkill for a path that doesn't
actually occur in practice. cacheRemoteObjectForCopy keeps the relaxed
contract since the copy path's inline branch genuinely handles both
chunked and content-only entries.

Document the asymmetry on cachedEntryHasLocalData and on
cacheRemoteObjectForStreaming so a future reader can see why the two
helpers diverge.

* extend version-id resolution to streaming cache path

CodeRabbit flagged that GetObjectHandler still passed the raw query
versionId to cacheRemoteObjectForStreaming. For latest-version reads
in versioning-enabled buckets that stays empty even though the
resolved entry lives at .versions/v_<id>, so remote-only GETs would
keep caching the wrong path and 503-ing forever. Reuse the new
resolvedSourceVersionId helper at the streaming call site too.

Also document on cachedEntryHasLocalData that the zero-byte case
flagged in the same review is handled upstream (IsInRemoteOnly
requires RemoteSize > 0, so the cache helper is never invoked for
empty remote objects -- CopyObject's pre-existing inline branch
writes a correct empty destination directly). Pin this with a new
test case.

* trim verbose comments

Drop tutorial-style and review-history comments. Keep only the WHY
that isn't obvious from identifiers: the #9304 reference on the new
branches in CopyObject / CopyObjectPart, the latest-version-fallback
rationale on resolvedSourceVersionId, and the streaming/copy contract
asymmetry on cachedEntryHasLocalData.

* drop issue references from comments

Issue numbers belong in PR descriptions and commit messages, not in
source comments where they rot. Replace with the underlying invariant
the code is preserving.

* test: drive remote-object cache helpers through real gRPC

Existing tests only re-enacted helper-function branching in test
space, so they could not have caught a handler that consumed a
remote-only entry without going through the cache. Stand up an
in-process filer gRPC server (UnimplementedSeaweedFilerServer +
configurable CacheRemoteObjectToLocalCluster response) and exercise
the two cache helpers end-to-end.

What's pinned:
- cacheRemoteObjectForCopy returns nil when the cache makes no
  progress (response is still remote-only), lets gRPC errors
  through as nil, accepts both chunked and inline-content cache
  hits, and surfaces deadline-exceeded as nil so callers can 503
  instead of holding the request open.
- Versioned source paths route to .versions/v_<id>; non-versioned
  and "null" stay at the bucket-relative path. Captured by reading
  the request the stub server received.
- cacheRemoteObjectForStreaming holds the stricter chunks-only
  contract: a content-only cache hit is not propagated, since
  streamFromVolumeServers' downstream isn't wired to read from
  inline Content there.

Any current or future handler that calls these helpers exercises
the same gRPC path under test, so the bug class is closed for
helper-routed cache calls.

* move remote-only copy test into the integration suite

The previous gRPC-stub test in weed/s3api/ was integration-flavored
but stubbed; relocate the coverage to the existing two-server suite
under test/s3/remote_cache/, which already exercises the real
remote.mount + remote.uncache flow against a primary SeaweedFS plus
a secondary acting as remote storage.

The new test/s3/remote_cache/remote_cache_copy_test.go drives:
- TestRemoteCacheCopyObject: upload to primary, uncache (entry now
  remote-only), CopyObject to a new key, GET the destination. Pre-
  fix the GET returned 500 'data integrity error: size N reported
  but no content'; this pins the fixed behavior over real HTTP
  through the actual handler stack.
- TestRemoteCacheCopyObjectPart: same shape via multipart
  UploadPartCopy on a 6 MiB object split into two parts, exercising
  CopyObjectPartHandler's range-copy path.

Drop weed/s3api/s3api_remote_storage_grpc_test.go: the helper-level
classification tests in s3api_remote_storage_test.go still cover
the contract pieces (cachedEntryHasLocalData, resolvedSourceVersionId,
the remote-only entry shape), and the integration suite covers the
end-to-end behavior that those classifications enable.
2026-05-03 18:52:45 -07:00
Chris Lu
1da091f798 ci: bring previously-uncovered integration tests into CI (#9281 follow-up) (#9283)
* ci: bring previously-uncovered integration tests into CI (#9281 follow-up)

Six integration test packages had _test.go files but no GitHub workflow
running them. The s3-sse-tests CI gap that let #8908's UploadPartCopy bug
(and the four cross-SSE copy bugs in #9281) ship undetected was an
instance of this same pattern. This change wires three of them into CI
and removes a fourth that was deadcode:

  test/multi_master/                    NEW workflow: multi-master-tests.yml
    - 3-node master raft cluster failover/recovery (5 tests, ~65s)
  test/testutil/                         (run alongside multi_master)
    - port-allocator regression test
  test/s3/etag/                          NEW workflow: s3-etag-acl-tests.yml
    - PutObject ETag format regression for #7768 (must be pure MD5 hex,
      not "<md5>-N" composite, for AWS Java SDK v2 compatibility)
  test/s3/acl/                           (same workflow as etag)
    - object-ACL behavior on versioned buckets

  test/s3/catalog_trino/                 DELETED (deadcode)
    - Single-file copy of test/s3tables/catalog_trino/trino_catalog_test.go
      from a 2024 commit that was never iterated, while the
      test/s3tables/ counterpart has been actively maintained (and IS
      in CI via s3-tables-tests.yml's trino-iceberg-catalog-tests job).

Both workflows trigger only on changes to relevant code paths and use the
existing simple "build weed → run go test" pattern (no per-test-dir
Makefile boilerplate). The S3 workflow starts a single `weed mini`
shared by etag and acl, which keeps the job under 2 minutes on a fresh
runner.

Two tests remain knowingly uncovered:

  test/s3/basic/  — order-dependent state across tests (TestListObjectV2
    expects a bucket created by an earlier test, etc.) and uses the
    deprecated aws-sdk-go v1. Treated as sample programs, not a
    regression suite. Fixing them is out of scope for this PR.

  test/s3/catalog_trino/  — see "DELETED" above.

Verified locally:
  - go test -v -timeout=8m ./test/multi_master/... ./test/testutil/...
    PASS  (5 multi_master + 1 testutil tests, 64s)
  - weed mini + go test ./test/s3/etag/... + go test ./test/s3/acl/...
    PASS  (8 etag + 5 acl tests, ~6s after server startup)

* ci: fix log-collector glob for multi-master tests (review feedback on #9283)

test/multi_master/cluster.go creates per-test temp dirs via
os.MkdirTemp("", "seaweedfs_multi_master_it_"), so the glob has to
match that prefix. The previous version looked for MasterCluster* /
TestLeader* / TestTwoMasters* / TestAllMasters* which never matches —
the failure-artifact upload would have been empty on a real failure.

Switch the find to /tmp/seaweedfs_multi_master_it_* (maxdepth 1) so it
actually picks up the per-node master*.log files under
<baseDir>/logs/.

Found by coderabbitai review on PR #9283.
2026-04-29 10:06:59 -07:00
Chris Lu
82cf60a44f fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908) (#9280)
* fix(s3api): re-encrypt UploadPartCopy bytes for the destination's SSE config (#8908)

The remaining failure mode in #8908 was that Docker Registry's blob
finalization (server-side Move via UploadPartCopy) silently corrupts
SSE-S3 multipart objects. Reproduces with `aws s3api upload-part-copy`
under bucket-default SSE-S3: the GET on the completed object returns
deterministic wrong bytes (correct length, same wrong SHA-256 across
runs). The metadata is mathematically self-consistent — every chunk's
stored IV equals `calculateIVWithOffset(baseIV_dst, partLocalOffset)` —
but the bytes on disk were encrypted with the SOURCE upload's key+baseIV.

Root cause:

  - `copyChunksForRange` (and `createDestinationChunk`) constructs new
    chunks for UploadPartCopy without copying `SseType` / `SseMetadata`,
    so destination chunks are written with `SseType=NONE`.

  - At completion, `completedMultipartChunk` (PR #9224's NONE→SSE_S3
    backfill, intended to recover from a different missing-metadata
    bug) sees those NONE chunks under an SSE-S3 multipart upload and
    backfills SSE-S3 metadata derived from the destination upload's
    baseIV. The chunk metadata is now internally consistent and the
    GET path applies decryption — but the bytes on disk are encrypted
    with the source upload's key, not the destination's. Decryption
    produces deterministic garbage. Docker Registry pulls then fail
    with "Digest did not match".

Fix: when either the source object or the destination multipart upload
has any SSE configured, take a slow-path UploadPartCopy that
(1) opens a plaintext reader of the source range — decrypting the
source's per-chunk SSE-S3 metadata if needed via a reused
`buildMultipartSSES3Reader`, and
(2) feeds that plaintext through `putToFiler`'s existing encryption
pipeline by staging the destination upload entry's SSE-S3/SSE-KMS
headers on a cloned request. Encryption then matches PutObjectPart's
contract: every part starts a fresh CTR stream from counter 0 with
`baseIV_dst`, and each internal chunk's metadata records
`calculateIVWithOffset(baseIV_dst, chunk.partLocalOffset)`.

The `non-SSE → non-SSE` case still takes the existing fast raw-byte
copy path — bytes on disk are plaintext on both sides, so chunk-level
metadata is irrelevant.

Cross-encryption from SSE-KMS / SSE-C sources is left as TODO — the
new path returns an explicit error rather than the previous silent
corruption. SSE-S3 (the user-reported case) round-trips correctly.

Tests:

  - test/s3/sse/s3_sse_uploadpartcopy_integration_test.go pins three
    UploadPartCopy shapes against bucket-default SSE-S3:
    * Docker-Registry-shape 32MB+tail (the user's exact 5-chunk /
      2-part metadata layout)
    * single full-object UploadPartCopy
    * many small range copies
    Each round-trips SHA-256.

  - test/s3/sse/s3_sse_concurrent_repro_test.go covers the parallel
    multipart-upload shape from the user report (5 blobs in parallel,
    full GET and chunked range GET both hash-checked) — pre-existing
    coverage; added here as a regression sentinel.

* test(s3-sse): rename UploadPartCopy regression test so CI matches it

The CI workflow .github/workflows/s3-sse-tests.yml dispatches on the
TEST_PATTERN ".*Multipart.*Integration" — i.e. the test name must contain
both "Multipart" and "Integration" for CI to run it. The previous name
TestSSES3UploadPartCopyIntegration had only "Integration"; "UploadPart"
isn't "Multipart". Rename to TestSSES3MultipartUploadPartCopyIntegration
so the regression test actually runs in CI rather than only locally.

* fix(s3api): map unsupported UploadPartCopy SSE source to 501, not 500 (review feedback on #9280)

openSourcePlaintextReader explicitly rejects SSE-KMS and SSE-C sources
(SSE-S3 is the only one wired up in this slow path so far). Earlier
the caller blanket-mapped that to ErrInternalError, which collapses
"this shape isn't implemented yet" into the same 500 response a
real server failure would produce. Clients can no longer tell whether
they hit a feature gap or a bug.

Introduce a sentinel errCopySourceSSEUnsupported and have
copyObjectPartViaReencryption errors.Is-check it; on match, return
ErrNotImplemented (501) instead of ErrInternalError (500). Other
failures still map to 500.

Found by coderabbitai review on PR #9280.

* fix(s3api): UploadPartCopy must fail with NoSuchUpload when upload entry is missing (review feedback on #9280)

CopyObjectPartHandler's earlier checkUploadId call only verifies that
the uploadID's hash prefix matches dstObject; it does not prove the
upload directory exists in the filer. The previous logic silently
swallowed filer_pb.ErrNotFound from getEntry(uploadDir) and fell
through with uploadEntry=nil, which then skipped the destination
SSE check and could route a plain-source copy through the raw-byte
fast path even though the destination's encryption state is unknown.

Treat ErrNotFound as ErrNoSuchUpload so the client sees the right
status, matching the AWS S3 contract for UploadPartCopy on a
non-existent upload.

Found by coderabbitai review on PR #9280.

* feat(s3api): set SSE response headers on UploadPartCopy slow path (review feedback on #9280)

PutObjectPartHandler writes x-amz-server-side-encryption (and the KMS
key-id header for SSE-KMS) on every successful part response so clients
can confirm the destination's encryption state. The new UploadPartCopy
slow path was missing this — it returned only the ETag in the response
body and no SSE response headers.

Plumb putToFiler's SSEResponseMetadata back through
copyObjectPartViaReencryption to the handler, then call
setSSEResponseHeaders before writing the XML response, matching the
PutObjectPart contract.

Found by gemini-code-assist review on PR #9280.

* fix(s3api): map transient filer errors on UploadPartCopy upload-entry fetch to 503 (review feedback on #9280)

Earlier non-ErrNotFound errors from getEntry(uploadDir, uploadID) all
returned 500 InternalError, which most SDKs treat as fatal — even
though a transient filer outage (gRPC Unavailable, leader election in
flight, deadline exceeded) is exactly the kind of failure SDK retry
logic is supposed to recover from.

Add an isTransientFilerError helper that recognises:
  - context.DeadlineExceeded / context.Canceled
  - gRPC codes.Unavailable, DeadlineExceeded, ResourceExhausted, Aborted

When the upload-entry fetch fails for one of those reasons, return
503 ServiceUnavailable so the client retries; everything else still
maps to 500. Log line now also carries dstObject (in addition to
dstBucket and uploadID) to make incident triage easier.

Found by gemini-code-assist review on PR #9280.
2026-04-29 09:46:44 -07:00
Chris Lu
02574314f6 test(s3): force-drop collection after deleteBucket in tagging/versioning/cors/copying (#9270)
* test(s3): force-drop collection after deleteBucket across tagging/versioning/cors/copying

Each test creates a unique bucket (= new SeaweedFS collection) and the master's
warm-create issues a 7-volume grow batch. The S3 DeleteBucket-driven collection
sweep snapshots the layout once, but in-flight `volume_grow` requests keep
registering volumes after the snapshot, leaking 1-3 volumes per bucket. On a
single `weed mini` data node with the auto-derived volume cap, those leaks pile
up fast and every subsequent PutObject 500s with "Not enough data nodes found".

Mirror the retention-suite fix (commits ac3a756d, 363d5caa) into the four other
suites that share the same shape: defer a master-side /col/delete after each
bucket teardown, and sweep stale prefix-matching buckets before every new
createBucket. Each suite gets its own MASTER_ENDPOINT default plus the
allTestBucketPrefixes / cleanupAllTestBuckets / cleanupLeftoverTestBuckets /
forceDeleteCollection helpers.

Skipped: iam (separate framework + v1 SDK; already passes
-master.volumeSizeLimitMB=100), sse (different cleanup signature; docker-compose
already passes -volumeSizeLimitMB=50), policy (its TestCluster already uses
-master.volumeSizeLimitMB=32 and tears the whole cluster down). Suites under 5
tests cannot exhaust the cap and were left untouched.

* test(s3): default master endpoint to 127.0.0.1 to avoid IPv6 resolution

Mirror the explicit IPv4 default already used by test/s3/copying. On hosts
where `localhost` resolves to ::1 first the master HTTP listener (bound on
0.0.0.0) is unreachable, so the new force-delete-collection helper would silently
skip cleanup. Pin the default to 127.0.0.1 in tagging/cors/versioning; the
MASTER_ENDPOINT env var still wins when set.

* test(s3): skip buckets newer than this process in leftover-bucket sweep

cleanupAllTestBuckets/cleanupTestBuckets used to delete every bucket whose name
matched the suite's prefix. With shared S3/master endpoints, a second concurrent
`go test` run could have its live buckets torn down mid-test by the first run's
sweep.

Capture testRunStart at package init (with a 1-minute backdate for clock skew)
and skip any bucket whose CreationDate is newer than that. Stale buckets from
panicked or interrupted prior runs (the original target of the sweep) still get
collected because they were created before this process started.

* test(s3-copying): sweep stale prefix buckets from createBucket, not just from a few callers

The s3-copying suite already had cleanupTestBuckets that walks every bucket and
drops the test-copying-* prefix matches, but only four tests in the file invoke
it; the rest go straight to createBucket. So a panicked or interrupted prior run
could leak buckets that survive into the next run and exhaust the data node's
volume slots before any of the prefix-sweeping tests get a chance to run.

Hoist the sweep into createBucket so every test that creates a bucket starts on
a clean slate. The per-process CreationDate filter from the prior commit keeps
this safe under concurrent runs.

* test(s3): scope leftover-bucket sweep to this run via runID marker

The CreationDate filter only protected against runs that started later than this
process. A different `go test` against the same S3/master endpoints that started
*earlier* and is still active has buckets older than testRunStart, so the sweep
would still tear them down mid-test.

Replace the time window with a per-process runID baked into bucket names: every
bucket this run creates gets `r{runID}-` after the suite's BucketPrefix, and the
sweep only matches that owned subset. Concurrent runs each carry their own
runID and never see each other's buckets.

Trade-off: buckets left behind by a crashed prior run carry a different runID
and won't be cleaned up by this sweep. That recovery path now belongs to
`make clean` / data-dir wipe, which is what CI already does between jobs.
Fixed-name versioning buckets (e.g. test-versioning-directories) bypass
getNewBucketName and so also bypass this sweep — they are short-lived and
handled by their own deferred deleteBucket.

* test(s3): drop cleanupLeftoverTestBuckets wrapper, call cleanupAllTestBuckets directly

cleanupLeftoverTestBuckets was a one-line forwarder kept only as a semantic name
at the createBucket call site. Inline the call and update the doc comments to
point at the actual implementation. tagging/cors/versioning all had the wrapper;
copying never did.
2026-04-28 22:13:42 -07:00
Chris Lu
363d5caa85 test(s3-retention): purge stale buckets before each create to avoid volume exhaustion
The WORM suite creates one bucket per test, each backed by ~3 reserved
volumes on the data node. With ~30 tests and the default `weed mini`
volume cap, the data node runs out of slots midway through the run and
every PutObject after that fails with InternalError.

Hook a sweep of every test-prefix bucket into the create helpers so a
panicked or interrupted prior test cannot leak buckets into the next.
2026-04-27 22:14:20 -07:00
Chris Lu
ac3a756dae test(s3-retention): force-drop collection after deleteBucket to free volumes
COMPLIANCE-mode retention leaves objects that BypassGovernanceRetention cannot
clear, so the test's DeleteBucket keeps returning BucketNotEmpty and the
underlying SeaweedFS collection (with its 7 reserved volumes) leaks. After a
few leaks on the single-node `weed mini` server, the master logs "Not enough
data nodes found" and every subsequent PutObject 500s, timing the suite out.

Call the master's /col/delete admin endpoint from deleteBucket so the
collection's volumes are reclaimed even when S3-level cleanup is blocked.
2026-04-27 12:12:36 -07:00
Chris Lu
4f628ff4e5 fix(s3api): stream multipart-SSE chunks lazily to avoid truncated GETs (#8908) (#9228)
* fix(s3api): stream multipart SSE-S3 chunks lazily to avoid truncated GETs (#8908)

buildMultipartSSES3Reader opened a volume-server HTTP response for EVERY
chunk upfront, then walked them with io.MultiReader. For a multipart
SSE-S3 object with N internal chunks (e.g. a 200MB Docker Registry blob
with 25+ chunks), N volume-server bodies sat live at once; chunks
1..N-1 were idle while io.MultiReader drained chunk 0. Under concurrent
load the volume server's keep-alive logic closed those idle responses
mid-flight, and the S3 client saw `unexpected EOF` partway through the
GET. Truncated bytes hash to the wrong SHA-256, which is exactly the
"Digest did not match" symptom Docker Registry reports in #8908 (and
which persisted even after the per-chunk metadata fix in #9211 and the
completion backfill in #9224).

Introduce lazyMultipartChunkReader + preparedMultipartChunk{chunk,
wrap}: a generic lazy chunk streamer with a per-chunk wrap closure for
the SSE-specific decryption setup. Per-chunk metadata is still
validated UPFRONT so a malformed chunk fails fast without opening any
HTTP connection -- the eager validation contract callers and tests
rely on is preserved. The volume-server GET and the SSE-specific
decrypt wrap, however, fire LAZILY: at most one chunk body is live at
any time, regardless of object size.

This commit applies the new pattern to buildMultipartSSES3Reader only;
the SSE-KMS and SSE-C multipart readers retain their eager form for
now and will be migrated in follow-up commits, since the same shape
exists there too.

Tests:
  - TestBuildMultipartSSES3Reader_LazyChunkFetch pins the new contract:
    zero chunks opened at construction, peak liveness == 1, all closed
    after drain.
  - TestBuildMultipartSSES3Reader_RejectsBadChunkBeforeAnyFetch
    (replaces ClosesAppendedOnError) asserts a malformed chunk in
    position N causes zero fetches for chunks 0..N -- the previous
    test pinned a weaker contract (cleanup after eager open).
  - TestBuildMultipartSSES3Reader_InvalidIVLength updated for the same
    reason: the fetch callback must NOT be invoked at all on a bad-IV
    chunk.
  - TestMultipartSSES3RealisticEndToEnd round-trips multiple parts
    encrypted the way putToFiler writes them (shared DEK + baseIV,
    partOffset=0, post-completion global offsets) and walks them
    through buildMultipartSSES3Reader.

* fix(s3api): stream multipart SSE-KMS chunks lazily

Apply the same fix as the previous commit to
createMultipartSSEKMSDecryptedReaderDirect: per-chunk SSE-KMS metadata
is validated upfront, but volume-server GETs fire lazily through
lazyMultipartChunkReader. At most one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-KMS multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The wire format on disk is unchanged (same per-chunk metadata, same
encrypted bytes, same object Extended attributes). Existing SSE-KMS
multipart objects read back identically -- only when the volume-server
GETs fire changes.

* fix(s3api): stream multipart SSE-C chunks lazily

Apply the same fix as the previous two commits to
createMultipartSSECDecryptedReaderDirect: per-chunk SSE-C metadata is
validated upfront (IV decode, IV length check, non-negative
PartOffset), but the volume-server GET and CreateSSECDecryptedReader-
WithOffset wrap fire lazily through lazyMultipartChunkReader. At most
one chunk body is live at any time.

This is the same eager-open-all-chunks shape that produced #8908's
truncated GETs for SSE-S3; SSE-C multipart objects with many chunks
were exposed to the same idle-keepalive failure mode under concurrent
load.

The pre-existing TODO note about CopyObject SSE-C PartOffset handling
is preserved verbatim. The wire format on disk is unchanged (same
per-chunk metadata, same encrypted bytes); existing SSE-C multipart
objects read back identically.

After this commit all three multipart SSE read paths (SSE-S3, SSE-KMS,
SSE-C) share lazyMultipartChunkReader as their streaming engine.

* test(s3): add Docker Registry-shape multipart SSE-S3 GET regression

Pin the end-to-end fix for #8908 with a test that mirrors what Docker
Registry actually does on pull: a 25-part * 5MB upload with bucket-
default SSE-S3, then a full GET, then SHA-256 over the streamed body
must match SHA-256 over the uploaded bytes.

The eager-multipart-reader bug was specifically a streaming truncation
under load: the response status was 200 with a Content-Length matching
the object size, but the body short-circuited mid-stream because
later chunks' volume-server connections had already been closed by
keepalive. The hash check is the symptom Docker Registry surfaces
("Digest did not match"), so this is the most faithful regression we
can pin without spinning up a registry.

uploadAndVerifyMultipartSSEObject already byte-compares the GET body,
but hashing on top is intentionally explicit -- it documents WHY the
test exists, and matches the failure mode reported in the issue.

* test(s3): add range-read coverage matrix across SSE modes and sizes

Existing range-read coverage in test/s3/sse was scoped to small (<= 1MB)
single-chunk objects, with one ad-hoc range case per SSE mode and one
129-byte boundary-crossing case in TestSSEMultipartUploadIntegration.
Nothing exercised:

  - Range reads on single-PUT objects whose content crosses the 8MB
    internal chunk boundary (medium size class).
  - Range reads on multipart objects whose parts each span multiple
    internal chunks (large size class) -- the shape #8908 originally
    surfaced for full-object GETs and the most likely site of any
    future regression in per-chunk IV / PartOffset plumbing for
    partial reads.
  - A consistent range-pattern set applied uniformly across SSE modes,
    so any divergence between modes (SSE-C uses random IV + PartOffset;
    SSE-S3/KMS use base IV + offset) is comparable at a glance.

TestSSERangeReadCoverageMatrix introduces a parameterized matrix:

  modes:     no_sse, sse_c, sse_kms, sse_s3
  sizes:     small (256KB single chunk),
             medium (12MB single PUT crossing one internal boundary),
             large (5x9MB multipart, ~10 internal chunks, every part
                    itself spans an 8MB boundary)
  ranges:    single byte at 0, prefix 512B, single byte at last,
             suffix bytes=-100, open-ended bytes=N-, whole object,
             AES-block boundary 15-31, mid straddling one internal
             boundary (medium+large), mid spanning many internal
             boundaries (large only)

Per case it asserts: body bytes equal the expected slice, Content-Length
matches the range length, Content-Range matches start-end/total, and the
SSE response headers match the mode.

The sse_kms branch probes once with a 1-byte SSE-KMS PUT and t.Skip's
the remaining sse_kms subtests with a clear reason if the local server
has no KMS provider configured -- the default `weed mini` setup lacks
one; the Makefile target `test-with-kms` provides one via OpenBao. Other
modes always run.

Verified locally: 75 subtests pass under no_sse / sse_c / sse_s3 against
weed mini, sse_kms cleanly skipped.

* test(s3): conform new test names to TestSSE*Integration so CI runs them

The two tests added in the previous commits had names that did NOT match
the patterns the test/s3/sse Makefile and .github/workflows/s3-sse-tests.yml
use to discover SSE integration tests:

  - test/s3/sse/Makefile `test` target:           TestSSE.*Integration
  - test/s3/sse/Makefile `test-multipart`:        TestSSEMultipartUploadIntegration
  - .github/workflows/s3-sse-tests.yml:           ...|.*Multipart.*Integration|.*RangeRequestsServerBehavior

Result: SSE-KMS coverage I added to TestSSERangeReadCoverageMatrix and
the Docker-Registry-shape multipart regression in
TestSSES3MultipartManyChunks_DockerRegistryShape were silently invisible
to CI even though the underlying test setup (start-seaweedfs-ci using
s3-config-template.json with the embedded `local` KMS provider) already
has SSE-KMS configured.

Renames:

  TestSSERangeReadCoverageMatrix              -> TestSSERangeReadIntegration
  TestSSES3MultipartManyChunks_...            -> TestSSEMultipartManyChunksIntegration

Both names now match `TestSSE.*Integration` (Makefile `test` target) and
TestSSEMultipartManyChunksIntegration additionally matches
`.*Multipart.*Integration` (CI's comprehensive subset). No behavior
change; only the function names move.

Verified locally against `weed mini` with s3-config-template.json:
TestSSERangeReadIntegration runs 96 leaf subtests across 4 SSE modes
(none, SSE-C, SSE-KMS, SSE-S3) x 3 size classes x 7-9 range patterns,
all passing, 0 skipped. The probe-and-skip in the SSE-KMS arm now only
fires for ad-hoc local setups that don't load any KMS provider; the
project's standard test setup loads the local provider, so CI has full
SSE-KMS range coverage.

* fix(s3api): validate SSE-KMS chunk IV during prep, before any fetch

Addresses CodeRabbit review on PR #9228: in
createMultipartSSEKMSDecryptedReaderDirect the per-chunk SSE-KMS metadata
was deserialized in the prep loop but the IV length was only validated
later, inside CreateSSEKMSDecryptedReader, which runs from the wrap
closure -- AFTER the chunk's volume-server fetch has already started.
That weakens the new "reject malformed chunks before any fetch" contract
for SSE-KMS specifically: a chunk with a missing/short/long IV would
fire its HTTP GET, then fail mid-stream during decrypt.

The fix moves the existing ValidateIV check into the prep loop, matching
the SSE-S3 and SSE-C paths.

Drive-by: extract the SSE-KMS prep loop into a free
buildMultipartSSEKMSReader helper that mirrors buildMultipartSSES3Reader,
so the new contract is unit-testable without an S3ApiServer. The
exported method (createMultipartSSEKMSDecryptedReaderDirect) stays a
thin caller, so behavior for production callers is unchanged.

New tests in weed/s3api/s3api_multipart_ssekms_test.go pin the contract:

  - TestBuildMultipartSSEKMSReader_RejectsBadIVBeforeAnyFetch covers
    missing IV, empty IV, short IV, long IV. Each case asserts both
    that an error is returned AND that the fetch callback is never
    invoked.
  - TestBuildMultipartSSEKMSReader_RejectsMissingMetadataBeforeAnyFetch
    pins the analogous behavior when SseMetadata is nil on a chunk in
    position N: chunks 0..N-1 must not be fetched (the earlier eager
    implementation depended on a closeAppendedReaders cleanup path; the
    new contract is stronger -- nothing is opened in the first place).
  - TestBuildMultipartSSEKMSReader_RejectsUnparseableMetadataBeforeAnyFetch
    covers the JSON-unmarshal failure branch.
  - TestBuildMultipartSSEKMSReader_SortsByOffset smoke-tests the
    documented sort-by-offset contract by recording the order in which
    fetch is invoked.

All four pass under `go test ./weed/s3api/`. Existing weed/s3api unit
suite + the SSE integration suite (with the local KMS provider enabled
via s3-config-template.json) continue to pass.

* test(s3): address CodeRabbit nitpicks on range coverage matrix

Three small follow-ups on the range-read coverage matrix from the
previous commit, per CodeRabbit nitpicks on PR #9228:

1. Promote the body-length check from `assert.Equal` to `require.Equal`
   so a truncation regression -- the canonical #8908 failure mode --
   aborts the subtest immediately. Previously the assertion logged a
   length mismatch and then `assertDataEqual` ran on differently-sized
   slices, producing a noisy byte-diff on top of the actual symptom.
   The redundant trailing `t.Fatalf` block becomes dead and is removed.

2. Broaden the SSE-KMS probe-skip heuristic. The probe previously
   produced the friendly "KMS provider not configured" message only
   for 5xx responses; KMS-misconfig surfaces also include 501
   NotImplemented, 4xx KMS.NotConfigured, and error messages
   containing "KMS.NotConfigured" / "NotImplemented" /
   "not configured". The behaviour change is purely cosmetic (the
   caller t.Skip's on any non-empty reason either way) but the new
   diagnostic is more useful in CI logs.

3. Add `t.Parallel()` at the mode and size-class levels of the matrix.
   Each (mode, size) writes an independent object key under the shared
   bucket, with no cross-talk, so parallel execution is safe. Local
   wall time on the full matrix dropped from ~2.0s to ~1.1s (~45%);
   the savings scale with chunk count and CI machine concurrency.

Verified locally against `weed mini` with s3-config-template.json:
  - go test ./weed/s3api/ -count=1                   PASS
  - TestSSERangeReadIntegration -v                   112 PASS, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.           PASS

* fix(s3api): tighten lazy reader error path; unify SSE IV validation

Three CodeRabbit nitpicks on PR #9228:

1. lazyMultipartChunkReader: mark finished on non-EOF Read errors

   The Read loop's three earlier failure paths (chunk index past end,
   fetch error, wrap error) all set l.finished = true before returning.
   The non-EOF Read path -- where l.current.Read itself errors mid-chunk
   -- did not, leaving l.current/l.closer set and l.finished = false. A
   caller that retried Read after an error would re-enter the same
   broken stream instead of advancing or giving up. Set l.finished =
   true on non-EOF Read error so post-error state is consistent across
   all four failure sites; Close() (which the GetObjectHandler defers)
   still releases the chunk body.

2. Unify IV-length validation across SSE-S3, SSE-KMS, SSE-C prep paths

   The previous commit moved SSE-KMS to the shared ValidateIV helper
   but left SSE-S3 and SSE-C with bespoke inline `len(...) !=
   AESBlockSize` checks. All three are enforcing the same invariant;
   inconsistency obscures the symmetry. Move SSE-S3 and SSE-C to
   ValidateIV too, with the same `<algo> chunk <fileId> IV` name
   convention. Error message wording shifts from "<algo> chunk X has
   invalid IV length N (expected 16)" to ValidateIV's "invalid <algo>
   chunk X IV length: expected 16 bytes, got N". The substring
   "IV length" is preserved across both, so the existing
   TestBuildMultipartSSES3Reader_InvalidIVLength substring assertion
   is loosened to match either form.

3. TestBuildMultipartSSEKMSReader_SortsByOffset: verify full ordering

   The test previously drove Read() to observe fetch-call order, but
   CreateSSEKMSDecryptedReader requires a live KMS provider to unwrap
   the encrypted DEK -- unavailable in unit tests -- so the wrap
   closure failed on the first chunk and only one fetch was ever
   recorded. The test asserted only fetchOrder[0] == "c0", which is
   weaker than the comment promised.

   Switch to a static check: type-assert the returned reader to
   *lazyMultipartChunkReader (same package so unexported fields are
   accessible) and inspect the prepared chunks slice directly. This
   pins the entire [c0, c1, c2] sort order in one place, doesn't
   depend on KMS, and runs in zero fetch calls. The fetch closure
   now asserts it is never invoked during preparation.

All weed/s3api unit tests pass; integration suite (with KMS provider
configured via s3-config-template.json) passes.

* test(s3): switch range coverage cleanup to t.Cleanup; tighten KMS probe

Two CodeRabbit comments on PR #9228, both about
test/s3/sse/s3_sse_range_coverage_test.go:

1. CRITICAL: defer + t.Parallel() race in TestSSERangeReadIntegration

   The test creates one bucket up front, then runs subtests that call
   t.Parallel() at the mode and size levels (added in 058cbf27 to cut
   wall time). t.Parallel() pauses each subtest and yields back to the
   parent. The parent's for loop finishes scheduling, the function
   returns, and the deferred cleanupTestBucket fires -- BEFORE any
   parallel subtest body has executed. The bucket gets deleted out
   from under the parallel subtests, which then race the cleanup and
   either fail with NoSuchBucket or, depending on lazy-deletion
   behaviour on the server side, mask other regressions because
   chunks happen to still be readable for a brief window.

   The local matrix passing prior to this commit was a server-side
   coincidence; the t.Cleanup contract is the right one for parent
   tests with parallel children, and switching to it is a one-line
   change. t.Cleanup runs after the test AND all its (parallel)
   subtests complete, so the bucket survives until every leaf
   subtest is done.

2. MINOR: tighten the SSE-KMS probe-skip heuristic

   The previous broadening (058cbf27) treated `code == 400` as
   "KMS provider not configured", on the theory that some servers
   return 4xx for KMS misconfig. That is too aggressive: a real
   misconfiguration in the SSE-KMS test request itself (bad keyID
   format, missing header) ALSO surfaces as a 400, and would
   silently t.Skip the SSE-KMS subtree in CI -- which is exactly
   the integration coverage the new TestSSERangeReadIntegration is
   supposed to add. Drop the 400 branch (and the redundant 501
   match, since 501 >= 500 already covers it). Genuine
   "KMS.NotConfigured" / "NotImplemented" responses are still
   recognised via the string-match block immediately below,
   regardless of status code, so the friendly skip message survives
   for the cases where it actually applies.

Verified locally against `weed mini` with s3-config-template.json:

  - go test ./weed/s3api/                     PASS
  - TestSSERangeReadIntegration -v            113 PASS lines, 0 SKIP
  - TestSSEMultipartUploadIntegration etc.    PASS
2026-04-26 16:31:42 -07:00
Chris Lu
525900dfe4 fix(s3api): backfill multipart SSE-S3 metadata at completion (#9224)
* fix(s3api): backfill missing per-chunk SSE-S3 metadata at completion

When a part of an SSE-S3 multipart upload lands with SseType=NONE on
its chunks (e.g. a transient failure to apply SSE-S3 setup in
PutObjectPart), the completed object inherits NONE-tagged chunks and
detectPrimarySSEType then misses the chunked SSE-S3 encryption. The
read path falls through to the unencrypted serve and GET returns
ciphertext, producing the SHA mismatch reported in #8908.

Recover at completion using the base IV and key data the upload
directory recorded at CreateMultipartUpload:

  - extractMultipartSSES3Info validates upload-entry metadata up
    front and hard-fails completion if the base IV or key data are
    malformed; serializing chunk metadata we then could not decrypt
    is worse than rejecting the upload.
  - completedMultipartChunk re-derives a per-chunk IV from baseIV +
    chunk.Offset (matching what putToFiler would have written) and
    serializes per-chunk SSE-S3 metadata when the chunk has no tag.
    Existing per-chunk metadata is left alone; we cannot recover an
    already-derived IV from the upload-entry alone.

The IV formula intentionally has no partNumber term: putToFiler
hardcodes partOffset=0 when it calls handleSSES3MultipartEncryption
for every part, so each chunk's encryption IV is
calculateIVWithOffset(baseIV, chunk.Offset_part_local).
PartOffsetMultiplier is defined in s3_constants but is not consumed
by the encryption path. Adopting (partNumber-1)*PartOffsetMultiplier
+ chunk.Offset would produce IVs that fail to decrypt the bytes on
disk - a stronger failure mode than the bug being fixed. Tests pin
this:

  - TestCompletedMultipartChunkBackfilledIVDecryptsActualCiphertext
    runs the round trip across the encryption boundary: encrypt
    parts with CreateSSES3EncryptedReaderWithBaseIV (the call
    putToFiler uses), drop chunk metadata to reproduce #8908,
    backfill, decrypt with backfilled IV, assert plaintext intact.
  - TestCompletedMultipartChunkRejectsPartNumberMultiplierFormula
    constructs the IV the partNumber formula would produce and
    shows it does not decrypt the actual ciphertext.

This commit covers the chunk-level recovery only. The companion
fix for the object-level Extended attributes (SeaweedFSSSES3Key /
X-Amz-Server-Side-Encryption) follows separately.

* fix(s3api): backfill canonical SSE-S3 attributes onto multipart object

The previous commit ensures every chunk of an SSE-S3 multipart upload
carries SseType=SSE_S3 with a per-chunk IV, so the multipart-direct
read path can decrypt. The completed object's Extended map can still
miss the canonical pair detectPrimarySSEType and IsSSES3EncryptedInternal
look at:

  - X-Amz-Server-Side-Encryption (the AmzServerSideEncryption header
    detectPrimarySSEType reads on inline / small-object reads)
  - x-seaweedfs-sse-s3-key (SeaweedFSSSES3Key, required by
    IsSSES3EncryptedInternal and by the read-path key lookup)

When a part of the upload was written by a path that did not set
those (the same #8908 race that produced the NONE chunks),
copySSEHeadersFromFirstPart finds nothing to copy and the final entry
ends up with only the multipart-init keys (SeaweedFSSSES3Encryption /
BaseIV / KeyData). The read path then mis-detects the object as
unencrypted.

applyMultipartSSES3HeadersFromUploadEntry writes the canonical pair
from the multipart-init metadata in all three completion paths
(versioned, suspended, non-versioned), only when the keys are missing
so a healthy first part still wins. extractMultipartSSES3Info already
ran in prepareMultipartCompletionState, so the data is reused without
re-decoding.

Tests: TestApplyMultipartSSES3HeadersFromUploadEntry covers backfill,
do-not-clobber, and nil-info no-op cases.

* fix(s3api): drop double IV adjustment in SSE-KMS chunk view decrypt

decryptSSEKMSChunkView was pre-adjusting the SSE-KMS chunk IV
(calculateIVWithOffset(baseIV, ChunkOffset)) and then handing the
adjusted IV to CreateSSEKMSDecryptedReader, which itself runs
calculateIVWithOffset(IV, ChunkOffset) on whatever it receives. The
offset was being applied twice for any chunk with a non-zero
ChunkOffset, corrupting the keystream for range reads that cross
multipart chunk boundaries.

Pass the raw SSE-KMS key (with base IV and the original ChunkOffset
field) into CreateSSEKMSDecryptedReader so the offset is applied
exactly once, and remove the now-dead intra-block skip that was
compensating for the double adjustment.

Add an anti-test inside TestSSEKMSDecryptChunkView_RequiresOffsetAdjustment
that decrypts the same ciphertext with a deliberately double-adjusted
IV and asserts the output is corrupted, so any regression that
re-introduces the double application fails the unit test.

* test(s3): cover multipart SSE across chunk-spanning parts and ranges

Adds an integration subtest "Multipart Parts Larger Than Internal
Chunks Across SSE Types" to TestSSEMultipartUploadIntegration that
exercises the end-to-end S3 path for the bugs fixed in this branch:

  - Two-part multipart upload with each part larger than the 8MB
    internal SeaweedFS chunk, so each part itself spans multiple
    underlying chunks.
  - Subtests for SSE-C, SSE-KMS, explicit SSE-S3, and bucket-default
    SSE-S3 - the four paths multipart parts can take through the SSE
    pipeline.
  - Each subtest does a full GET (verifying every byte and the
    response Content-Length / SSE response headers) plus a 129-byte
    range read straddling the 8MB internal chunk boundary, which is
    the path that produced the SSE-KMS double-IV corruption (fix in
    the previous commit) and the SSE-S3 chunk-tag loss (fix in the
    earlier commits).

Factored the request shape behind multipartSSEOptions /
uploadAndVerifyMultipartSSEObject so all four SSE flavors share the
same upload+verify code; only the SSE-specific input/output
configuration differs per subtest.

* test(s3): abort orphan multipart uploads on test failure

Address coderabbit nitpick on uploadAndVerifyMultipartSSEObject. The
helper used require.NoError after CreateMultipartUpload, UploadPart
and CompleteMultipartUpload, so a failure in any of those (or in the
later GET / range read on a still-incomplete upload) called t.Fatal
without aborting the in-flight MPU, leaving an orphan upload in the
bucket. Harmless in CI where the data dir is wiped on shutdown, but a
real annoyance when iterating locally and a textbook AWS S3 caveat in
production.

Register a t.Cleanup that calls AbortMultipartUpload unless a
"completed" flag was set right after a successful
CompleteMultipartUpload. Use context.Background for the abort call
since the parent ctx may already be cancelled at cleanup time, and
t.Logf the abort error rather than failing the test so the original
failure remains visible in the run output.
2026-04-25 23:06:37 -07:00
Chris Lu
e77f8ae204 fix(s3api): route STS GetFederationToken to STS handler (#9157) (#9167)
* fix(s3api): route STS GetFederationToken requests to STS handler (#9157)

The STS GetFederationToken handler was implemented but never reachable.
Three routing gaps sent requests to the S3/IAM path instead of STS:

- No explicit mux route for Action=GetFederationToken in the URL query
- iamMatcher did not exclude GetFederationToken, so authenticated POSTs
  with Action in the form body were matched and dispatched to IAM
- UnifiedPostHandler only dispatched AssumeRole* and GetCallerIdentity
  to STS, leaving GetFederationToken to fall through to DoActions and
  return NotImplemented

Add the missing route, the matcher exclusion, and the dispatch branch.

Also wire TestSTS, TestAssumeRoleWithWebIdentity, and TestServiceAccount
into the s3-iam-tests workflow as a new "sts" matrix entry. Before this
change, none of test/s3/iam/s3_sts_get_federation_token_test.go's four
test functions ran in CI, which is why this regression shipped.

* test(iam): make orphaned STS/service-account tests pass under auth-enabled CI

Follow-up to wiring STS tests into CI: fixes several pre-existing issues
that made the newly-included tests fail locally.

Server fixes:
- weed/s3api/s3api_sts.go: handleGetFederationToken no longer 500s when
  the caller is a legacy S3-config identity (not in the IAM user store).
  Previously any GetPoliciesForUser error short-circuited to InternalError,
  which hard-failed every SigV4 caller using keys from -s3.config.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now generates IDs
  in the sa:<parent>:<uuid> format required by
  credential.ValidateServiceAccountId. The old "sa-XXXXXXXX" format failed
  the persistence-layer regex and caused every CreateServiceAccount call
  to return 500 once a filer-backed credential store validated the ID.

Test helpers:
- test/s3/iam/s3_sts_assume_role_test.go: callSTSAPIWithSigV4 no longer
  sets req.Header["Host"]. aws-sdk-go v1 v4.Signer already signs Host from
  req.URL.Host, and a manual Host header made the signer emit host;host in
  SignedHeaders, producing SignatureDoesNotMatch. Updated missing_role_arn
  subtest to match the existing SeaweedFS behavior (user-context
  assumption).
- test/s3/iam/s3_service_account_test.go: callIAMAPI now SigV4-signs
  requests when STS_TEST_{ACCESS,SECRET}_KEY env vars are set. Unsigned
  IAM writes otherwise fall through to the STS fallback and return
  InvalidAction.

CI matrix:
- .github/workflows/s3-iam-tests.yml: skip
  TestServiceAccountLifecycle/use_service_account_credentials only. The
  rest of the service-account suite passes; that one subtest depends on a
  separate credential-reload issue where new ABIA keys briefly register
  into accessKeyIdent but aren't persisted to the filer, so they vanish
  on the next reload. Out of scope for the #9157 GetFederationToken fix.

* fix(credential): accept AWS IAM username chars in service-account IDs

Gemini review on #9167 pointed out that ServiceAccountIdPattern's
parent-user segment was more restrictive than an AWS IAM username:
`[A-Za-z0-9_-]` vs. IAM's `[\w+=,.@-]`. Realistic usernames with
`@`, `.`, `+`, `=`, or `,` (e.g. email-style principals) would fail
validation at the filer store even though the embedded IAM API
happily created them.

Broaden the regex to `[A-Za-z0-9_+=,.@-]` (matching the AWS IAM spec
at https://docs.aws.amazon.com/IAM/latest/APIReference/API_User.html)
and add a table-driven test that locks the expansion in.

* address PR review feedback on #9167

All five review items were valid; changes keyed to review bullets:

- weed/s3api/s3api_sts.go: handleGetFederationToken no longer swallows
  arbitrary policy-lookup failures. Only credential.ErrUserNotFound is
  treated leniently (the legacy-config SigV4 path); any other error now
  returns InternalError so we don't mint tokens with an incomplete
  policy set.
- weed/credential/grpc/grpc_identity.go: GetUser translates gRPC
  NotFound back to credential.ErrUserNotFound so errors.Is(...) above
  matches for gRPC-backed stores, not just memory/filer-direct.
- weed/s3api/s3api_embedded_iam.go: CreateServiceAccount now validates
  the generated saId against credential.ValidateServiceAccountId before
  returning. Surfaces a client 400 with the offending ID instead of the
  opaque 500 that used to bubble up from the persistence layer.
- weed/s3api/s3api_server_routing_test.go: seed a routing-test identity
  with a known AK/SK, sign TestRouting_GetFederationTokenAuthenticatedBody
  with aws-sdk-go v4.Signer so the request actually passes
  AuthSignatureOnly. Assert 503 ServiceUnavailable (from STSHandlers
  with no stsService) instead of just NotEqual(501) — 503 proves the
  dispatch reached STSHandlers.HandleSTSRequest.
- test/s3/iam/s3_service_account_test.go: callIAMAPI signs with
  service="iam" instead of "s3" (SeaweedFS verifies against whichever
  service the client signed with, but "iam" is semantically correct).
- weed/credential/validation_test.go: add positive rows for an
  uppercase parent (sa:ALICE:...) and a canonical hyphenated UUID
  suffix (sa:alice:123e4567-e89b-12d3-a456-426614174000).
2026-04-20 19:33:22 -07:00
Chris Lu
d0a09ea178 fix(s3): honor ChecksumAlgorithm on presigned URL uploads (#9076)
* fix(s3): honor ChecksumAlgorithm on presigned URL uploads

AWS SDK presigners hoist x-amz-sdk-checksum-algorithm (and related
checksum headers) into the signed URL's query string, so servers must
read either location. detectRequestedChecksumAlgorithm only looked at
request headers, so presigned PUTs with ChecksumAlgorithm set validated
and stored no additional checksum, and HEAD/GET never returned the
x-amz-checksum-* header.

Read these parameters from headers first, then fall back to a
case-insensitive query-string lookup. Apply the same fallback when
comparing an object-level checksum value against the computed one.

Fixes #9075

* test(s3): presigned URL checksum integration tests (#9075)

Adds test/s3/checksum with end-to-end coverage for flexible-checksum
behavior on presigned URL uploads. Tests generate a presigned PUT URL
with ChecksumAlgorithm set, upload the body with a plain http.Client
(bypassing AWS SDK middleware so the server must honor the query-string
hoisted x-amz-sdk-checksum-algorithm), then HEAD/GET with
ChecksumMode=ENABLED and assert the stored x-amz-checksum-* header.

Covers SHA256, SHA1, and a negative control with no checksum requested.
Wires the new directory into s3-go-tests.yml as its own CI job.

* perf(s3): parse presigned query once in detectRequestedChecksumAlgorithm

Previously, each header fallback called getHeaderOrQuery, which re-parsed
r.URL.Query() and allocated a new map on every invocation — up to eight
times per PutObject request. Parse the raw query at most once per request
(only when non-empty) and pass the pre-parsed url.Values into a new
lookupHeaderOrQuery helper.

Also drops a redundant strings.ToLower allocation in the case-insensitive
query key scan (strings.EqualFold already handles ASCII case folding).

Addresses review feedback from gemini-code-assist on PR #9076.

* test(s3): honor credential env vars and add presigned upload timeout

- init() now reads S3_ACCESS_KEY/S3_SECRET_KEY (and AWS_ACCESS_KEY_ID /
  AWS_SECRET_ACCESS_KEY / AWS_REGION fallbacks) so that
  `make test-with-server ACCESS_KEY=... SECRET_KEY=...` no longer
  authenticates with hardcoded defaults while the server has been
  started with different credentials.
- uploadViaPresignedURL uses a dedicated http.Client with a 30s timeout
  instead of http.DefaultClient, so a stalled server fails fast in CI
  instead of blocking until the suite's global timeout fires.

Addresses review feedback from coderabbitai on PR #9076.

* test(s3): pass S3_PORT and credentials through to checksum tests

- 'make test' now exports S3_ENDPOINT, S3_ACCESS_KEY, and S3_SECRET_KEY
  derived from the Makefile variables so the Go test process talks to
  the same endpoint/credentials that start-server was launched with.
- start-server cleans up the background SeaweedFS process and PID file
  when the readiness poll times out, preventing stale port conflicts on
  subsequent runs.

Addresses review feedback from coderabbitai on PR #9076.

* ci(s3): raise checksum tests step timeout

make test-with-server builds weed_binary, waits up to 90s for readiness,
then runs go test -timeout=10m. The previous 12-minute step timeout only
had ~2 minutes of headroom over the Go timeout, risking the Actions
runner killing the step before tests reported a real failure.

Bumps the job timeout from 15 to 20 minutes and the step timeout from
12 to 16 minutes, matching other S3 integration jobs.

Addresses review feedback from coderabbitai on PR #9076.

* perf(s3): thread pre-parsed query through putToFiler hot path

Parse the request's query string once at the top of putToFiler and
reuse the resulting url.Values for both the checksum-algorithm detection
and the expected-checksum verification. Previously, the verification
path called getHeaderOrQuery which re-parsed r.URL.Query() again on
every PutObject, defeating the previous commit's single-parse goal.

- Add parseRequestQuery + detectRequestedChecksumAlgorithmQ (the
  pre-parsed-query variant). detectRequestedChecksumAlgorithm is now a
  thin wrapper used by callers that do a single lookup.
- putToFiler parses once and threads the result through both call sites.
- Remove getHeaderOrQuery and update the unit test to use
  lookupHeaderOrQuery directly.

Addresses follow-up review from gemini-code-assist on PR #9076.

* test(s3): check io.ReadAll error in uploadViaPresignedURL helper

* test(s3): drop SHA1 presigned test case

The AWS SDK v2 presigner signs a Content-MD5 header at presign time
for SHA1 PutObject requests even when no body is attached (the MD5 of
the empty payload gets baked into the signed headers). Uploading the
real body via a plain http.Client then trips SeaweedFS's MD5 validation
and returns BadDigest — an SDK/presigner quirk, not a SeaweedFS bug.

The SHA256 positive case already exercises the server-side
query-hoisted algorithm path that issue #9075 is about, and the unit
tests in weed/s3api cover each algorithm's header mapping. Drop the
SHA1 integration case rather than chase SDK-specific workarounds.

* test(s3): provide real Content-MD5 to presigned checksum test

AWS SDK v2's flexible-checksum middleware signs a Content-MD5 header at
presign time. There is no body to hash at that point, so it seeds the
header with MD5 of the empty payload. When the real body is then PUT
with a plain http.Client, SeaweedFS's server-side Content-MD5
verification correctly rejects the upload with BadDigest.

Pre-compute the MD5 of the test body and thread it into
PutObjectInput.ContentMD5 so the signed Content-MD5 matches the body
that will actually be uploaded. The test still exercises the
server-side path that reads X-Amz-Sdk-Checksum-Algorithm from the
query string (the fix that PR #9076 is validating).

* test(s3): send the signed Content-MD5 header on presigned upload

uploadViaPresignedURL now accepts an extraHeaders map so callers can
thread through headers that the presigner signed but the raw http
request would otherwise omit. The SHA256 test passes the Content-MD5
it computed, matching what the presigner baked into the signature.

Fixes SignatureDoesNotMatch seen in CI after the previous commit set
ContentMD5 on the presign input without sending the corresponding
header on the actual upload.

* test(s3): build presigned URL with the raw v4 signer

The AWS SDK v2 s3.PresignClient runs the flexible-checksum middleware
for any PutObject input that carries ChecksumAlgorithm. That middleware
injects a Content-MD5 header at presign time, and with no body present
it seeds MD5-of-empty. Any subsequent upload of a non-empty body
through a plain http.Client then trips SeaweedFS's Content-MD5
verification and returns BadDigest — not the code path that issue
#9075 is about.

Replace the PresignClient usage in the integration test with a direct
call to v4.Signer.PresignHTTP, building a canonical URL whose query
string already contains x-amz-sdk-checksum-algorithm=SHA256. This is
exactly the shape of URL a browser/curl client would receive from any
presigner that hoists the algorithm header, and it exercises the
server-side fix from PR #9076 without dragging in SDK-specific
middleware quirks.

* test(s3): set X-Amz-Expires on presigned URL before signing

v4.Signer.PresignHTTP does not add X-Amz-Expires on its own — the
caller has to seed it into the request's query string so the signer
includes it in the canonical query and the server accepts the
presigned URL. Without it, SeaweedFS correctly returns
AuthorizationQueryParametersError.

Also adds a .gitignore for the make-managed test volume data, log
file, and PID file so local `make test-with-server` runs do not leave
artifacts tracked by git.

Verified by running the integration tests locally:
  make test-with-server → both presigned checksum tests PASS.
2026-04-14 21:52:49 -07:00
Chris Lu
10e7f0f2bc fix(shell): s3.user.provision handles existing users by attaching policy (#9040)
* fix(shell): s3.user.provision handles existing users by attaching policy

Instead of erroring when the user already exists, the command now
creates the policy and attaches it to the existing user via UpdateUser.
Credentials are only generated and displayed for newly created users.

* fix(shell): skip duplicate policy attachment in s3.user.provision

Check if the policy is already attached before appending and calling
UpdateUser, making repeated runs idempotent.

* fix(shell): generate service account ID in s3.serviceaccount.create

The command built a ServiceAccount proto without setting Id, which was
rejected by credential.ValidateServiceAccountId on any real store. Now
generates sa:<parent>:<uuid> matching the format used by the admin UI.

* test(s3): integration tests for s3.* shell commands

Adds TestShell* integration tests covering ~40 previously untested
shell commands: user, accesskey, group, serviceaccount, anonymous,
bucket, policy.attach/detach, config.show, and iam.export/import.

Switches the test cluster's credential store from memory to filer_etc
because the memory store silently drops groups and service accounts
in LoadConfiguration/SaveConfiguration.

* fix(shell): rollback policy on key generation failure in s3.user.provision

If iam.GenerateRandomString or iam.GenerateSecretAccessKey fails after
the policy was persisted, the policy would be left orphaned. Extracts
the rollback logic into a local closure and invokes it on all failure
paths after policy creation for consistency.

* address PR review feedback for s3 shell tests and serviceaccount

- s3.serviceaccount.create: use 16 bytes of randomness (hex-encoded) for
  the service account UUID instead of 4 bytes to eliminate collision risk
- s3.serviceaccount.create: print the actual ID and drop the outdated
  "server-assigned" note (the ID is now client-generated)
- tests: guard createdAK in accesskey rotate/delete subtests so sibling
  failures don't run invalid CLI calls
- tests: requireContains/requireNotContains use t.Fatalf to fail fast
- tests: Provision subtest asserts the "Attached policy" message on the
  second provision call for an existing user
- tests: update extractServiceAccountID comment example to match the
  sa:<parent>:<uuid> format
- tests: drop redundant saID empty-check (extractServiceAccountID fatals)

* test(s3): use t.Fatalf for precondition check in serviceaccount test
2026-04-11 22:30:51 -07:00
Chris Lu
e648c76bcf go fmt 2026-04-10 17:31:14 -07:00
Chris Lu
e21d7602c3 feat(iam): implement group inline policy actions (#8992)
* feat(iam): implement group inline policy actions

Add PutGroupPolicy, GetGroupPolicy, DeleteGroupPolicy, and
ListGroupPolicies to both embedded and standalone IAM servers.

The standalone IAM stores group inline policies in a new
GroupInlinePolicies field in the Policies JSON, mirroring the
existing user inline policy pattern. DeleteGroup now also checks
for inline policies before allowing deletion.

* fix: address review feedback for group inline policies

- Embedded IAM: return NotImplemented for group inline policies
  instead of silently succeeding as no-ops (Gemini + CodeRabbit)
- Standalone IAM: recompute member actions after PutGroupPolicy
  and DeleteGroupPolicy (Gemini)
- Add parameter validation for GroupName/PolicyName/PolicyDocument
  on PutGroupPolicy, DeleteGroupPolicy, ListGroupPolicies (Gemini)
- Add UserName validation for ListUserPolicies in standalone IAM
- Call cleanupGroupInlinePolicies from DeleteGroup (Gemini)
- Migrate GroupInlinePolicies on group rename in UpdateGroup (CodeRabbit)
- Fix integration test cleanup order (CodeRabbit)

* fix: persist recomputed actions and improve error handling

- Set changed=true for PutGroupPolicy/DeleteGroupPolicy in standalone
  IAM DoActions so recomputed member actions are persisted (Gemini critical)
- Make cleanupGroupInlinePolicies accept policies parameter to avoid
  redundant I/O, return error (Gemini)
- Make migrateGroupInlinePolicies return error, handle in caller (Gemini)

* fix: include group policies in action recomputation

Extend computeAllActionsForUser to also aggregate group inline
policies and group managed policies when s3cfg is provided.
Previously, group inline policies were stored but never reflected
in member Identity.Actions. (CodeRabbit critical)

* perf: use identity index in recomputeActionsForGroupMembers for O(N+M)

* fix: skip group inline policy integration test on embedded IAM

The embedded IAM returns NotImplemented for group inline policies.
Skip TestIAMGroupInlinePolicy when running against embedded mode
to avoid CI failures in the group integration test matrix.
2026-04-08 15:57:04 -07:00
Chris Lu
45ee2ab4b9 feat(iam): implement ListUserPolicies API action (#8991)
* feat(iam): implement ListUserPolicies API action (#8987)

Add ListUserPolicies support to both embedded and standalone IAM servers,
resolving the NotImplemented error when calling `aws iam list-user-policies`.

* fix: address review feedback for ListUserPolicies

- Add handleImplicitUsername for ListUserPolicies in both IAM servers
  so omitting UserName defaults to the calling user (Gemini review)
- Assert synthetic policy name in unit test (CodeRabbit)
- Use require.True for error type assertion in integration test (CodeRabbit)
2026-04-08 12:27:03 -07:00
Chris Lu
fbe758efa8 test: consolidate port allocation into shared test/testutil package (#8982)
* test: consolidate port allocation into shared test/testutil package

Move duplicated port allocation logic from 15+ test files into a single
shared package at test/testutil/. This fixes a port collision bug where
independently allocated ports could overlap via the gRPC offset
(port+10000), causing weed mini to reject the configuration.

The shared package provides:
- AllocatePorts: atomic allocation of N unique ports
- AllocateMiniPorts/MustFreeMiniPorts: gRPC-offset-aware allocation
  that prevents port A+10000 == port B collisions
- WaitForPort, WaitForService, FindBindIP, WriteIAMConfig, HasDocker

* test: address review feedback and fix FUSE build

- Revert fuse_integration change: it has its own go.mod and cannot
  import the shared testutil package
- AllocateMiniPorts: hold all listeners open until the entire batch is
  allocated, preventing race conditions where other processes steal ports
- HasDocker: add 5s context timeout to avoid hanging on stalled Docker
- WaitForService: only treat 2xx HTTP status codes as ready

* test: use global rand in AllocateMiniPorts for better seeding

Go 1.20+ auto-seeds the global rand generator. Using it avoids
identical sequences when multiple tests call at the same nanosecond.

* test: revert WaitForService status code check

S3 endpoints return non-2xx (e.g. 403) on bare GET requests, so
requiring 2xx caused the S3 integration test to time out. Any HTTP
response is sufficient proof that the service is running.

* test: fix gofmt formatting in s3tables test files
2026-04-08 11:30:02 -07:00
Chris Lu
a4753b6a3b S3: delay empty folder cleanup to prevent Spark write failures (#8970)
* S3: delay empty folder cleanup to prevent Spark write failures (#8963)

Empty folders were being cleaned up within seconds, causing Apache Spark
(s3a) writes to fail when temporary directories like _temporary/0/task_xxx/
were briefly empty.

- Increase default cleanup delay from 5s to 2 minutes
- Only process queue items that have individually aged past the delay
  (previously the entire queue was drained once any item triggered)
- Make the delay configurable via filer.toml:
  [filer.options]
  s3.empty_folder_cleanup_delay = "2m"

* test: increase cleanup wait timeout to match 2m delay

The empty folder cleanup delay was increased to 2 minutes, so the
Spark integration test needs to wait longer for temporary directories
to disappear.

* fix: eagerly clean parent directories after empty folder deletion

After deleting an empty folder, immediately try to clean its parent
rather than relying on cascading metadata events that each re-enter
the 2-minute delay queue. This prevents multi-minute waits when
cleaning nested temporary directory trees (e.g. Spark's _temporary
hierarchy with 3+ levels would take 6m+ vs near-instant).

Fixes the CI failure where lingering _temporary parent directories
were not cleaned within the test's 3-minute timeout.
2026-04-07 13:20:59 -07:00
Chris Lu
8fad85aed7 feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK (#8904)
* feat(s3): support WEED_S3_SSE_KEY env var for SSE-S3 KEK

Add support for providing the SSE-S3 Key Encryption Key (KEK) via the
WEED_S3_SSE_KEY environment variable (hex-encoded 256-bit key). This
avoids storing the master key in plaintext on the filer at /etc/s3/sse_kek.

Key source priority:
1. WEED_S3_SSE_KEY environment variable (recommended)
2. Existing filer KEK at /etc/s3/sse_kek (backward compatible)
3. Auto-generate and save to filer (deprecated for new deployments)

Existing deployments with a filer-stored KEK continue to work unchanged.
A deprecation warning is logged when auto-generating a new filer KEK.

* refactor(s3): derive KEK from any string via HKDF instead of requiring hex

Accept any secret string in WEED_S3_SSE_KEY and derive a 256-bit key
using HKDF-SHA256 instead of requiring a hex-encoded key. This is
simpler for users — no need to generate hex, just set a passphrase.

* feat(s3): add WEED_S3_SSE_KEK and WEED_S3_SSE_KEY env vars for KEK

Two env vars for providing the SSE-S3 Key Encryption Key:

- WEED_S3_SSE_KEK: hex-encoded, same format as /etc/s3/sse_kek.
  If the filer file also exists, they must match.
- WEED_S3_SSE_KEY: any string, 256-bit key derived via HKDF-SHA256.
  Refuses to start if /etc/s3/sse_kek exists (must delete first).

Only one may be set. Existing filer-stored KEKs continue to work.
Auto-generating and storing new KEKs on filer is deprecated.

* fix(s3): stop auto-generating KEK, fail only when SSE-S3 is used

Instead of auto-generating a KEK and storing it on the filer when no
key source is configured, simply leave SSE-S3 disabled. Encrypt and
decrypt operations return a clear error directing the user to set
WEED_S3_SSE_KEK or WEED_S3_SSE_KEY.

* refactor(s3): move SSE-S3 KEK config to security.toml

Move KEK configuration from standalone env vars to security.toml's new
[sse_s3] section, following the same pattern as JWT keys and TLS certs.

  [sse_s3]
  kek = ""   # hex-encoded 256-bit key (same format as /etc/s3/sse_kek)
  key = ""   # any string, HKDF-derived

Viper's WEED_ prefix auto-mapping provides env var support:
WEED_SSE_S3_KEK and WEED_SSE_S3_KEY.

All existing behavior is preserved: filer KEK fallback, mismatch
detection, and HKDF derivation.

* refactor(s3): rename SSE-S3 config keys to s3.sse.kek / s3.sse.key

Use [s3.sse] section in security.toml, matching the existing naming
convention (e.g. [s3.*]). Env vars: WEED_S3_SSE_KEK, WEED_S3_SSE_KEY.

* fix(s3): address code review findings for SSE-S3 KEK

- Don't hold mutex during filer retry loop (up to 20s of sleep).
  Lock only to write filerClient and superKey.
- Remove dead generateAndSaveSuperKeyToFiler and unused constants.
- Return error from deriveKeyFromSecret instead of ignoring it.
- Fix outdated doc comment on InitializeWithFiler.
- Use t.Setenv in tests instead of manual os.Setenv/Unsetenv.

* fix(s3): don't block startup on filer errors when KEK is configured

- When s3.sse.kek is set, a temporarily unreachable filer no longer
  prevents startup. The filer consistency check becomes best-effort
  with a warning.
- Same treatment for s3.sse.key: filer unreachable logs a warning
  instead of failing.
- Rewrite error messages to suggest migration instead of file deletion,
  avoiding the risk of orphaning encrypted data.

Finding 3 (restore auto-generation) intentionally skipped — auto-gen
was removed by design to avoid storing plaintext KEK on filer.

* fix(test): set WEED_S3_SSE_KEY in SSE integration test server startup

SSE-S3 no longer auto-generates a KEK, so integration tests must
provide one. Set WEED_S3_SSE_KEY=test-sse-s3-key in all weed mini
invocations in the test Makefile.
2026-04-03 13:01:21 -07:00
Chris Lu
059bee683f feat(s3): add STS GetFederationToken support (#8891)
* feat(s3): add STS GetFederationToken support

Implement the AWS STS GetFederationToken API, which allows long-term IAM
users to obtain temporary credentials scoped down by an optional inline
session policy. This is useful for server-side applications that mint
per-user temporary credentials.

Key behaviors:
- Requires SigV4 authentication from a long-term IAM user
- Rejects calls from temporary credentials (session tokens)
- Name parameter (2-64 chars) identifies the federated user
- DurationSeconds supports 900-129600 (15 min to 36 hours, default 12h)
- Optional inline session policy for permission scoping
- Caller's attached policies are embedded in the JWT token
- Returns federated user ARN: arn:aws:sts::<account>:federated-user/<Name>

No performance impact on the S3 hot path — credential vending is a
separate control-plane operation, and all policy data is embedded in
the stateless JWT token.

* fix(s3): address GetFederationToken PR review feedback

- Fix Name validation: max 32 chars (not 64) per AWS spec, add regex
  validation for [\w+=,.@-]+ character whitelist
- Refactor parseDurationSeconds into parseDurationSecondsWithBounds to
  eliminate duplicated duration parsing logic
- Add sts:GetFederationToken permission check via VerifyActionPermission
  mirroring the AssumeRole authorization pattern
- Change GetPoliciesForUser to return ([]string, error) so callers fail
  closed on policy-resolution failures instead of silently returning nil
- Move temporary-credentials rejection before SigV4 verification for
  early rejection and proper test coverage
- Update tests: verify specific error message for temp cred rejection,
  add regex validation test cases (spaces, slashes rejected)

* refactor(s3): use sts.Action* constants instead of hard-coded strings

Replace hard-coded "sts:AssumeRole" and "sts:GetFederationToken" strings
in VerifyActionPermission calls with sts.ActionAssumeRole and
sts.ActionGetFederationToken package constants.

* fix(s3): pass through sts: prefix in action resolver and merge policies

Two fixes:

1. mapBaseActionToS3Format now passes through "sts:" prefix alongside
   "s3:" and "iam:", preventing sts:GetFederationToken from being
   rewritten to s3:sts:GetFederationToken in VerifyActionPermission.
   This also fixes the existing sts:AssumeRole permission checks.

2. GetFederationToken policy embedding now merges identity.PolicyNames
   (from SigV4 identity) with policies from the IAM manager (which may
   include group-attached policies), deduplicated via a map. Previously
   the IAM manager lookup was skipped when identity.PolicyNames was
   non-empty, causing group policies to be omitted from the token.

* test(s3): add integration tests for sts: action passthrough and policy merge

Action resolver tests:
- TestMapBaseActionToS3Format_ServicePrefixPassthrough: verifies s3:, iam:,
  and sts: prefixed actions pass through unchanged while coarse actions
  (Read, Write) are mapped to S3 format
- TestResolveS3Action_STSActionsPassthrough: verifies sts:AssumeRole,
  sts:GetFederationToken, sts:GetCallerIdentity pass through ResolveS3Action
  unchanged with both nil and real HTTP requests

Policy merge tests:
- TestGetFederationToken_GetPoliciesForUser: tests IAMManager.GetPoliciesForUser
  with no user store (error), missing user, user with policies, user without
- TestGetFederationToken_PolicyMergeAndDedup: tests that identity.PolicyNames
  and IAM-manager-resolved policies are merged and deduplicated (SharedPolicy
  appears in both sources, result has 3 unique policies)
- TestGetFederationToken_PolicyMergeNoManager: tests that when IAM manager is
  unavailable, identity.PolicyNames alone are embedded

* test(s3): add end-to-end integration tests for GetFederationToken

Add integration tests that call GetFederationToken using real AWS SigV4
signed HTTP requests against a running SeaweedFS instance, following the
existing pattern in test/s3/iam/s3_sts_assume_role_test.go.

Tests:
- TestSTSGetFederationTokenValidation: missing name, name too short/long,
  invalid characters, duration too short/long, malformed policy, anonymous
  rejection (7 subtests)
- TestSTSGetFederationTokenRejectTemporaryCredentials: obtains temp creds
  via AssumeRole then verifies GetFederationToken rejects them
- TestSTSGetFederationTokenSuccess: basic success, custom 1h duration,
  36h max duration with expiration time verification
- TestSTSGetFederationTokenWithSessionPolicy: creates a bucket, obtains
  federated creds with GetObject-only session policy, verifies GetObject
  succeeds and PutObject is denied using the AWS SDK S3 client
2026-04-02 17:37:05 -07:00
Chris Lu
a4b896a224 fix(s3): skip directories before marker in ListObjectVersions pagination (#8890)
* fix(s3): skip directories before marker in ListObjectVersions pagination

ListObjectVersions was re-traversing the entire directory tree from the
beginning on every paginated request, only skipping entries at the leaf
level. For buckets with millions of objects in deep hierarchies, this
caused exponentially slower responses as pagination progressed.

Two optimizations:
1. Use keyMarker to compute a startFrom position at each directory level,
   skipping directly to the relevant entry instead of scanning from the
   beginning (mirroring how ListObjects uses marker descent).
2. Skip recursing into subdirectories whose keys are entirely before the
   keyMarker.

Changes per-page cost from O(entries_before_marker) to O(tree_depth).

* test(s3): add integration test for deep-hierarchy version listing pagination

Adds TestVersioningPaginationDeepDirectoryHierarchy which creates objects
across 20 subdirectories at depth 6 (mimicking Veeam 365 backup layout)
and paginates through them with small maxKeys. Verifies correctness
(no duplicates, sorted order, all objects found) and checks that later
pages don't take dramatically longer than earlier ones — the symptom
of the pre-fix re-traversal bug. Also tests delimiter+pagination
interaction across subdirectories.

* test(s3): strengthen deep-hierarchy pagination assertions

- Replace timing warning (t.Logf) with a failing assertion (t.Errorf)
  so pagination regressions actually fail the test.
- Replace generic count/uniqueness/sort checks on CommonPrefixes with
  exact equality against the expected prefix slice, catching wrong-but-
  sorted results.

* test(s3): use allKeys for exact assertion in deep-hierarchy pagination test

Wire the allKeys slice (previously unused dead code) into the version
listing assertion, replacing generic count/uniqueness/sort checks with
an exact equality comparison against the keys that were created.
2026-04-02 15:59:52 -07:00
Chris Lu
98714b9f70 fix(test): address flaky S3 distributed lock integration test (#8888)
* fix(test): address flaky S3 distributed lock integration test

Two root causes:

1. Lock ring convergence race: After waitForFilerCount(2) confirms the
   master sees both filers, there's a window where filer0's lock ring
   still only contains itself (master's LockRingUpdate broadcast is
   delayed by the 1s stabilization timer). During this window filer0
   considers itself primary for ALL keys, so both filers can
   independently grant the same lock.

   Fix: Add waitForLockRingConverged() that acquires the same lock
   through both filers and verifies mutual exclusion before proceeding.

2. Hash function mismatch: ownerForObjectLock used util.HashStringToLong
   (MD5 + modulo) to predict lock owners, but the production DLM uses
   CRC32 consistent hashing via HashRing. This meant the test could
   pick keys that route to the same filer, not exercising the
   cross-filer coordination it intended to test.

   Fix: Use lock_manager.NewHashRing + GetPrimary() to match production
   routing exactly.

* fix(test): verify lock denial reason in convergence check

Ensure the convergence check only returns true when the second lock
attempt is denied specifically because the lock is already owned,
avoiding false positives from transient errors.

* fix(test): check one key per primary filer in convergence wait

A single arbitrary key can false-pass: if its real primary is the filer
with the stale ring, mutual exclusion holds trivially because that filer
IS the correct primary. Generate one test key per distinct primary using
the same consistent-hash ring as production, so a stale ring on any
filer is caught deterministically.
2026-04-02 13:11:04 -07:00
Chris Lu
d5068b3ee6 test: harden weed mini readiness checks 2026-03-30 16:21:38 -07:00
Chris Lu
0884acd70c ci: add S3 mutation regression coverage (#8804)
* test/s3: stabilize distributed lock regression

* ci: add S3 mutation regression workflow

* test: fix delete regression readiness probe

* test: address mutation regression review comments
2026-03-28 20:17:20 -07:00
Chris Lu
0adb78bc6b s3api: make conditional mutations atomic and AWS-compatible (#8802)
* s3api: serialize conditional write finalization

* s3api: add conditional delete mutation checks

* s3api: enforce destination conditions for copy

* s3api: revalidate multipart completion under lock

* s3api: rollback failed put finalization hooks

* s3api: report delete-marker version deletions

* s3api: fix copy destination versioning edge cases

* s3api: make versioned multipart completion idempotent

* test/s3: cover conditional mutation regressions

* s3api: rollback failed copy version finalization

* s3api: resolve suspended delete conditions via latest entry

* s3api: remove copy test null-version injection

* s3api: reject out-of-order multipart completions

* s3api: preserve multipart replay version metadata

* s3api: surface copy destination existence errors

* s3api: simplify delete condition target resolution

* test/s3: make conditional delete assertions order independent

* test/s3: add distributed lock gateway integration

* s3api: fail closed multipart versioned completion

* s3api: harden copy metadata and overwrite paths

* s3api: create delete markers for suspended deletes

* s3api: allow duplicate multipart completion parts
2026-03-27 19:22:26 -07:00
Chris Lu
ba624f1f34 Rust volume server implementation with CI (#8539)
* Match Go gRPC client transport defaults

* Honor Go HTTP idle timeout

* Honor maintenanceMBps during volume copy

* Honor images.fix.orientation on uploads

* Honor cpuprofile when pprof is disabled

* Match Go memory status payloads

* Propagate request IDs across gRPC calls

* Format pending Rust source updates

* Match Go stats endpoint payloads

* Serve Go volume server UI assets

* Enforce Go HTTP whitelist guards

* Align Rust metrics admin-port test with Go behavior

* Format pending Rust server updates

* Honor access.ui without per-request JWT checks

* Honor keepLocalDatFile in tier upload shortcut

* Honor Go remote volume write mode

* Load tier backends from master config

* Check master config before loading volumes

* Remove vif files on volume destroy

* Delete remote tier data on volume destroy

* Honor vif version defaults and overrides

* Reject mismatched vif bytes offsets

* Load remote-only tiered volumes

* Report Go tail offsets in sync status

* Stream remote dat in incremental copy

* Honor collection vif for EC shard config

* Persist EC expireAtSec in vif metadata

* Stream remote volume reads through HTTP

* Serve HTTP ranges from backend source

* Match Go ReadAllNeedles scan order

* Match Go CopyFile zero-stop metadata

* Delete EC volumes with collection cleanup

* Drop deleted collection metrics

* Match Go tombstone ReadNeedleMeta

* Match Go TTL parsing: all-digit default to minutes, two-pass fit algorithm

* Match Go needle ID/cookie formatting and name size computation

* Match Go image ext checks: webp resize only, no crop; empty healthz body

* Match Go Prometheus metric names and add missing handler counter constants

* Match Go ReplicaPlacement short string parsing with zero-padding

* Add missing EC constants MAX_SHARD_COUNT and MIN_TOTAL_DISKS

* Add walk_ecx_stats for accurate EC volume file counts and size

* Match Go VolumeStatus dat file size, EC shard stats, and disk pct precision

* Match Go needle map: unconditional delete counter, fix redb idx walk offset

* Add CompactMapSegment overflow panic guard matching Go

* Match Go volume: vif creation, version from superblock, TTL expiry, dedup data_size, garbage_level fallback

* Match Go 304 Not Modified: return bare status with no headers

* Match Go JWT error message: use "wrong jwt" instead of detailed error

* Match Go read handler bare 400, delete error prefix, download throttle timeout

* Match Go pretty JSON 1-space indent and "Deletion Failed:" error prefix

* Match Go heartbeat: keep is_heartbeating on error, add EC shard identification

* Match Go needle ReadBytes V2: tolerate EOF on truncated body

* Match Go volume: cookie check on any existing needle, return DataSize, 128KB meta guard

* Match Go DeleteCollection: propagate destroy errors

* Match Go gRPC: BatchDelete no flag, IncrementalCopy error, FetchAndWrite concurrent, VolumeUnmount/DeleteCollection errors, tail draining, query error code

* Match Go Content-Disposition RFC 6266 formatting with RFC 2231 encoding

* Match Go Guard isWriteActive: combine whitelist and signing key check

* Match Go DeleteCollectionMetrics: use partial label matching

* Match Go heartbeat: send state-only delta on volume state changes

* Match Go ReadNeedleMeta paged I/O: read header+tail only, skip data; add EIO tracking

* Match Go ScrubVolume INDEX mode dispatch; add VolumeCopy preallocation and EC NeedleStatus TODOs

* Add read_ec_shard_needle for full needle reconstruction from local EC shards

* Make heartbeat master config helpers pub for VolumeCopy preallocation

* Match Go gRPC: VolumeCopy preallocation, EC NeedleStatus full read, error message wording

* Match Go HTTP responses: omitempty fields, 2-space JSON indent, JWT JSON error, delete pretty/JSONP, 304 Last-Modified, raw write error

* Match Go WriteNeedleBlob V3 timestamp patching, fix makeup_diff double padding, count==0 read handling

* Add rebuild_ecx_file for EC index reconstruction from data shards

* Match Go gRPC: tail header first-chunk-only, EC cleanup on failure, copy append mode, ecx rebuild, compact cancellation

* Add EC volume read and delete support in HTTP handlers

* Add per-shard EC mount/unmount, location predicate search, idx directory for EC

* Add CheckVolumeDataIntegrity on volume load matching Go

* Match Go gRPC: EC multi-disk placement, per-shard mount/unmount, no auto-mount on reconstruct, streaming ReadAll/EcShardRead, ReceiveFile cleanup, version check, proxy streaming, redirect Content-Type

* Match Go heartbeat metric accounting

* Match Go duplicate UUID heartbeat retries

* Delete expired EC volumes during heartbeat

* Match Go volume heartbeat pruning

* Honor master preallocate in volume max

* Report remote storage info in heartbeats

* Emit EC heartbeat deltas on shard changes

* Match Go throttle boundary: use <= instead of <, fix pretty JSON to 1-space

* Match Go write_needle_blob monotonic appendAtNs via get_append_at_ns

* Match Go VolumeUnmount: idempotent success when volume not found

* Match Go TTL Display: return empty string when unit is Empty

Go checks `t.Unit == Empty` separately and returns "" for TTLs
with nonzero count but Empty unit. Rust only checked is_empty()
(count==0 && unit==0), so count>0 with unit=0 would format as
"5 " instead of "".

* Match Go error behavior for truncated needle data in read_body_v2

Go's readNeedleDataVersion2 returns "index out of range %d" errors
(indices 1-7) when needle body or metadata fields are truncated.
Rust was silently tolerating truncation and returning Ok. Now returns
NeedleError::IndexOutOfRange with the matching index for each field.

* Match Go download throttle: return JSON error instead of plain text

* Match Go crop params: default x1/y1 to 0 when not provided

* Match Go ScrubEcVolume: accumulate total_files from EC shards

* Match Go ScrubVolume: count total_files even on scrub error

* Match Go VolumeEcShardsCopy: set ignore_source_file_not_found for .vif

* Match Go VolumeTailSender: send needle_header on every chunk

* Match Go read_super_block: apply replication override from .vif

* Match Go check_volume_data_integrity: verify all 10 entries, detect trailing corruption

* Match Go WriteNeedleBlob: dedup check before writing during replication

* handlers: use meta-only reads for HEAD

* handlers: align range parsing and responses with Go

* handlers: align upload parsing with Go

* deps: enable webp support

* Make 5bytes the default feature for idx entry compatibility

* Match Go TTL: preserve original unit when count fits in byte

* Fix EC locate_needle: use get_actual_size for full needle size

* Fix raw body POST: only parse multipart when Content-Type contains form-data

* Match Go ReceiveFile: return protocol errors in response body, not gRPC status

* add docs

* Match Go VolumeEcShardsCopy: append to .ecj file instead of truncating

* Match Go ParsePath: support _delta suffix on file IDs for sub-file addressing

* Match Go chunk manifest: add Accept-Ranges, Content-Disposition, filename fallback, MIME detection

* Match Go privateStoreHandler: use proper JSON error for unsupported methods

* Match Go Destroy: add only_empty parameter to reject non-empty volume deletion

* Fix compilation: set_read_only_persist and set_writable return ()

These methods fire-and-forget save_vif internally, so gRPC callers
should not try to chain .map_err() on the unit return type.

* Match Go SaveVolumeInfo: check writability and propagate errors in save_vif

* Match Go VolumeDelete: propagate only_empty to delete_volume for defense in depth

The gRPC VolumeDelete handler had a pre-check for only_empty but then
passed false to store.delete_volume(), bypassing the store-level check.
Go passes req.OnlyEmpty directly to DeleteVolume. Now Rust does the same
for defense in depth against TOCTOU races (though the store write lock
makes this unlikely).

* Match Go ProcessRangeRequest: return full content for empty/oversized ranges

Go returns nil from ProcessRangeRequest when ranges are empty or total
range size exceeds content length, causing the caller to serve the full
content as a normal 200 response. Rust was returning an empty 200 body.

* Match Go Query: quote JSON keys in output records

Go's ToJson produces valid JSON with quoted keys like {"name":"Alice"}.
Rust was producing invalid JSON with unquoted keys like {name:"Alice"}.

* Match Go VolumeCopy: reject when no suitable disk location exists

Go returns ErrVolumeNoSpaceLeft when no location matches the disk type
and has sufficient space. Rust had an unsafe fallback that silently
picked the first location regardless of type or available space.

* Match Go DeleteVolumeNeedle: check noWriteOrDelete before allowing delete

Go checks v.noWriteOrDelete before proceeding with needle deletion,
returning "volume is read only" if true. Rust was skipping this check.

* Match Go ReceiveFile: prefer HardDrive location for EC and use response-level write errors

Two fixes: (1) Go prefers HardDriveType disk location for EC volumes,
falling back to first location. Returns "no storage location available"
when no locations exist. (2) Write failures are now response-level
errors (in response body) instead of gRPC status errors, matching Go.

* Match Go CopyFile: sync EC volume journal to disk before copying

Go calls ecVolume.Sync() before copying EC volume files to ensure the
.ecj journal is flushed to disk. Added sync_to_disk() to EcVolume and
call it in the CopyFile EC branch.

* Match Go readSuperBlock: propagate replication parse errors

Go returns an error when parsing the replication string from the .vif
file fails. Rust was silently ignoring the parse failure and using the
super block's replication as-is.

* Match Go TTL expiry: remove append_at_ns > 0 guard

Go computes TTL expiry from AppendAtNs without guarding against zero.
When append_at_ns is 0, the expiry is epoch + TTL which is in the past,
correctly returning NotFound. Rust's extra guard skipped the check,
incorrectly returning success for such needles.

* Match Go delete_collection: skip volumes with compaction in progress

Go checks !v.isCompactionInProgress.Load() before destroying a volume
during collection deletion, skipping compacting volumes. Also changed
destroy errors to log instead of aborting the entire collection delete.

* Match Go MarkReadonly/MarkWritable: always notify master even on local error

Go always notifies the master regardless of whether the local
set_read_only_persist or set_writable step fails. The Rust code was
using `?` which short-circuited on error, skipping the final master
notification. Save the result and defer the `?` until after the
notify call.

* Match Go PostHandler: return 500 for all write errors

Go returns 500 (InternalServerError) for all write failures. Rust was
returning 404 for volume-not-found and 403 for read-only volumes.

* Match Go makeupDiff: validate .cpd compaction revision is old + 1

Go reads the new .cpd file's super block and verifies the compaction
revision is exactly old + 1. Rust only validated the old revision.

* Match Go VolumeStatus: check data backend before returning status

Go checks v.DataBackend != nil before building the status response,
returning an error if missing. Rust was silently returning size 0.

* Match Go PostHandler: always include mime field in upload response JSON

Go always serializes the mime field even when empty ("mime":""). Rust was
omitting it when empty due to Option<String> with skip_serializing_if.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go subtracts EC shard equivalents when computing available volume slots.
Rust was only comparing volume count, potentially over-counting free
slots on locations with many EC shards.

* Match Go privateStoreHandler: use INVALID as metrics label for unsupported methods

Go records the method as INVALID in metrics for unsupported HTTP methods.
Rust was using the actual method name.

* Match Go volume: add commit_compact guard and scrub data size validation

Two fixes: (1) commit_compact now checks/sets is_compacting flag to
prevent concurrent commits, matching Go's CompareAndSwap guard.
(2) scrub now validates total needle sizes against .dat file size.

* Match Go gRPC: fix TailSender error propagation, EcShardsInfo all slots, EcShardRead .ecx check

Three fixes: (1) VolumeTailSender now propagates binary search errors
instead of silently falling back to start. (2) VolumeEcShardsInfo
returns entries for all shard slots including unmounted. (3)
VolumeEcShardRead checks .ecx index for deletions instead of .ecj.

* Match Go metrics: add BuildInfo gauge and connection tracking functions

Go exposes a BuildInfo Prometheus metric with version labels, and tracks
open connections via stats.ConnectionOpen/Close. Added both to Rust.

* Match Go NeedleMap.Delete: use !is_deleted() instead of is_valid()

Go's CompactMap.Delete checks !IsDeleted() not IsValid(), so needles
with size==0 (live but anomalous) can still be deleted. The Rust code
was using is_valid() which returns false for size==0, preventing
deletion of such needles.

* Match Go fitTtlCount: always normalize TTL to coarsest unit

Go's fitTtlCount always converts to seconds first, then finds the
coarsest unit that fits in one byte (e.g., 120m → 2h). Rust had an
early return for count<=255 that skipped normalization, producing
different binary encodings for the same duration.

* Match Go BuildInfo metric: correct name and add missing labels

Go uses SeaweedFS_build_info (Namespace=SeaweedFS, Subsystem=build,
Name=info) with labels [version, commit, sizelimit, goos, goarch].
Rust had SeaweedFS_volumeServer_buildInfo with only [version].

* Match Go HTTP handlers: fix UploadResult fields, DiskStatus JSON, chunk manifest ETag

- UploadResult.mime: add skip_serializing_if to omit empty MIME (Go uses omitempty)
- UploadResult.contentMd5: only include when request provided Content-MD5 header
- Content-MD5 response header: only set when request provided it
- DiskStatuses: use camelCase field names (percentFree, percentUsed, diskType)
  to match Go's protobuf JSON marshaling
- Chunk manifest: preserve needle ETag in expanded response headers

* Match Go volume: fix version(), integrity check, scrub, and commit_compact

- version(): use self.version() instead of self.super_block.version in
  read_all_needles, check_volume_data_integrity, scan_raw_needles_from
  to respect volumeInfo.version override
- check_volume_data_integrity: initialize healthy_index_size to idx_size
  (matching Go) and continue on EOF instead of returning error
- scrub(): count deleted needles in total_read since they still occupy
  space in the .dat file (matches Go's totalRead += actualSize for deleted)
- commit_compact: clean up .cpd/.cpx files on makeup_diff failure
  (matches Go's error path cleanup)

* Match Go write queue: add 4MB batch byte limit

Go's startWorker breaks the batch at either 128 requests or 4MB of
accumulated write data. Rust only had the 128-request limit, allowing
large writes to accumulate unbounded latency.

* Add TTL normalization tests for Go parity verification

Test that fit_ttl_count normalizes 120m→2h, 24h→1d, 7d→1w even
when count fits in a byte, matching Go's fitTtlCount behavior.

* Match Go FindFreeLocation: account for EC shards in free slot calculation

Go's free volume count subtracts both regular volumes and EC volumes
from max_volume_count. Rust was only counting regular volumes, which
could over-report available slots when EC shards are mounted.

* Match Go EC volume: mark deletions in .ecx and replay .ecj at startup

Go's DeleteNeedleFromEcx marks needles as deleted in the .ecx index
in-place (writing TOMBSTONE_FILE_SIZE at the size field) in addition
to appending to the .ecj journal. Go's RebuildEcxFile replays .ecj
entries into .ecx on startup, then removes the .ecj file.

Rust was only appending to .ecj without marking .ecx, which meant
deleted EC needles remained readable via .ecx binary search. This
fix:
- Opens .ecx in read/write mode (was read-only)
- Adds mark_needle_deleted_in_ecx: binary search + in-place write
- Calls it from journal_delete before appending to .ecj
- Adds rebuild_ecx_from_journal: replays .ecj into .ecx on startup

* Match Go check_all_ec_shards_deleted: use MAX_SHARD_COUNT instead of hardcoded 14

Go's TotalShardsCount is DataShardsCount + ParityShardsCount = 14 by
default, but custom EC configs via .vif can have more shards (up to
MaxShardCount = 32). Using MAX_SHARD_COUNT ensures all shard files
are checked regardless of EC configuration.

* Match Go EC locate: subtract 1 from shard size and use datFileSize override

Go's LocateEcShardNeedleInterval passes shard.ecdFileSize-1 to
LocateData (shards are padded, -1 avoids overcounting large block
rows). When datFileSize is known, Go uses datFileSize/DataShards
instead. Rust was passing the raw shard file size without adjustment.

* Fix TTL parsing and DiskStatus field names to match Go exactly

TTL::read: Go's ReadTTL preserves the original unit (7d stays 7d,
not 1w) and errors on count > 255. The previous normalization change
was incorrect — Go only normalizes internally via fitTtlCount, not
during string parsing.

DiskStatus: Go uses encoding/json on protobuf structs, which reads
the json struct tags (snake_case: percent_free, percent_used,
disk_type), not the protobuf JSON names (camelCase). Revert to
snake_case to match Go's actual output.

* Fix heartbeat: check leader != current master before redirect, process duplicated UUIDs first

Match Go's volume_grpc_client_to_master.go behavior:
1. Only trigger leader redirect when the leader address differs from the
   current master (prevents unnecessary reconnect loops when master confirms
   its own address).
2. Process duplicated_uuids before leader redirect check, matching Go's
   ordering where duplicate UUID detection takes priority.

* Remove SetState version check to match Go behavior

Go's SetState unconditionally applies the state without any version
mismatch check. The Rust version had an extra optimistic concurrency
check that would reject valid requests from Go clients that don't
track versions.

* Fix TTL::read() to normalize via fit_ttl_count matching Go's ReadTTL

Go's ReadTTL calls fitTtlCount which converts to seconds and normalizes
to the coarsest unit that fits in a byte count (e.g. 120m->2h, 7d->1w,
24h->1d). The Rust version was preserving the original unit, producing
different binary encodings on disk and in heartbeat messages.

* Always return Content-MD5 header and JSON field on successful writes

Go always sets Content-MD5 in the response regardless of whether the
request included it. The Rust version was conditionally including it
only when the request provided Content-MD5.

* Include name and size in UploadResult JSON even when empty/zero

Go's encoding/json always includes empty strings and zero values in
the upload response. The Rust version was using skip_serializing_if
to omit them, causing JSON structure differences.

* Include deleted needles in scan_raw_needles_from to match Go

Go's ScanVolumeFileFrom visits ALL needles including deleted ones.
Skipping deleted entries during incremental copy would cause tombstones
to not be propagated, making deleted files reappear on the receiving side.

* Match Go NeedleMap.Delete: always write tombstone to idx file

Go's NeedleMap.Delete unconditionally writes a tombstone entry to the
idx file and updates metrics, even if the needle doesn't exist or is
already deleted. This is important for replication where every delete
operation must produce an idx write. The Rust version was skipping the
tombstone write for non-existent or already-deleted needles.

* Limit MIME type to 255 bytes matching Go's CreateNeedleFromRequest

* Title-case Seaweed-* pair keys to match Go HTTP header canonicalization

* Unify DiskType::Hdd into HardDrive to match Go's single HardDriveType

* Skip tombstone entries in walk_ecx_stats total_size matching Go's Raw()

* Return EMPTY TTL when computed seconds is zero matching Go's fitTtlCount

* Include disk-space-low in Volume.is_read_only() matching Go

* Log error on CIDR parse failure in whitelist matching Go's glog.Errorf

* Log cookie mismatch in gRPC Query matching Go's V(0).Infof

* Fix is_expired volume_size comparison to use < matching Go

Go checks `volumeSize < super_block.SuperBlockSize` (strict less-than),
but Rust used `<=`. This meant Rust would fail to expire a volume that
is exactly SUPER_BLOCK_SIZE bytes.

* Apply Go's JWT expiry defaults: 10s write, 60s read

Go calls v.SetDefault("jwt.signing.expires_after_seconds", 10) and
v.SetDefault("jwt.signing.read.expires_after_seconds", 60). Rust
defaulted to 0 for both, which meant tokens would never expire when
security.toml has a signing key but omits expires_after_seconds.

* Stop [grpc.volume].ca from overriding [grpc].ca matching Go

Go reads the gRPC CA file only from config.GetString("grpc.ca"), i.e.
the [grpc] section. The [grpc.volume] section only provides cert and
key. Rust was also reading ca from [grpc.volume] which would silently
override the [grpc].ca value when both were present.

* Fix free_volume_count to use EC shard count matching Go

Was counting EC volumes instead of EC shards, which underestimates EC
space usage. One EC volume with 14 shards uses ~1.4 volume slots, not 1.
Now uses Go's formula: ((max - volumes) * DataShardsCount - ecShardCount) / DataShardsCount.

* Include preallocate in compaction space check matching Go

Go uses max(preallocate, estimatedCompactSize) for the free space check.
Rust was only using the estimated volume size, which could start a
compaction that fails mid-way if preallocate exceeds the volume size.

* Check gzip magic bytes before setting Content-Encoding matching Go

Go checks both Accept-Encoding contains "gzip" AND IsGzippedContent
(data starts with 0x1f 0x8b) before setting Content-Encoding: gzip.
Rust only checked Accept-Encoding, which could incorrectly declare
gzip encoding for non-gzip compressed data.

* Only set upload response name when needle HasName matching Go

Go checks reqNeedle.HasName() before setting ret.Name. Rust always set
the name from the filename variable, which could return the fid portion
of the path as the name for raw PUT requests without a filename.

* Treat MaxVolumeCount==0 as unlimited matching Go's hasFreeDiskLocation

Go's hasFreeDiskLocation returns true immediately when MaxVolumeCount
is 0, treating it as unlimited. Rust was computing effective_free as
<= 0 for max==0, rejecting the location. This could fail volume
creation during early startup before the first heartbeat adjusts max.

* Read lastAppendAtNs from deleted V3 entries in integrity check

Go's doCheckAndFixVolumeData reads AppendAtNs from both live entries
(verifyNeedleIntegrity) and deleted tombstones (verifyDeletedNeedleIntegrity).
Rust was skipping deleted entries, which could result in a stale
last_append_at_ns if the last index entry is a deletion.

* Return empty body for empty/oversized range requests matching Go

Go's ProcessRangeRequest returns nil (empty body, 200 OK) when
parsed ranges are empty or combined range size exceeds total content
size. The Rust buffered path incorrectly returned the full file data
for both cases. The streaming path already handled this correctly.

* Dispatch ScrubEcVolume by mode matching Go's INDEX/LOCAL/FULL

Go's ScrubEcVolume switches on mode: INDEX calls v.ScrubIndex()
(ecx integrity only), LOCAL calls v.ScrubLocal(), FULL calls
vs.store.ScrubEcVolume(). Rust was ignoring the mode and always
running verify_ec_shards. Now INDEX mode checks ecx index integrity
(sorted overlap detection + file size validation) without shard I/O,
while LOCAL/FULL modes run the existing shard verification.

* Fix TTL test expectation: 7d normalizes to 1w matching Go's fitTtlCount

Go's ReadTTL calls fitTtlCount which normalizes to the coarsest unit
that fits: 7 days = 1 week, so "7d" becomes {Count:1, Unit:Week}
which displays as "1w". Both Go and Rust normalize identically.

* Add version mismatch check to SetState matching Go's State.Update

Go's State.Update compares the incoming version with the stored
version and returns "version mismatch" error if they differ. This
provides optimistic concurrency control. The Rust implementation
was accepting any version unconditionally.

* Use unquoted keys in Query JSON output matching Go's json.ToJson

Go's json.ToJson produces records with unquoted keys like
{score:12} not {"score":12}. This is a custom format used
internally by SeaweedFS for query results.

* Fix TTL test expectation in VolumeNeedleStatus: 7d normalizes to 1w

Same normalization as the HTTP test: Go's ReadTTL calls fitTtlCount
which converts 7 days to 1 week.

* Include ETag header in 304 Not Modified responses matching Go behavior

Go sets ETag on the response writer (via SetEtag) before the
If-Modified-Since and If-None-Match conditional checks, so both
304 response paths include the ETag header. The Rust implementation
was only adding ETag to 200 responses.

* Remove needle-name fallback in chunk manifest filename resolution

Go's tryHandleChunkedFile only falls back from URL filename to
manifest name. Rust had an extra fallback to needle.name that
Go does not perform, which could produce different
Content-Disposition filenames for chunk manifests.

* Validate JWT nbf (Not Before) claim matching Go's jwt-go/v5

Go's jwt.ParseWithClaims validates the nbf claim when present,
rejecting tokens whose nbf is in the future. The Rust jsonwebtoken
crate defaults validate_nbf to false, so tokens with future nbf
were incorrectly accepted.

* Set isHeartbeating to true at startup matching Go's VolumeServer init

Go unconditionally sets isHeartbeating: true in the VolumeServer
struct literal. Rust was starting with false when masters are
configured, causing /healthz to return 503 until the first
heartbeat succeeds.

* Call store.close() on shutdown matching Go's Shutdown()

Go's Shutdown() calls vs.store.Close() which closes all volumes
and flushes file handles. The Rust server was relying on process
exit for cleanup, which could leave data unflushed.

* Include server ID in maintenance mode error matching Go's format

Go returns "volume server %s is in maintenance mode" with the
store ID. Rust was returning a generic "maintenance mode" message.

* Fix DiskType test: use HardDrive variant matching Go's HddType=""

Go maps both "" and "hdd" to HardDriveType (empty string). The
Rust enum variant is HardDrive, not Hdd. The test referenced a
nonexistent Hdd variant causing compilation failure.

* Do not include ETag in 304 responses matching Go's GetOrHeadHandler

Go sets ETag at L235 AFTER the If-Modified-Since and If-None-Match
304 return paths, so Go's 304 responses do not include the ETag header.
The Rust code was incorrectly including ETag in both 304 response paths.

* Return 400 on malformed query strings in PostHandler matching Go's ParseForm

Go's r.ParseForm() returns HTTP 400 with "form parse error: ..." when
the query string is malformed. Rust was silently falling back to empty
query params via unwrap_or_default().

* Load EC volume version from .vif matching Go's NewEcVolume

Go sets ev.Version = needle.Version(volumeInfo.Version) from the .vif
file. Rust was always using Version::current() (V3), which would produce
wrong needle actual size calculations for volumes created with V1 or V2.

* Sync .ecx file before close matching Go's EcVolume.Close

Go calls ev.ecxFile.Sync() before closing to ensure in-place deletion
marks are flushed to disk. Without this, deletion marks written via
MarkNeedleDeleted could be lost on crash.

* Validate SuperBlock extra data size matching Go's Bytes() guard

Go checks extraSize > 256*256-2 and calls glog.Fatalf to prevent
corrupt super block headers. Rust was silently truncating via u16 cast,
which would write an incorrect extra_size field.

* Update quinn-proto 0.11.13 -> 0.11.14 to fix GHSA-6xvm-j4wr-6v98

Fixes Dependency Review CI failure: quinn-proto < 0.11.14 is vulnerable
to unauthenticated remote DoS via panic in QUIC transport parameter
parsing.

* Skip TestMultipartUploadUsesFormFieldsForTimestampAndTTL for Go server

Go's r.FormValue() cannot read multipart text fields after
r.MultipartReader() consumes the body, so ts/ttl sent as multipart
form fields only work with the Rust volume server. Skip this test
when VOLUME_SERVER_IMPL != "rust" to fix CI failure.

* Flush .ecx in EC volume sync_to_disk matching Go's Sync()

Go's EcVolume.Sync() flushes both the .ecj journal and the .ecx index
to disk. The Rust version only flushed .ecj, leaving in-place deletion
marks in .ecx unpersisted until close(). This could cause data
inconsistency if the server crashes after marking a needle deleted in
.ecx but before close().

* Remove .vif file in EC volume destroy matching Go's Destroy()

Go's EcVolume.Destroy() removes .ecx, .ecj, and .vif files. The Rust
version only removed .ecx and .ecj, leaving orphaned .vif files on
disk after EC volume destruction (e.g., after TTL expiry).

* Fix is_expired to use <= for SuperBlockSize check matching Go

Go checks contentSize <= SuperBlockSize to detect empty volumes (no
needles). Rust used < which would incorrectly allow a volume with
exactly SuperBlockSize bytes (header only, no data) to proceed to
the TTL expiry check and potentially be marked as expired.

* Fix read_append_at_ns to read timestamps from tombstone entries

Go reads the full needle body for all entries including tombstones
(deleted needles with size=0) to extract the actual AppendAtNs
timestamp. The Rust version returned 0 early for size <= 0 entries,
which would cause the binary search in incremental copy to produce
incorrect results for positions containing deleted needles.

Now uses get_actual_size to compute the on-disk size (which handles
tombstones correctly) and only returns 0 when the actual size is 0.

* Add X-Request-Id response header matching Go's requestIDMiddleware

Go sets both X-Request-Id and x-amz-request-id response headers.
The Rust server only set x-amz-request-id, missing X-Request-Id.

* Add skip_serializing_if for UploadResult name and size fields

Go's UploadResult uses json:"name,omitempty" and json:"size,omitempty",
omitting these fields from JSON when they are zero values (empty
string / 0). The Rust struct always serialized them, producing
"name":"" and "size":0 where Go would omit them.

* Support JSONP/pretty-print for write success responses

Go's writeJsonQuiet checks for callback (JSONP) and pretty query
parameters on all JSON responses including write success. The Rust
write success path used axum::Json directly, bypassing JSONP and
pretty-print support. Now uses json_result_with_query to match Go.

* Include actual limit in file size limit error message

Go returns "file over the limited %d bytes" with the actual limit
value included. Rust returned a generic "file size limit exceeded"
without the limit value, making it harder to debug.

* Extract extension from 2-segment URL paths for image operations

Go's parseURLPath extracts the file extension from all URL formats
including 2-segment paths like /vid,fid.jpg. The Rust version only
handled 3-segment paths (/vid/fid/filename.ext), so extensions in
2-segment paths were lost. This caused image resize/crop operations
requested via query params to be silently skipped for those paths.

* Add size_hint to TrackedBody so throttled downloads get Content-Length

TrackedBody (used for download throttling) did not implement
size_hint(), causing HTTP/1.1 to fall back to chunked transfer
encoding instead of setting Content-Length. Go always sets
Content-Length explicitly for non-range responses.

* Add Last-Modified, pairs, and S3 headers to chunk manifest responses

Go sets Last-Modified, needle pairs, and S3 pass-through headers on
the response writer BEFORE calling tryHandleChunkedFile. Since the
Rust chunk manifest handler created fresh response headers and
returned early, these headers were missing from chunk manifest
responses. Now passes last_modified_str into the chunk manifest
handler and applies pairs and S3 pass-through query params
(response-cache-control, response-content-encoding, etc.) to the
chunk manifest response headers.

* Fix multipart fallback to use first part data when no filename

Go reads the first part's data unconditionally, then looks for a
part with a filename. If none found, Go uses the first part's data
(with empty filename). Rust only captured parts with filenames, so
when no part had a filename it fell back to the raw multipart body
bytes (including boundary delimiters), producing corrupt needle data.

* Set HasName and HasMime flags for empty values matching Go

Go's CreateNeedleFromRequest sets HasName and HasMime flags even when
the filename or MIME type is empty (len < 256 is true for len 0).
Rust skipped empty values, causing the on-disk needle format to
differ: Go-written needles include extra bytes for the empty name/mime
size fields, changing the serialized needle size in the idx entry.
This ensures binary format compatibility between Go and Rust servers.

* Add is_stopping guard to vacuum_volume_commit matching Go

Go's CommitCompactVolume (store_vacuum.go L53-54) checks
s.isStopping before committing compaction to prevent file
swaps during shutdown. The Rust handler was missing this
check, which could allow compaction commits while the
server is stopping.

* Remove disk_type from required status fields since Go omits it

Go's default DiskType is "" (HardDriveType), and protobuf's omitempty
tag causes empty strings to be dropped from JSON output.

* test: honor rust env in dual volume harness

* grpc: notify master after volume lifecycle changes

* http: proxy to replicas before download-limit timeout

* test: pass readMode to rust volume harnesses

* fix store free-location predicate selection

* fix volume copy disk placement and heartbeat notification

* fix chunk manifest delete replication

* fix write replication to survive client disconnects

* fix download limit proxy and wait flow

* fix crop gating for streamed reads

* fix upload limit wait counter behavior

* fix chunk manifest image transforms

* fix has_resize_ops to check width/height > 0 instead of is_some()

Go's shouldResizeImages condition is `width > 0 || height > 0`, so
`?width=0` correctly evaluates to false. Rust was using `is_some()`
which made `?width=0` evaluate to true, unnecessarily disabling
streaming reads for those requests.

* fix Content-MD5 to only compute and return when provided by client

Go only computes the MD5 of uncompressed data when a Content-MD5
header or multipart field is provided. Rust was always computing and
returning it. Also fix the mismatch error message to include size,
matching Go's format.

* fix save_vif to compute ExpireAtSec from TTL

Go's SaveVolumeInfo always computes ExpireAtSec = now + ttlSeconds
when the volume has a TTL. The save_vif path (used by set_read_only
and set_writable) was missing this computation, causing .vif files
to be written without the correct expiration timestamp for TTL volumes.

* fix set_writable to not modify no_write_can_delete

Go's MarkVolumeWritable only sets noWriteOrDelete=false and persists.
Rust was additionally setting no_write_can_delete=has_remote_file,
which could incorrectly change the write mode for remote-file volumes
when the master explicitly asks to make the volume writable.

* fix write_needle_blob_and_index to error on too-small V3 blob

Go returns an error when the needle blob is too small for timestamp
patching. Rust was silently skipping the patch and writing the blob
with a stale/zero timestamp, which could cause data integrity issues
during incremental replication that relies on AppendAtNs ordering.

* fix VolumeEcShardsToVolume to validate dataShards range

Go validates that dataShards is > 0 and <= MaxShardCount before
proceeding with EC-to-volume reconstruction. Without this check,
a zero or excessively large data_shards value could cause confusing
downstream failures.

* fix destroy to use VolumeError::NotEmpty instead of generic Io error

The dedicated NotEmpty variant exists in the enum but was not being
used. This makes error matching consistent with Go's ErrVolumeNotEmpty.

* fix SetState to persist state to disk with rollback on failure

Go's State.Update saves VolumeServerState to a state.pb file after
each SetState call, and rolls back the in-memory state if persistence
fails. Rust was only updating in-memory atomics, so maintenance mode
would be lost on server restart. Now saves protobuf-encoded state.pb
and loads it on startup.

* fix VolumeTierMoveDatToRemote to close local dat backend after upload

Go calls v.LoadRemoteFile() after saving volume info, which closes
the local DataBackend before transitioning to remote storage. Without
this, the volume holds a stale file handle to the deleted local .dat
file, causing reads to fail until server restart.

* fix VolumeTierMoveDatFromRemote to close remote dat backend after download

Go calls v.DataBackend.Close() and sets DataBackend=nil after removing
the remote file reference. Without this, the stale remote backend
state lingers and reads may not discover the newly downloaded local
.dat file until server restart.

* fix redirect to use internal url instead of public_url

Go's proxyReqToTargetServer builds the redirect Location header from
loc.Url (the internal URL), not publicUrl. Using public_url could
cause redirect failures when internal and external URLs differ.

* fix redirect test and add state_file_path to integration test

Update redirect unit test to expect internal url (matching the
previous fix). Add missing state_file_path field to the integration
test VolumeServerState constructor.

* fix FetchAndWriteNeedle to await all writes before checking errors

Go uses a WaitGroup to await all writes (local + replicas) before
checking errors. Rust was short-circuiting on local write failure,
which could leave replica writes in-flight without waiting for
completion.

* fix shutdown to send deregister heartbeat before pre_stop delay

Go's StopHeartbeat() closes stopChan immediately on interrupt, causing
the heartbeat goroutine to send the deregister heartbeat right away,
before the preStopSeconds delay. Rust was only setting is_stopping=true
without waking the heartbeat loop, so the deregister was delayed until
after the pre_stop sleep. Now we call volume_state_notify.notify_one()
to wake the heartbeat immediately.

* fix heartbeat response ordering to check duplicate UUIDs first

Go processes heartbeat responses in this order: DuplicatedUuids first,
then volume options (prealloc/size limit), then leader redirect. Rust
was applying volume options before checking for duplicate UUIDs, which
meant volume option changes would take effect even when the response
contained a duplicate UUID error that should cause an immediate return.

* the test thread was blocked

* fix(deps): update aws-lc-sys 0.38.0 → 0.39.0 to resolve security advisories

Bumps aws-lc-rs 1.16.1 → 1.16.2, pulling in aws-lc-sys 0.39.0 which
fixes GHSA-394x-vwmw-crm3 (X.509 Name Constraints wildcard/unicode
bypass) and GHSA-9f94-5g5w-gf6r (CRL Distribution Point scope check
logic error).

* fix: match Go Content-MD5 mismatch error message format

Go uses "Content-MD5 did not match md5 of file data expected [X]
received [Y] size Z" while Rust had a shorter format. Match the
exact Go error string so clients see identical messages.

* fix: match Go Bearer token length check (> 7, not >= 7)

Go requires len(bearer) > 7 ensuring at least one char after
"Bearer ". Rust used >= 7 which would accept an empty token.

* fix(deps): drop legacy rustls 0.21 to resolve rustls-webpki GHSA-pwjx-qhcg-rvj4

aws-sdk-s3's default "rustls" feature enables tls-rustls in
aws-smithy-runtime, which pulls in legacy-rustls-ring (rustls 0.21
→ rustls-webpki 0.101.7, moderate CRL advisory). Replace with
explicit default-https-client which uses only rustls 0.23 /
rustls-webpki 0.103.9.

* fix: use uploaded filename for auto-compression extension detection

Go extracts the file extension from pu.FileName (the uploaded
filename) for auto-compression decisions. Rust was using the URL
path, which typically has no extension for SeaweedFS file IDs.

* fix: add CRC legacy Value() backward-compat check on needle read

Go double-checks CRC: n.Checksum != crc && uint32(n.Checksum) !=
crc.Value(). The Value() path is a deprecated transform for compat
with seaweed versions prior to commit 056c480eb. Rust had the
legacy_value() method but wasn't using it in validation.

* fix: remove /stats/* endpoints to match Go (commented out since L130)

Go's volume_server.go has the /stats/counter, /stats/memory, and
/stats/disk endpoints commented out (lines 130-134). Remove them
from the Rust router along with the now-unused whitelist_guard
middleware.

* fix: filter application/octet-stream MIME for chunk manifests

Go's tryHandleChunkedFile (L334) filters out application/octet-stream
from chunk manifest MIME types, falling back to extension-based
detection. Rust was returning the stored MIME as-is for manifests.

* fix: VolumeMarkWritable returns error before notifying master

Go returns early at L200 if MarkVolumeWritable fails, before
reaching the master notification at L206. Rust was notifying master
even on failure, creating inconsistent state where master thinks
the volume is writable but local marking failed.

* fix: check volume existence before maintenance in MarkReadonly/Writable

Go's VolumeMarkReadonly (L239-241) and VolumeMarkWritable (L253-255)
look up the volume first, then call makeVolumeReadonly/Writable which
checks maintenance. Rust was checking maintenance first, returning
"maintenance mode" instead of "not found" for missing volumes.

* feat: implement ScrubVolume mark_broken_volumes_readonly (PR #8360)

Add the mark_broken_volumes_readonly flag from PR #8360:
- Sync proto field (tag 3) to local volume_server.proto
- After scrubbing, if flag is set, call makeVolumeReadonly on each
  broken volume (notify master, mark local readonly, notify again)
- Collect errors via joined error semantics matching Go's errors.Join
- Factor out make_volume_readonly helper reused by both
  VolumeMarkReadonly and ScrubVolume

Also refactors VolumeMarkReadonly to use the shared helper.

* fix(deps): update rustls-webpki 0.103.9 → 0.103.10 (GHSA-pwjx-qhcg-rvj4)

CRL Distribution Point matching logic fix for moderate severity
advisory about CRLs not considered authoritative.

* test: update integration tests for removed /stats/* endpoints

Replace tests that expected /stats/* routes to return 200/401 with
tests confirming they now fall through to the store handler (400),
matching Go's commented-out stats endpoints.

* docs: fix misleading comment about default offset feature

The comment said "4-byte offsets unless explicitly built with 5-byte
support" but the default feature enables 5bytes. This is intentional
for production parity with Go -tags 5BytesOffset builds. Fix the
comment to match reality.
2026-03-26 17:24:35 -07:00
Chris Lu
e5f72077ee fix: resolve CORS cache race condition causing stale 404 responses (#8748)
The metadata subscription handler (updateBucketConfigCacheFromEntry) was
making a separate RPC call via loadCORSFromBucketContent to load CORS
configuration. This created a race window where a slow CreateBucket
subscription event could re-cache stale data after PutBucketCors had
already cleared the cache, causing subsequent GetBucketCors to return
404 NoSuchCORSConfiguration.

Parse CORS directly from the subscription entry's Content field instead
of making a separate RPC. Also fix getBucketConfig to parse CORS from
the already-fetched entry, eliminating a redundant RPC call.

Fix TestCORSCaching to use require.NoError to prevent nil pointer
dereference panics when GetBucketCors fails.
2026-03-23 19:33:20 -07:00
Chris Lu
d5ee35c8df Fix S3 delete for non-empty directory markers (#8740)
* Fix S3 delete for non-empty directory markers

* Address review feedback on directory marker deletes

* Stabilize FUSE concurrent directory operations
2026-03-23 13:35:16 -07:00
Chris Lu
d6a872c4b9 Preserve explicit directory markers with octet-stream MIME (#8726)
* Preserve octet-stream MIME on explicit directory markers

* Run empty directory marker regression in CI

* Run S3 Spark workflow for filer changes
2026-03-21 19:31:56 -07:00
Chris Lu
80f3079d2a fix(s3): include directory markers in ListObjects without delimiter (#8704)
* fix(s3): include directory markers in ListObjects without delimiter (#8698)

Directory key objects (zero-byte objects with keys ending in "/") created
via PutObject were omitted from ListObjects/ListObjectsV2 results when no
delimiter was specified. AWS S3 includes these as regular keys in Contents.

The issue was in doListFilerEntries: when recursing into directories in
non-delimiter mode, directory key objects were only emitted when
prefixEndsOnDelimiter was true. Added an else branch to emit them in the
general recursive case as well.

* remove issue reference from inline comment

* test: add child-under-marker and paginated listing coverage

Extend test 6 to place a child object under the directory marker
and paginate with MaxKeys=1 so the emit-then-recurse truncation
path is exercised.

* fix(test): skip directory markers in Spark temporary artifacts check

The listing check now correctly shows directory markers (keys ending
in "/") after the ListObjects fix. These 0-byte metadata objects are
not data artifacts — filter them from the listing check since the
HeadObject-based check already verifies their cleanup with a timeout.
2026-03-19 15:36:11 -07:00
Chris Lu
9984ce7dcb fix(s3): omit NotResource:null from bucket policy JSON response (#8658)
* fix(s3): omit NotResource:null from bucket policy JSON response (#8657)

Change NotResource from value type to pointer (*StringOrStringSlice) so
that omitempty properly omits it when unset, matching the existing
Principal field pattern. This prevents IaC tools (Terraform, Ansible)
from detecting false configuration drift.

Add bucket policy round-trip idempotency integration tests.

* simplify JSON comparison in bucket policy idempotency test

Use require.JSONEq directly on the raw JSON strings instead of
round-tripping through unmarshal/marshal, since JSONEq already
handles normalization internally.

* fix bucket policy test cases that locked out the admin user

The Deny+NotResource test cases used Action:"s3:*" which denied the
admin's own GetBucketPolicy call. Scope deny to s3:GetObject only,
and add an Allow+NotResource variant instead.

* fix(s3): also make Resource a pointer to fix empty string in JSON

Apply the same omitempty pointer fix to the Resource field, which was
emitting "Resource":"" when only NotResource was set. Add
NewStringOrStringSlicePtr helper, make Strings() nil-safe, and handle
*StringOrStringSlice in normalizeToStringSliceWithError.

* improve bucket policy integration tests per review feedback

- Replace time.Sleep with waitForClusterReady using ListBuckets
- Use structural hasKey check instead of brittle substring NotContains
- Assert specific NoSuchBucketPolicy error code after delete
- Handle single-statement policies in hasKey helper
2026-03-16 12:58:26 -07:00
Chris Lu
8cde3d4486 Add data file compaction to iceberg maintenance (Phase 2) (#8503)
* Add iceberg_maintenance plugin worker handler (Phase 1)

Implement automated Iceberg table maintenance as a new plugin worker job
type. The handler scans S3 table buckets for tables needing maintenance
and executes operations in the correct Iceberg order: expire snapshots,
remove orphan files, and rewrite manifests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add data file compaction to iceberg maintenance handler (Phase 2)

Implement bin-packing compaction for small Parquet data files:
- Enumerate data files from manifests, group by partition
- Merge small files using parquet-go (read rows, write merged output)
- Create new manifest with ADDED/DELETED/EXISTING entries
- Commit new snapshot with compaction metadata

Add 'compact' operation to maintenance order (runs before expire_snapshots),
configurable via target_file_size_bytes and min_input_files thresholds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix memory exhaustion in mergeParquetFiles by processing files sequentially

Previously all source Parquet files were loaded into memory simultaneously,
risking OOM when a compaction bin contained many small files. Now each file
is loaded, its rows are streamed into the output writer, and its data is
released before the next file is loaded — keeping peak memory proportional
to one input file plus the output buffer.

* Validate bucket/namespace/table names against path traversal

Reject names containing '..', '/', or '\' in Execute to prevent
directory traversal via crafted job parameters.

* Add filer address failover in iceberg maintenance handler

Try each filer address from cluster context in order instead of only
using the first one. This improves resilience when the primary filer
is temporarily unreachable.

* Add separate MinManifestsToRewrite config for manifest rewrite threshold

The rewrite_manifests operation was reusing MinInputFiles (meant for
compaction bin file counts) as its manifest count threshold. Add a
dedicated MinManifestsToRewrite field with its own config UI section
and default value (5) so the two thresholds can be tuned independently.

* Fix risky mtime fallback in orphan removal that could delete new files

When entry.Attributes is nil, mtime defaulted to Unix epoch (1970),
which would always be older than the safety threshold, causing the
file to be treated as eligible for deletion. Skip entries with nil
Attributes instead, matching the safer logic in operations.go.

* Fix undefined function references in iceberg_maintenance_handler.go

Use the exported function names (ShouldSkipDetectionByInterval,
BuildDetectorActivity, BuildExecutorActivity) matching their
definitions in vacuum_handler.go.

* Remove duplicated iceberg maintenance handler in favor of iceberg/ subpackage

The IcebergMaintenanceHandler and its compaction code in the parent
pluginworker package duplicated the logic already present in the
iceberg/ subpackage (which self-registers via init()). The old code
lacked stale-plan guards, proper path normalization, CAS-based xattr
updates, and error-returning parseOperations.

Since the registry pattern (default "all") makes the old handler
unreachable, remove it entirely. All functionality is provided by
iceberg.Handler with the reviewed improvements.

* Fix MinManifestsToRewrite clamping to match UI minimum of 2

The clamp reset values below 2 to the default of 5, contradicting the
UI's advertised MinValue of 2. Clamp to 2 instead.

* Sort entries by size descending in splitOversizedBin for better packing

Entries were processed in insertion order which is non-deterministic
from map iteration. Sorting largest-first before the splitting loop
improves bin packing efficiency by filling bins more evenly.

* Add context cancellation check to drainReader loop

The row-streaming loop in drainReader did not check ctx between
iterations, making long compaction merges uncancellable. Check
ctx.Done() at the top of each iteration.

* Fix splitOversizedBin to always respect targetSize limit

The minFiles check in the split condition allowed bins to grow past
targetSize when they had fewer than minFiles entries, defeating the
OOM protection. Now bins always split at targetSize, and a trailing
runt with fewer than minFiles entries is merged into the previous bin.

* Add integration tests for iceberg table maintenance plugin worker

Tests start a real weed mini cluster, create S3 buckets and Iceberg
table metadata via filer gRPC, then exercise the iceberg.Handler
operations (ExpireSnapshots, RemoveOrphans, RewriteManifests) against
the live filer. A full maintenance cycle test runs all operations in
sequence and verifies metadata consistency.

Also adds exported method wrappers (testing_api.go) so the integration
test package can call the unexported handler methods.

* Fix splitOversizedBin dropping files and add source path to drainReader errors

The runt-merge step could leave leading bins with fewer than minFiles
entries (e.g. [80,80,10,10] with targetSize=100, minFiles=2 would drop
the first 80-byte file). Replace the filter-based approach with an
iterative merge that folds any sub-minFiles bin into its smallest
neighbor, preserving all eligible files.

Also add the source file path to drainReader error messages so callers
can identify which Parquet file caused a read/write failure.

* Harden integration test error handling

- s3put: fail immediately on HTTP 4xx/5xx instead of logging and
  continuing
- lookupEntry: distinguish NotFound (return nil) from unexpected RPC
  errors (fail the test)
- writeOrphan and orphan creation in FullMaintenanceCycle: check
  CreateEntryResponse.Error in addition to the RPC error

* go fmt

---------

Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-15 11:27:42 -07:00
Chris Lu
3d9f7f6f81 go 1.25 2026-03-09 23:10:27 -07:00
Chris Lu
992db11d2b iam: add IAM group management (#8560)
* iam: add Group message to protobuf schema

Add Group message (name, members, policy_names, disabled) and
add groups field to S3ApiConfiguration for IAM group management
support (issue #7742).

* iam: add group CRUD to CredentialStore interface and all backends

Add group management methods (CreateGroup, GetGroup, DeleteGroup,
ListGroups, UpdateGroup) to the CredentialStore interface with
implementations for memory, filer_etc, postgres, and grpc stores.
Wire group loading/saving into filer_etc LoadConfiguration and
SaveConfiguration.

* iam: add group IAM response types

Add XML response types for group management IAM actions:
CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser.

* iam: add group management handlers to embedded IAM API

Add CreateGroup, DeleteGroup, GetGroup, ListGroups, AddUserToGroup,
RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, and ListGroupsForUser handlers with
dispatch in ExecuteAction.

* iam: add group management handlers to standalone IAM API

Add group handlers (CreateGroup, DeleteGroup, GetGroup, ListGroups,
AddUserToGroup, RemoveUserFromGroup, AttachGroupPolicy, DetachGroupPolicy,
ListAttachedGroupPolicies, ListGroupsForUser) and wire into DoActions
dispatch. Also add helper functions for user/policy side effects.

* iam: integrate group policies into authorization

Add groups and userGroups reverse index to IdentityAccessManagement.
Populate both maps during ReplaceS3ApiConfiguration and
MergeS3ApiConfiguration. Modify evaluateIAMPolicies to evaluate
policies from user's enabled groups in addition to user policies.
Update VerifyActionPermission to consider group policies when
checking hasAttachedPolicies.

* iam: add group side effects on user deletion and rename

When a user is deleted, remove them from all groups they belong to.
When a user is renamed, update group membership references. Applied
to both embedded and standalone IAM handlers.

* iam: watch /etc/iam/groups directory for config changes

Add groups directory to the filer subscription watcher so group
file changes trigger IAM configuration reloads.

* admin: add group management page to admin UI

Add groups page with CRUD operations, member management, policy
attachment, and enable/disable toggle. Register routes in admin
handlers and add Groups entry to sidebar navigation.

* test: add IAM group management integration tests

Add comprehensive integration tests for group CRUD, membership,
policy attachment, policy enforcement, disabled group behavior,
user deletion side effects, and multi-group membership. Add
"group" test type to CI matrix in s3-iam-tests workflow.

* iam: address PR review comments for group management

- Fix XSS vulnerability in groups.templ: replace innerHTML string
  concatenation with DOM APIs (createElement/textContent) for rendering
  member and policy lists
- Use userGroups reverse index in embedded IAM ListGroupsForUser for
  O(1) lookup instead of iterating all groups
- Add buildUserGroupsIndex helper in standalone IAM handlers; use it
  in ListGroupsForUser and removeUserFromAllGroups for efficient lookup
- Add note about gRPC store load-modify-save race condition limitation

* iam: add defensive copies, validation, and XSS fixes for group management

- Memory store: clone groups on store/retrieve to prevent mutation
- Admin dash: deep copy groups before mutation, validate user/policy exists
- HTTP handlers: translate credential errors to proper HTTP status codes,
  use *bool for Enabled field to distinguish missing vs false
- Groups templ: use data attributes + event delegation instead of inline
  onclick for XSS safety, prevent stale async responses

* iam: add explicit group methods to PropagatingCredentialStore

Add CreateGroup, GetGroup, DeleteGroup, ListGroups, and UpdateGroup
methods instead of relying on embedded interface fallthrough. Group
changes propagate via filer subscription so no RPC propagation needed.

* iam: detect postgres unique constraint violation and add groups index

Return ErrGroupAlreadyExists when INSERT hits SQLState 23505 instead of
a generic error. Add index on groups(disabled) for filtered queries.

* iam: add Marker field to group list response types

Add Marker string field to GetGroupResult, ListGroupsResult,
ListAttachedGroupPoliciesResult, and ListGroupsForUserResult to
match AWS IAM pagination response format.

* iam: check group attachment before policy deletion

Reject DeletePolicy if the policy is attached to any group, matching
AWS IAM behavior. Add PolicyArn to ListAttachedGroupPolicies response.

* iam: include group policies in IAM authorization

Merge policy names from user's enabled groups into the IAMIdentity
used for authorization, so group-attached policies are evaluated
alongside user-attached policies.

* iam: check for name collision before renaming user in UpdateUser

Scan identities and inline policies for newUserName before mutating,
returning EntityAlreadyExists if a collision is found. Reuse the
already-loaded policies instead of loading them again inside the loop.

* test: use t.Cleanup for bucket cleanup in group policy test

* iam: wrap ErrUserNotInGroup sentinel in RemoveGroupMember error

Wrap credential.ErrUserNotInGroup so errors.Is works in
groupErrorToHTTPStatus, returning proper 400 instead of 500.

* admin: regenerate groups_templ.go with XSS-safe data attributes

Regenerated from groups.templ which uses data-group-name attributes
instead of inline onclick with string interpolation.

* iam: add input validation and persist groups during migration

- Validate nil/empty group name in CreateGroup and UpdateGroup
- Save groups in migrateToMultiFile so they survive legacy migration

* admin: use groupErrorToHTTPStatus in GetGroupMembers and GetGroupPolicies

* iam: short-circuit UpdateUser when newUserName equals current name

* iam: require empty PolicyNames before group deletion

Reject DeleteGroup when group has attached policies, matching the
existing members check. Also fix GetGroup error handling in
DeletePolicy to only skip ErrGroupNotFound, not all errors.

* ci: add weed/pb/** to S3 IAM test trigger paths

* test: replace time.Sleep with require.Eventually for propagation waits

Use polling with timeout instead of fixed sleeps to reduce flakiness
in integration tests waiting for IAM policy propagation.

* fix: use credentialManager.GetPolicy for AttachGroupPolicy validation

Policies created via CreatePolicy through credentialManager are stored
in the credential store, not in s3cfg.Policies (which only has static
config policies). Change AttachGroupPolicy to use credentialManager.GetPolicy()
for policy existence validation.

* feat: add UpdateGroup handler to embedded IAM API

Add UpdateGroup action to enable/disable groups and rename groups
via the IAM API. This is a SeaweedFS extension (not in AWS SDK) used
by tests to toggle group disabled status.

* fix: authenticate raw IAM API calls in group tests

The embedded IAM endpoint rejects anonymous requests. Replace
callIAMAPI with callIAMAPIAuthenticated that uses JWT bearer token
authentication via the test framework.

* feat: add UpdateGroup handler to standalone IAM API

Mirror the embedded IAM UpdateGroup handler in the standalone IAM API
for parity.

* fix: add omitempty to Marker XML tags in group responses

Non-truncated responses should not emit an empty <Marker/> element.

* fix: distinguish backend errors from missing policies in AttachGroupPolicy

Return ServiceFailure for credential manager errors instead of masking
them as NoSuchEntity. Also switch ListGroupsForUser to use s3cfg.Groups
instead of in-memory reverse index to avoid stale data. Add duplicate
name check to UpdateGroup rename.

* fix: standalone IAM AttachGroupPolicy uses persisted policy store

Check managed policies from GetPolicies() instead of s3cfg.Policies
so dynamically created policies are found. Also add duplicate name
check to UpdateGroup rename.

* fix: rollback inline policies on UpdateUser PutPolicies failure

If PutPolicies fails after moving inline policies to the new username,
restore both the identity name and the inline policies map to their
original state to avoid a partial-write window.

* fix: correct test cleanup ordering for group tests

Replace scattered defers with single ordered t.Cleanup in each test
to ensure resources are torn down in reverse-creation order:
remove membership, detach policies, delete access keys, delete users,
delete groups, delete policies. Move bucket cleanup to parent test
scope and delete objects before bucket.

* fix: move identity nil check before map lookup and refine hasAttachedPolicies

Move the nil check on identity before accessing identity.Name to
prevent panic. Also refine hasAttachedPolicies to only consider groups
that are enabled and have actual policies attached, so membership in
a no-policy group doesn't incorrectly trigger IAM authorization.

* fix: fail group reload on unreadable or corrupt group files

Return errors instead of logging and continuing when group files
cannot be read or unmarshaled. This prevents silently applying a
partial IAM config with missing group memberships or policies.

* fix: use errors.Is for sql.ErrNoRows comparison in postgres group store

* docs: explain why group methods skip propagateChange

Group changes propagate to S3 servers via filer subscription
(watching /etc/iam/groups/) rather than gRPC RPCs, since there
are no group-specific RPCs in the S3 cache protocol.

* fix: remove unused policyNameFromArn and strings import

* fix: update service account ParentUser on user rename

When renaming a user via UpdateUser, also update ParentUser references
in service accounts to prevent them from becoming orphaned after the
next configuration reload.

* fix: wrap DetachGroupPolicy error with ErrPolicyNotAttached sentinel

Use credential.ErrPolicyNotAttached so groupErrorToHTTPStatus maps
it to 400 instead of falling back to 500.

* fix: use admin S3 client for bucket cleanup in enforcement test

The user S3 client may lack permissions by cleanup time since the
user is removed from the group in an earlier subtest. Use the admin
S3 client to ensure bucket and object cleanup always succeeds.

* fix: add nil guard for group param in propagating store log calls

Prevent potential nil dereference when logging group.Name in
CreateGroup and UpdateGroup of PropagatingCredentialStore.

* fix: validate Disabled field in UpdateGroup handlers

Reject values other than "true" or "false" with InvalidInputException
instead of silently treating them as false.

* fix: seed mergedGroups from existing groups in MergeS3ApiConfiguration

Previously the merge started with empty group maps, dropping any
static-file groups. Now seeds from existing iam.groups before
overlaying dynamic config, and builds the reverse index after
merging to avoid stale entries from overridden groups.

* fix: use errors.Is for filer_pb.ErrNotFound comparison in group loading

Replace direct equality (==) with errors.Is() to correctly match
wrapped errors, consistent with the rest of the codebase.

* fix: add ErrUserNotFound and ErrPolicyNotFound to groupErrorToHTTPStatus

Map these sentinel errors to 404 so AddGroupMember and
AttachGroupPolicy return proper HTTP status codes.

* fix: log cleanup errors in group integration tests

Replace fire-and-forget cleanup calls with error-checked versions
that log failures via t.Logf for debugging visibility.

* fix: prevent duplicate group test runs in CI matrix

The basic lane's -run "TestIAM" regex also matched TestIAMGroup*
tests, causing them to run in both the basic and group lanes.
Replace with explicit test function names.

* fix: add GIN index on groups.members JSONB for membership lookups

Without this index, ListGroupsForUser and membership queries
require full table scans on the groups table.

* fix: handle cross-directory moves in IAM config subscription

When a file is moved out of an IAM directory (e.g., /etc/iam/groups),
the dir variable was overwritten with NewParentPath, causing the
source directory change to be missed. Now also notifies handlers
about the source directory for cross-directory moves.

* fix: validate members/policies before deleting group in admin handler

AdminServer.DeleteGroup now checks for attached members and policies
before delegating to credentialManager, matching the IAM handler guards.

* fix: merge groups by name instead of blind append during filer load

Match the identity loader's merge behavior: find existing group
by name and replace, only append when no match exists. Prevents
duplicates when legacy and multi-file configs overlap.

* fix: check DeleteEntry response error when cleaning obsolete group files

Capture and log resp.Error from filer DeleteEntry calls during
group file cleanup, matching the pattern used in deleteGroupFile.

* fix: verify source user exists before no-op check in UpdateUser

Reorder UpdateUser to find the source identity first and return
NoSuchEntityException if not found, before checking if the rename
is a no-op. Previously a non-existent user renamed to itself
would incorrectly return success.

* fix: update service account parent refs on user rename in embedded IAM

The embedded IAM UpdateUser handler updated group membership but
not service account ParentUser fields, unlike the standalone handler.

* fix: replay source-side events for all handlers on cross-dir moves

Pass nil newEntry to bucket, IAM, and circuit-breaker handlers for
the source directory during cross-directory moves, so all watchers
can clear caches for the moved-away resource.

* fix: don't seed mergedGroups from existing iam.groups in merge

Groups are always dynamic (from filer), never static (from s3.config).
Seeding from iam.groups caused stale deleted groups to persist.
Now only uses config.Groups from the dynamic filer config.

* fix: add deferred user cleanup in TestIAMGroupUserDeletionSideEffect

Register t.Cleanup for the created user so it gets cleaned up
even if the test fails before the inline DeleteUser call.

* fix: assert UpdateGroup HTTP status in disabled group tests

Add require.Equal checks for 200 status after UpdateGroup calls
so the test fails immediately on API errors rather than relying
on the subsequent Eventually timeout.

* fix: trim whitespace from group name in filer store operations

Trim leading/trailing whitespace from group.Name before validation
in CreateGroup and UpdateGroup to prevent whitespace-only filenames.
Also merge groups by name during multi-file load to prevent duplicates.

* fix: add nil/empty group validation in gRPC store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics and invalid persistence.

* fix: add nil/empty group validation in postgres store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil member access and empty-name row inserts.

* fix: add name collision check in embedded IAM UpdateUser

The embedded IAM handler renamed users without checking if the
target name already existed, unlike the standalone handler.

* fix: add ErrGroupNotEmpty sentinel and map to HTTP 409

AdminServer.DeleteGroup now wraps conflict errors with
ErrGroupNotEmpty, and groupErrorToHTTPStatus maps it to
409 Conflict instead of 500.

* fix: use appropriate error message in GetGroupDetails based on status

Return "Group not found" only for 404, use "Failed to retrieve group"
for other error statuses instead of always saying "Group not found".

* fix: use backend-normalized group.Name in CreateGroup response

After credentialManager.CreateGroup may normalize the name (e.g.,
trim whitespace), use group.Name instead of the raw input for
the returned GroupData to ensure consistency.

* fix: add nil/empty group validation in memory store

Guard CreateGroup and UpdateGroup against nil group or empty name
to prevent panics from nil pointer dereference on map access.

* fix: reorder embedded IAM UpdateUser to verify source first

Find the source identity before checking for collisions, matching
the standalone handler's logic. Previously a non-existent user
renamed to an existing name would get EntityAlreadyExists instead
of NoSuchEntity.

* fix: handle same-directory renames in metadata subscription

Replay a delete event for the old entry name during same-directory
renames so handlers like onBucketMetadataChange can clean up stale
state for the old name.

* fix: abort GetGroups on non-ErrGroupNotFound errors

Only skip groups that return ErrGroupNotFound. Other errors (e.g.,
transient backend failures) now abort the handler and return the
error to the caller instead of silently producing partial results.

* fix: add aria-label and title to icon-only group action buttons

Add accessible labels to View and Delete buttons so screen readers
and tooltips provide meaningful context.

* fix: validate group name in saveGroup to prevent invalid filenames

Trim whitespace and reject empty names before writing group JSON
files, preventing creation of files like ".json".

* fix: add /etc/iam/groups to filer subscription watched directories

The groups directory was missing from the watched directories list,
so S3 servers in a cluster would not detect group changes made by
other servers via filer. The onIamConfigChange handler already had
code to handle group directory changes but it was never triggered.

* add direct gRPC propagation for group changes to S3 servers

Groups now have the same dual propagation as identities and policies:
direct gRPC push via propagateChange + async filer subscription.

- Add PutGroup/RemoveGroup proto messages and RPCs
- Add PutGroup/RemoveGroup in-memory cache methods on IAM
- Add PutGroup/RemoveGroup gRPC server handlers
- Update PropagatingCredentialStore to call propagateChange on group mutations

* reduce log verbosity for config load summary

Change ReplaceS3ApiConfiguration log from Infof to V(1).Infof
to avoid noisy output on every config reload.

* admin: show user groups in view and edit user modals

- Add Groups field to UserDetails and populate from credential manager
- Show groups as badges in user details view modal
- Add group management to edit user modal: display current groups,
  add to group via dropdown, remove from group via badge x button

* fix: remove duplicate showAlert that broke modal-alerts.js

admin.js defined showAlert(type, message) which overwrote the
modal-alerts.js version showAlert(message, type), causing broken
unstyled alert boxes. Remove the duplicate and swap all callers
in admin.js to use the correct (message, type) argument order.

* fix: unwrap groups API response in edit user modal

The /api/groups endpoint returns {"groups": [...]}, not a bare array.

* Update object_store_users_templ.go

* test: assert AccessDenied error code in group denial tests

Replace plain assert.Error checks with awserr.Error type assertion
and AccessDenied code verification, matching the pattern used in
other IAM integration tests.

* fix: propagate GetGroups errors in ShowGroups handler

getGroupsPageData was swallowing errors and returning an empty page
with 200 status. Now returns the error so ShowGroups can respond
with a proper error status.

* fix: reject AttachGroupPolicy when credential manager is nil

Previously skipped policy existence validation when credentialManager
was nil, allowing attachment of nonexistent policies. Now returns
a ServiceFailureException error.

* fix: preserve groups during partial MergeS3ApiConfiguration updates

UpsertIdentity calls MergeS3ApiConfiguration with a partial config
containing only the updated identity (nil Groups). This was wiping
all in-memory group state. Now only replaces groups when
config.Groups is non-nil (full config reload).

* fix: propagate errors from group lookup in GetObjectStoreUserDetails

ListGroups and GetGroup errors were silently ignored, potentially
showing incomplete group data in the UI.

* fix: use DOM APIs for group badge remove button to prevent XSS

Replace innerHTML with onclick string interpolation with DOM
createElement + addEventListener pattern. Also add aria-label
and title to the add-to-group button.

* fix: snapshot group policies under RLock to prevent concurrent map access

evaluateIAMPolicies was copying the map reference via groupMap :=
iam.groups under RLock then iterating after RUnlock, while PutGroup
mutates the map in-place. Now copies the needed policy names into
a slice while holding the lock.

* fix: add nil IAM check to PutGroup and RemoveGroup gRPC handlers

Match the nil guard pattern used by PutPolicy/DeletePolicy to
prevent nil pointer dereference when IAM is not initialized.
2026-03-09 11:54:32 -07:00
Chris Lu
78a3441b30 fix: volume balance detection returns multiple tasks per run (#8559)
* fix: volume balance detection now returns multiple tasks per run (#8551)

Previously, detectForDiskType() returned at most 1 balance task per disk
type, making the MaxJobsPerDetection setting ineffective. The detection
loop now iterates within each disk type, planning multiple moves until
the imbalance drops below threshold or maxResults is reached. Effective
volume counts are adjusted after each planned move so the algorithm
correctly re-evaluates which server is overloaded.

* fix: factor pending tasks into destination scoring and use UnixNano for task IDs

- Use UnixNano instead of Unix for task IDs to avoid collisions when
  multiple tasks are created within the same second
- Adjust calculateBalanceScore to include LoadCount (pending + assigned
  tasks) in the utilization estimate, so the destination picker avoids
  stacking multiple planned moves onto the same target disk

* test: add comprehensive balance detection tests for complex scenarios

Cover multi-server convergence, max-server shifting, destination
spreading, pre-existing pending task skipping, no-duplicate-volume
invariant, and parameterized convergence verification across different
cluster shapes and thresholds.

* fix: address PR review findings in balance detection

- hasMore flag: compute from len(results) >= maxResults so the scheduler
  knows more pages may exist, matching vacuum/EC handler pattern
- Exhausted server fallthrough: when no eligible volumes remain on the
  current maxServer (all have pending tasks) or destination planning
  fails, mark the server as exhausted and continue to the next
  overloaded server instead of stopping the entire detection loop
- Return canonical destination server ID directly from createBalanceTask
  instead of resolving via findServerIDByAddress, eliminating the
  fragile address→ID lookup for adjustment tracking
- Fix bestScore sentinel: use math.Inf(-1) instead of -1.0 so disks
  with negative scores (high pending load, same rack/DC) are still
  selected as the best available destination
- Add TestDetection_ExhaustedServerFallsThrough covering the scenario
  where the top server's volumes are all blocked by pre-existing tasks

* test: fix computeEffectiveCounts and add len guard in no-duplicate test

- computeEffectiveCounts now takes a servers slice to seed counts for all
  known servers (including empty ones) and uses an address→ID map from
  the topology spec instead of scanning metrics, so destination servers
  with zero initial volumes are tracked correctly
- TestDetection_NoDuplicateVolumesAcrossIterations now asserts len > 1
  before checking duplicates, so the test actually fails if Detection
  regresses to returning a single task

* fix: remove redundant HasAnyTask check in createBalanceTask

The HasAnyTask check in createBalanceTask duplicated the same check
already performed in detectForDiskType's volume selection loop.
Since detection runs single-threaded (MaxDetectionConcurrency: 1),
no race can occur between the two points.

* fix: consistent hasMore pattern and remove double-counted LoadCount in scoring

- Adopt vacuum_handler's hasMore pattern: over-fetch by 1, check
  len > maxResults, and truncate — consistent truncation semantics
- Remove direct LoadCount penalty in calculateBalanceScore since
  LoadCount is already factored into effectiveVolumeCount for
  utilization scoring; bump utilization weight from 40 to 50 to
  compensate for the removed 10-point load penalty

* fix: handle zero maxResults as no-cap, emit trace after trim, seed empty servers

- When MaxResults is 0 (omitted), treat as no explicit cap instead of
  defaulting to 1; only apply the +1 over-fetch probe when caller
  supplies a positive limit
- Move decision trace emission after hasMore/trim so the trace
  accurately reflects the returned proposals
- Seed serverVolumeCounts from ActiveTopology so servers that have a
  matching disk type but zero volumes are included in the imbalance
  calculation and MinServerCount check

* fix: nil-guard clusterInfo, uncap legacy DetectionFunc, deterministic disk type order

- Add early nil guard for clusterInfo in Detection to prevent panics
  in downstream helpers (detectForDiskType, createBalanceTask)
- Change register.go DetectionFunc wrapper from maxResults=1 to 0
  (no cap) so the legacy code path returns all detected tasks
- Sort disk type keys before iteration so results are deterministic
  when maxResults spans multiple disk types (HDD/SSD)

* fix: don't over-fetch in stateful detection to avoid orphaned pending tasks

Detection registers planned moves in ActiveTopology via AddPendingTask,
so requesting maxResults+1 would create an extra pending task that gets
discarded during trim. Use len(results) >= maxResults as the hasMore
signal instead, which is correct since Detection already caps internally.

* fix: return explicit truncated flag from Detection instead of approximating

Detection now returns (results, truncated, error) where truncated is true
only when the loop stopped because it hit maxResults, not when it ran out
of work naturally. This eliminates false hasMore signals when detection
happens to produce exactly maxResults results by resolving the imbalance.

* cleanup: simplify detection logic and remove redundancies

- Remove redundant clusterInfo nil check in detectForDiskType since
  Detection already guards against nil clusterInfo
- Remove adjustments loop for destination servers not in
  serverVolumeCounts — topology seeding ensures all servers with
  matching disk type are already present
- Merge two-loop min/max calculation into a single loop: min across
  all servers, max only among non-exhausted servers
- Replace magic number 100 with len(metrics) for minC initialization
  in convergence test

* fix: accurate truncation flag, deterministic server order, indexed volume lookup

- Track balanced flag to distinguish "hit maxResults cap" from "cluster
  balanced at exactly maxResults" — truncated is only true when there's
  genuinely more work to do
- Sort servers for deterministic iteration and tie-breaking when
  multiple servers have equal volume counts
- Pre-index volumes by server with per-server cursors to avoid
  O(maxResults * volumes) rescanning on each iteration
- Add truncation flag assertions to RespectsMaxResults test: true when
  capped, false when detection finishes naturally

* fix: seed trace server counts from ActiveTopology to match detection logic

The decision trace was building serverVolumeCounts only from metrics,
missing zero-volume servers seeded from ActiveTopology by Detection.
This could cause the trace to report wrong server counts, incorrect
imbalance ratios, or spurious "too few servers" messages. Pass
activeTopology into the trace function and seed server counts the
same way Detection does.

* fix: don't exhaust server on per-volume planning failure, sort volumes by ID

- When createBalanceTask returns nil, continue to the next volume on
  the same server instead of marking the entire server as exhausted.
  The failure may be volume-specific (not found in topology, pending
  task registration failed) and other volumes on the server may still
  be viable candidates.
- Sort each server's volume slice by VolumeID after pre-indexing so
  volume selection is fully deterministic regardless of input order.

* fix: use require instead of assert to prevent nil dereference panic in CORS test

The test used assert.NoError (non-fatal) for GetBucketCors, then
immediately accessed getResp.CORSRules. When the API returns an error,
getResp is nil causing a panic. Switch to require.NoError/NotNil/Len
so the test stops before dereferencing a nil response.

* fix: deterministic disk tie-breaking and stronger pre-existing task test

- Sort available disks by NodeID then DiskID before scoring so
  destination selection is deterministic when two disks score equally
- Add task count bounds assertion to SkipsPreExistingPendingTasks test:
  with 15 of 20 volumes already having pending tasks, at most 5 new
  tasks should be created and at least 1 (imbalance still exists)

* fix: seed adjustments from existing pending/assigned tasks to prevent over-scheduling

Detection now calls ActiveTopology.GetTaskServerAdjustments() to
initialize the adjustments map with source/destination deltas from
existing pending and assigned balance tasks. This ensures
effectiveCounts reflects in-flight moves, preventing the algorithm
from planning additional moves in the same direction when prior
moves already address the imbalance.

Added GetTaskServerAdjustments(taskType) to ActiveTopology which
iterates pending and assigned tasks, decrementing source servers
and incrementing destination servers for the given task type.
2026-03-08 21:34:03 -07:00
Chris Lu
df5e8210df Implement IAM managed policy operations (#8507)
* feat: Implement IAM managed policy operations (GetPolicy, ListPolicies, DeletePolicy, AttachUserPolicy, DetachUserPolicy)

- Add response type aliases in iamapi_response.go for managed policy operations
- Implement 6 handler methods in iamapi_management_handlers.go:
  - GetPolicy: Lookup managed policy by ARN
  - DeletePolicy: Remove managed policy
  - ListPolicies: List all managed policies
  - AttachUserPolicy: Attach managed policy to user, aggregating inline + managed actions
  - DetachUserPolicy: Detach managed policy from user
  - ListAttachedUserPolicies: List user's attached managed policies
- Add computeAllActionsForUser() to aggregate actions from both inline and managed policies
- Wire 6 new DoActions switch cases for policy operations
- Add comprehensive tests for all new handlers
- Fixes #8506

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback for IAM managed policy operations

- Add parsePolicyArn() helper with proper ARN prefix validation, replacing
  fragile strings.Split parsing in GetPolicy, DeletePolicy, AttachUserPolicy,
  and DetachUserPolicy
- DeletePolicy now detaches the policy from all users and recomputes their
  aggregated actions, preventing stale permissions after deletion
- Set changed=true for DeletePolicy DoActions case so identity updates persist
- Make PolicyId consistent: CreatePolicy now uses Hash(&policyName) matching
  GetPolicy and ListPolicies
- Remove redundant nil map checks (Go handles nil map lookups safely)
- DRY up action deduplication in computeAllActionsForUser with addUniqueActions
  closure
- Add tests for invalid/empty ARN rejection and DeletePolicy identity cleanup

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add integration tests for managed policy lifecycle (#8506)

Add two integration tests covering the user-reported use case where
managed policy operations returned 500 errors:

- TestS3IAMManagedPolicyLifecycle: end-to-end workflow matching the
  issue report — CreatePolicy, ListPolicies, GetPolicy, AttachUserPolicy,
  ListAttachedUserPolicies, idempotent re-attach, DeletePolicy while
  attached (expects DeleteConflict), DetachUserPolicy, DeletePolicy,
  and verification that deleted policy is gone

- TestS3IAMManagedPolicyErrorCases: covers error paths — nonexistent
  policy/user for GetPolicy, DeletePolicy, AttachUserPolicy,
  DetachUserPolicy, and ListAttachedUserPolicies

Also fixes DeletePolicy to reject deletion when policy is still attached
to a user (AWS-compatible DeleteConflictException), and adds the 409
status code mapping for DeleteConflictException in the error response
handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: nil map panic in CreatePolicy, add PolicyId test assertions

- Initialize policies.Policies map in CreatePolicy if nil (prevents panic
  when no policies exist yet); also handle filer_pb.ErrNotFound like other
  callers
- Add PolicyId assertions in TestGetPolicy and TestListPolicies to lock in
  the consistent Hash(&policyName) behavior
- Remove redundant time.Sleep calls from new integration tests (startMiniCluster
  already blocks on waitForS3Ready)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: PutUserPolicy and DeleteUserPolicy now preserve managed policy actions

PutUserPolicy and DeleteUserPolicy were calling computeAggregatedActionsForUser
(inline-only), overwriting ident.Actions and dropping managed policy actions.
Both now call computeAllActionsForUser which unions inline + managed actions.

Add TestManagedPolicyActionsPreservedAcrossInlineMutations regression test:
attaches a managed policy, adds an inline policy (verifies both actions present),
deletes the inline policy, then asserts managed policy actions still persist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: PutUserPolicy verifies user exists before persisting inline policy

Previously the inline policy was written to storage before checking if the
target user exists in s3cfg.Identities, leaving orphaned policy data when
the user was absent. Now validates the user first, returning
NoSuchEntityException immediately if not found.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: prevent stale/lost actions on computeAllActionsForUser failure

- PutUserPolicy: on recomputation failure, preserve existing ident.Actions
  instead of falling back to only the current inline policy's actions
- DeleteUserPolicy: on recomputation failure, preserve existing ident.Actions
  instead of assigning nil (which wiped all permissions)
- AttachUserPolicy: roll back ident.PolicyNames and return error if
  action recomputation fails, keeping identity consistent
- DetachUserPolicy: roll back ident.PolicyNames and return error if
  GetPolicies or action recomputation fails
- Add doc comment on newTestIamApiServer noting it only sets s3ApiConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 14:18:07 -08:00
Chris Lu
10a30a83e1 s3api: add GetObjectAttributes API support (#8504)
* s3api: add error code and header constants for GetObjectAttributes

Add ErrInvalidAttributeName error code and header constants
(X-Amz-Object-Attributes, X-Amz-Max-Parts, X-Amz-Part-Number-Marker,
X-Amz-Delete-Marker) needed by the S3 GetObjectAttributes API.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: implement GetObjectAttributes handler

Add GetObjectAttributesHandler that returns selected object metadata
(ETag, Checksum, StorageClass, ObjectSize, ObjectParts) without
returning the object body. Follows the same versioning and conditional
header patterns as HeadObjectHandler.

The handler parses the X-Amz-Object-Attributes header to determine
which attributes to include in the XML response, and supports
ObjectParts pagination via X-Amz-Max-Parts and X-Amz-Part-Number-Marker.

Ref: https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObjectAttributes.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: register GetObjectAttributes route

Register the GET /{object}?attributes route for the
GetObjectAttributes API, placed before other object query
routes to ensure proper matching.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add integration tests for GetObjectAttributes

Test coverage:
- Basic: simple object with all attribute types
- MultipartObject: multipart upload with parts pagination
- SelectiveAttributes: requesting only specific attributes
- InvalidAttribute: server rejects invalid attribute names
- NonExistentObject: returns NoSuchKey for missing objects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add versioned object test for GetObjectAttributes

Test puts two versions of the same object and verifies that:
- GetObjectAttributes returns the latest version by default
- GetObjectAttributes with versionId returns the specific version
- ObjectSize and VersionId are correct for each version

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: fix combined conditional header evaluation per RFC 7232

Per RFC 7232:
- Section 3.4: If-Unmodified-Since MUST be ignored when If-Match is
  present (If-Match is the more accurate replacement)
- Section 3.3: If-Modified-Since MUST be ignored when If-None-Match is
  present (If-None-Match is the more accurate replacement)

Previously, all four conditional headers were evaluated independently.
This caused incorrect 412 responses when If-Match succeeded but
If-Unmodified-Since failed (should return 200 per AWS S3 behavior).

Fix applied to both validateConditionalHeadersForReads (GET/HEAD) and
validateConditionalHeaders (PUT) paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add conditional header combination tests for GetObjectAttributes

Test the RFC 7232 combined conditional header semantics:
- If-Match=true + If-Unmodified-Since=false => 200 (If-Unmodified-Since ignored)
- If-None-Match=false + If-Modified-Since=true => 304 (If-Modified-Since ignored)
- If-None-Match=true + If-Modified-Since=false => 200 (If-Modified-Since ignored)
- If-Match=true + If-Unmodified-Since=true => 200
- If-Match=false => 412 regardless

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: document Checksum attribute as not yet populated

Checksum is accepted in validation (so clients requesting it don't get
a 400 error, matching AWS behavior for objects without checksums) but
SeaweedFS does not yet store S3 checksums. Add a comment explaining
this and noting where to populate it when checksum storage is added.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add s3:GetObjectAttributes IAM action for ?attributes query

Previously, GET /{object}?attributes resolved to s3:GetObject via the
fallback path since resolveFromQueryParameters had no case for the
"attributes" query parameter.

Add S3_ACTION_GET_OBJECT_ATTRIBUTES constant ("s3:GetObjectAttributes")
and a branch in resolveFromQueryParameters to return it for GET requests
with the "attributes" query parameter, so IAM policies can distinguish
GetObjectAttributes from GetObject.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: evaluate conditional headers after version resolution

Move conditional header evaluation (If-Match, If-None-Match, etc.) to
after the version resolution step in GetObjectAttributesHandler. This
ensures that when a specific versionId is requested, conditions are
checked against the correct version entry rather than always against
the latest version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: use bounded HTTP client in GetObjectAttributes tests

Replace http.DefaultClient with a timeout-aware http.Client (10s) in
the signedGetObjectAttributes helper and testGetObjectAttributesInvalid
to prevent tests from hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: check attributes query before versionId in action resolver

Move the GetObjectAttributes action check before the versionId check
in resolveFromQueryParameters. This fixes GET /bucket/key?attributes&versionId=xyz
being incorrectly classified as s3:GetObjectVersion instead of
s3:GetObjectAttributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: add tests for versioned conditional headers and action resolver

Add integration test that verifies conditional headers (If-Match,
If-None-Match) are evaluated against the requested version entry, not
the latest version. This covers the fix in 55c409dec.

Add unit test for ResolveS3Action verifying that the attributes query
parameter takes precedence over versionId, so GET ?attributes&versionId
resolves to s3:GetObjectAttributes. This covers the fix in b92c61c95.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: guard negative chunk indices and rename PartsCount field

Add bounds checks for b.StartChunk >= 0 and b.EndChunk >= 0 in
buildObjectAttributesParts to prevent panics from corrupted metadata
with negative index values.

Rename ObjectAttributesParts.PartsCount to TotalPartsCount to match
the AWS SDK v2 Go field naming convention, while preserving the XML
element name "PartsCount" via the struct tag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* s3api: reject malformed max-parts and part-number-marker headers

Return ErrInvalidMaxParts and ErrInvalidPartNumberMarker when the
X-Amz-Max-Parts or X-Amz-Part-Number-Marker headers contain
non-integer or negative values, matching ListObjectPartsHandler
behavior. Previously these were silently ignored with defaults.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 12:52:09 -08:00
blitt001
3d81d5bef7 Fix S3 signature verification behind reverse proxies (#8444)
* Fix S3 signature verification behind reverse proxies

When SeaweedFS is deployed behind a reverse proxy (e.g. nginx, Kong,
Traefik), AWS S3 Signature V4 verification fails because the Host header
the client signed with (e.g. "localhost:9000") differs from the Host
header SeaweedFS receives on the backend (e.g. "seaweedfs:8333").

This commit adds a new -s3.externalUrl parameter (and S3_EXTERNAL_URL
environment variable) that tells SeaweedFS what public-facing URL clients
use to connect. When set, SeaweedFS uses this host value for signature
verification instead of the Host header from the incoming request.

New parameter:
  -s3.externalUrl  (flag) or S3_EXTERNAL_URL (environment variable)
  Example: -s3.externalUrl=http://localhost:9000
  Example: S3_EXTERNAL_URL=https://s3.example.com

The environment variable is particularly useful in Docker/Kubernetes
deployments where the external URL is injected via container config.
The flag takes precedence over the environment variable when both are set.

At startup, the URL is parsed and default ports are stripped to match
AWS SDK behavior (port 80 for HTTP, port 443 for HTTPS), so
"http://s3.example.com:80" and "http://s3.example.com" are equivalent.

Bugs fixed:
- Default port stripping was removed by a prior PR, causing signature
  mismatches when clients connect on standard ports (80/443)
- X-Forwarded-Port was ignored when X-Forwarded-Host was not present
- Scheme detection now uses proper precedence: X-Forwarded-Proto >
  TLS connection > URL scheme > "http"
- Test expectations for standard port stripping were incorrect
- expectedHost field in TestSignatureV4WithForwardedPort was declared
  but never actually checked (self-referential test)

* Add Docker integration test for S3 proxy signature verification

Docker Compose setup with nginx reverse proxy to validate that the
-s3.externalUrl parameter (or S3_EXTERNAL_URL env var) correctly
resolves S3 signature verification when SeaweedFS runs behind a proxy.

The test uses nginx proxying port 9000 to SeaweedFS on port 8333,
with X-Forwarded-Host/Port/Proto headers set. SeaweedFS is configured
with -s3.externalUrl=http://localhost:9000 so it uses "localhost:9000"
for signature verification, matching what the AWS CLI signs with.

The test can be run with aws CLI on the host or without it by using
the amazon/aws-cli Docker image with --network host.

Test covers: create-bucket, list-buckets, put-object, head-object,
list-objects-v2, get-object, content round-trip integrity,
delete-object, and delete-bucket — all through the reverse proxy.

* Create s3-proxy-signature-tests.yml

* fix CLI

* fix CI

* Update s3-proxy-signature-tests.yml

* address comments

* Update Dockerfile

* add user

* no need for fuse

* Update s3-proxy-signature-tests.yml

* debug

* weed mini

* fix health check

* health check

* fix health checking

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Chris Lu <chris.lu@gmail.com>
2026-02-26 14:20:42 -08:00
Chris Lu
a92e9baddf Add integration test for multipart operations inheriting s3:PutObject permissions
TestS3MultipartOperationsInheritPutObjectPermissions verifies that multipart
upload operations (CreateMultipartUpload, UploadPart, ListParts,
CompleteMultipartUpload, AbortMultipartUpload, ListMultipartUploads) work
correctly when a user has only s3:PutObject permission granted.

This test validates the behavior where multipart operations are implicitly
granted when s3:PutObject is authorized, as multipart upload is an
implementation detail of putting objects in S3.
2026-02-25 13:38:14 -08:00
Chris Lu
b565a0cc86 Adds volume.merge command with deduplication and disk-based backend (#8441)
* Enhance volume.merge command with deduplication and disk-based backend

* Fix copyVolume function call with correct argument order and missing bool parameter

* Revert "Fix copyVolume function call with correct argument order and missing bool parameter"

This reverts commit 7b4a190643.

* Fix critical issues: per-replica writable tracking, tail goroutine cancellation via done channel, and debug logging for allocation failures

* Optimize memory usage with watermark approach for duplicate detection

* Fix critical issues: swap copyVolume arguments, increase idle timeout, remove file double-close, use glog for logging

* Replace temporary file with in-memory buffer for needle blob serialization

* test(volume.merge): Add comprehensive unit and integration tests

Add 7 unit tests covering:
- Ordering by timestamp
- Cross-stream duplicate deduplication
- Empty stream handling
- Complex multi-stream deduplication
- Single stream passthrough
- Large needle ID support
- LastModified fallback when timestamp unavailable

Add 2 integration validation tests:
- TestMergeWorkflowValidation: Documents 9-stage merge workflow
- TestMergeEdgeCaseHandling: Validates 10 edge case handling

All tests passing (9/9)

* fix(volume.merge): Use time window for deduplication to handle clock skew

The same needle ID can have different timestamps on different servers due to
clock skew and replication lag. Needles with the same ID within a 5-second
time window are now treated as duplicates (same write with timestamp variance).

Key changes:
- Add mergeDeduplicationWindowNs constant (5 seconds)
- Replace exact timestamp matching with time window comparison
- Use windowInitialized flag to properly detect window transitions
- Add TestMergeNeedleStreamsTimeWindowDeduplication test

This ensures that replicated writes with slight timestamp differences are
properly deduplicated during merge, while separate updates to the same file
ID (outside the window) are preserved.

All tests passing (10/10)

* test: Add volume.merge integration tests with 5 comprehensive test cases

* test: integration tests for volume.merge command

* Fix integration tests: use TripleVolumeCluster for volume.merge testing

- Created new TripleVolumeCluster framework (cluster_triple.go) with 3 volume servers
- Rebuilt weed binary with volume.merge command compiled in
- Updated all 5 integration tests to use TripleVolumeCluster instead of DualVolumeCluster
- Tests now properly allocate volumes on 2 servers and let merge allocate on 3rd
- All 5 integration tests now pass:
  - TestVolumeMergeBasic
  - TestVolumeMergeReadonly
  - TestVolumeMergeRestore
  - TestVolumeMergeTailNeedles
  - TestVolumeMergeDivergentReplicas

* Refactor test framework: use parameterized server count instead of hardcoded

- Renamed TripleVolumeCluster to MultiVolumeCluster with serverCount parameter
- Replaced hardcoded volumePort0/1/2 with slices for flexible server count
- Updated StartTripleVolumeCluster as backward-compatible wrapper calling StartMultiVolumeCluster(t, profile, 3)
- Made directory creation, port allocation, and server startup loop-based
- Updated accessor methods (VolumeAdminAddress, VolumeGRPCAddress, etc.) to support any server count
- All 5 integration tests continue to pass with new parameterized cluster framework
- Enables future testing with 2, 4, 5+ volume servers by calling StartMultiVolumeCluster directly

* Consolidate cluster frameworks: StartDualVolumeCluster now uses MultiVolumeCluster

- Made DualVolumeCluster a type alias for MultiVolumeCluster
- Updated StartDualVolumeCluster to call StartMultiVolumeCluster(t, profile, 2)
- Removed duplicate code from cluster_dual.go (now just 17 lines)
- All existing tests using StartDualVolumeCluster continue to work without changes
- Backward compatible: existing code continues to use the old function signatures
- Added wrapper functions in cluster_multi.go for StartTripleVolumeCluster
- Enables unified cluster management across all test suites

* Address PR review comments: improve error handling and clean up code

- Replace parse error swallow with proper error return
- Log cleanup and restoration errors instead of silently discarding them
- Remove unused offset field from memoryBackendFile struct
- Fix WriteAt buffer truncation bug to preserve trailing bytes
- All unit tests passing (10/10)
- Code compiles successfully

* Fix PR review findings: test improvements and code quality

- Add timeout to runWeedShell to prevent hanging
- Add server 1 readonly status verification in tests
- Assert merge fails when replicas writable (not just log output)
- Replace sleep with polling for writable restoration check
- Fix WriteAt stale data snapshot bug in memoryBackendFile
- Fix startVolume error logging to show current server log
- Fix volumePubPorts double assignment in port allocation
- Rename test to reflect behavior: DoesNotDeduplicateAcrossWindows
- Fix misleading dedup window comment

Unit tests: 10/10 passing
Binary: Compiles successfully

* Fix test assumption: merge command marks volumes readonly automatically

TestVolumeMergeReadonly was expecting merge to fail on writable volumes, but the
merge command is designed to mark volumes readonly as part of its operation. Fixed
test to verify merge succeeds on writable volumes and properly restores writable
state afterward. Removed redundant Test 2 code that duplicated the new behavior.

* fmt

* Fix deduplication logic to correctly handle same-stream vs cross-stream duplicates

The dedup map previously used only NeedleId as key, causing same-stream
overwrites to be incorrectly skipped as duplicates. Changed to track which
stream first processed each needle ID in the current window:

- Cross-stream duplicates (same ID from different streams, within window) are skipped
- Same-stream duplicates (overwrites from same stream) are kept
- Map now stores: needleId -> streamIndex of first occurrence in window

Added TestMergeNeedleStreamsSameStreamDuplicates to verify same-stream
overwrites are preserved while cross-stream duplicates are skipped.

All unit tests passing (11/11)
Binary compiles successfully
2026-02-25 10:12:09 -08:00
Chris Lu
e4b70c2521 go fix 2026-02-20 18:42:00 -08:00